Overview
ARCH text datasets allow users to find, extract, and analyze the text contents of HTML pages and other documents in a web archive. These files can be used with a wide range of text analysis methods and techniques, including sentiment analysis, named entity recognition, word frequency and collocation analysis, n-grams (word and phrase frequency over time), topic modeling, geoparsing, and word differences.
On this page:
- Datasets
- Example uses
- Tool recommendations
Datasets
Named entities
Named entities datasets contain the persons, organizations, geographic locations, and dates found in each text-bearing resource in a web archive collection, organized in JSON format by originating URL and timestamp. ARCH currently supports named entity extraction for English and Chinese text. 📚 Learn more and see examples.
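For example, if your named entities download is formatted as JSON Lines with one record per resource, a few lines of Python can surface the most frequently mentioned people in a collection. This is a minimal sketch: the file name and field names below (persons, organizations, locations, dates) are illustrative assumptions, so check them against your actual dataset before running.

```python
import json
from collections import Counter

# Count the most frequent person names across a named entities dataset.
# Assumes (hypothetically) a JSON Lines file where each record looks like:
#   {"url": ..., "timestamp": ..., "persons": [...], "organizations": [...],
#    "locations": [...], "dates": [...]}
# Verify the field names in your own ARCH download first.
person_counts = Counter()
with open("collection-named-entities.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        person_counts.update(record.get("persons", []))

for name, count in person_counts.most_common(20):
    print(f"{count:6d}  {name}")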
Plain text of web pages
This dataset lists the location, technical metadata, and extracted full-text contents of each HTML page and other text-bearing documents in a web archive collection. The CSV file presents the data in the following columns: crawl_date, last_modified_date, domain, url, mime_type as provided by the web server and as detected by Apache Tika, and content. 📥 Download an example. 📚 Read the code.
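As a quick start, the CSV can be loaded with pandas to compute simple word frequencies over the extracted text. A sketch, assuming a placeholder file name; the two MIME type columns may carry different names in the actual download (e.g. mime_type_web_server and mime_type_tika), so verify the header first.

```python
import pandas as pd
from collections import Counter

# Load a plain text of web pages dataset and tally crude word frequencies.
# The file name is a placeholder for your own ARCH download; pandas
# infers gzip compression from the .gz suffix.
df = pd.read_csv("collection-web-pages.csv.gz")

counts = Counter()
for text in df["content"].dropna():
    counts.update(text.lower().split())

print(counts.most_common(25))
```

For very large collections, consider reading the file in chunks (pd.read_csv(..., chunksize=...)) rather than loading everything into memory at once.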
Text files dataset
The text files dataset lists the locations, metadata, and extracted text contents of text-based web documents in the following formats: CSS, JSON, XML, plain text, JS, and HTML. This dataset is packaged as a CSV file with the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, md5 and sha1 checksum values, and content. 📚 Read the code.
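Because each record carries a file extension and MIME types, it is straightforward to profile a collection by format and pull out one format for downstream parsing. A minimal sketch, with a placeholder file name and the column names as described above:

```python
import pandas as pd

# Summarize a text files dataset by format. The file name is a
# placeholder; confirm column names (extension, content, etc.)
# against your own CSV header.
df = pd.read_csv("collection-text-files.csv.gz")
print(df["extension"].value_counts())

# Pull out only the JSON documents for downstream parsing.
json_docs = df[df["extension"] == "json"]
print(len(json_docs), "JSON files")
```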
Example uses
- Queer Webcomics Archives Project - an Archives Unleashed datathon project that explored the representation of queer identities within the Global Webcomics web archive collection. A variety of methods and visualizations were produced while analyzing text and domains datasets.
- Autism Discourse in the U.S.: An Exploratory Analysis - another Archives Unleashed datathon project, which explored the discourse among autism bloggers in the Autism and Alzheimers web archive collection using two primary modes of analysis: sentiment analysis (with NLTK; see the sketch below) and network analysis (with Gephi).
- #teamwildfyre - an Archives Unleashed datathon project that explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full-text dataset files, with a focus on named entity recognition (NER) and geocoding.
- Creating Collection Growth Curves with AUT and Hypercane - these blog posts describe the process of using various tools to explore web archive collection growth curves, which can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."
[Figure: Example text sentiment analysis from the Autism and Alzheimers web archive collection]
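Below is a minimal sketch of the NLTK sentiment approach referenced above, applied to the content column of a plain text of web pages dataset. The file name is a placeholder, texts are truncated to their first 5,000 characters to keep scoring fast, and VADER's compound score is averaged per domain:

```python
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of the VADER lexicon used by the analyzer.
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Placeholder file name; column names follow the dataset description above.
df = pd.read_csv("collection-web-pages.csv.gz")

# Compound score ranges from -1 (most negative) to +1 (most positive).
df["sentiment"] = df["content"].fillna("").map(
    lambda text: sia.polarity_scores(text[:5000])["compound"]
)
print(df.groupby("domain")["sentiment"].mean().sort_values())
```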
Tool recommendations
The following software tools and libraries can be used to parse and analyze text extracted from web archive collections.
To learn more and practice exploring ARCH text datasets, see our tutorials:
- How to mine text from a web archive collection with Voyant
- Explore web archive data from the command line with Jupyter Notebooks
Natural language processing (NLP)
- AntConc - Desktop corpus analysis toolkit for concordancing, word frequency, and keyword analysis.
- NLTK (Natural Language Toolkit) - Widely used Python library for NLP tasks such as tokenization, tagging, classification, and sentiment analysis.
- spaCy - Python library for production-grade NLP, including tokenization, part-of-speech tagging, and named entity recognition (see the sketch after this list).
- Voyant - Browser-based NLP platform with web-enabled analysis and visualization modules optimized for digital humanities applications.
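To see what named entity recognition looks like in practice with one of these libraries, the sketch below runs spaCy's pretrained English pipeline over a sample sentence. It assumes you have installed the small English model first (python -m spacy download en_core_web_sm):

```python
import spacy

# Load spaCy's small pretrained English pipeline.
nlp = spacy.load("en_core_web_sm")

text = "The Internet Archive was founded by Brewster Kahle in San Francisco in 1996."
doc = nlp(text)

# Each entity carries its surface text and a label such as PERSON, ORG, or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```

In a real workflow you would feed rows from the content column of a text dataset through nlp instead of a sample sentence.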
Named entity recognition
- Stanford NER (Stanford Named Entity Recognizer) - Command line tool written in Java for labeling recognized word sequences as persons, organizations, or locations. This is the tool that ARCH uses to create named entity datasets, but it can also be run directly on other text datasets.
Topic modeling
- BERTopic - Richly featured Python library for topic modeling built on transformer embeddings (see the sketch after this list).
- CorEx Topic Model - Python package for topic modeling, optimized for guided domain specificity through the use of "anchor words."
- MALLET - Java-based command line package for topic modeling and other machine learning applications.
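As a sketch of how a topic modeling run might look with BERTopic, the example below fits a model over the content column of a plain text dataset. The file and column names are placeholders, and BERTopic works best with at least a few hundred documents:

```python
import pandas as pd
from bertopic import BERTopic

# Placeholder file name; column names follow the dataset description above.
df = pd.read_csv("collection-web-pages.csv.gz")
docs = df["content"].dropna().tolist()

# Fit a topic model; fit_transform returns a topic id per document
# along with topic probabilities.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Inspect the largest discovered topics and their keywords.
print(topic_model.get_topic_info().head(10))
```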