ARCH Text datasets

Overview

ARCH text datasets allow the user to find, extract, and analyze the text contents of documents in a collection. These files can be used alongside a number of different text analysis methods and techniques, including sentiment analysis, named entity recognition, word frequency, collocation, n-grams, topic modeling, geoparsing, and word differences.

On this page:

Datasets

Named entities

Named entities datasets contain the persons, organizations, geographic locations, and dates from each text-bearing resource in a collection, organized in JSON format by source document. ARCH currently provides named entities, in English or Chinese, extracted from web pages and from text transcribed from images and speech. πŸ“š Learn more and see examples.
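Because each source document arrives as one JSON object, tallying entity mentions takes only the standard library. A minimal sketch follows; the record structure and field names here (`url`, `named_entities`, the entity-type keys) are illustrative assumptions, not ARCH's documented schema.

```python
import json
from collections import Counter

# Hypothetical named-entity record, shaped like one JSON object per
# source document; the exact field names in ARCH output may differ.
record = json.loads("""
{
  "url": "https://example.org/page.html",
  "named_entities": {
    "PERSON": ["Ada Lovelace", "Alan Turing", "Ada Lovelace"],
    "ORG": ["Internet Archive"],
    "LOC": ["San Francisco"]
  }
}
""")

# Tally how often each person is mentioned in this document.
person_counts = Counter(record["named_entities"]["PERSON"])
print(person_counts.most_common(1))  # [('Ada Lovelace', 2)]
```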

Plain text of web pages

This dataset lists the location, technical metadata, and extracted full-text contents of each HTML web page or other text-bearing document in a web archive collection. The CSV file presents data in the following columns: crawl_date, last_modified_date, domain, url, mime_type (as provided by the web server and as detected by Apache Tika), and content. πŸ“₯ Download an example. πŸ“š Read the code.
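A CSV with these columns can be explored directly with Python's standard `csv` module. The sketch below counts pages per domain against a two-row stand-in file; the column names used are taken from the list above, but the sample values are invented for illustration.

```python
import csv
import io
from collections import Counter

# A two-row stand-in for an ARCH "plain text of web pages" CSV.
sample = io.StringIO(
    "crawl_date,domain,url,mime_type,content\n"
    "20230101,example.org,https://example.org/a.html,text/html,Hello world\n"
    "20230102,example.org,https://example.org/b.html,text/html,Another page\n"
)

# Count how many captured pages belong to each domain.
pages_per_domain = Counter(row["domain"] for row in csv.DictReader(sample))
print(pages_per_domain)  # Counter({'example.org': 2})
```

For a real dataset, replace the `io.StringIO` stand-in with `open("dataset.csv", newline="")`.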

Speech recognition

This dataset includes the text transcribed from speech in collection audio and video documents. It uses the Whisper model to perform automatic speech recognition (ASR). Output: one or more JSONL files comprising a JSON object for each input record. πŸ“₯ Download an example. πŸ“š Read the code.
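JSONL output means one JSON object per line, so the file can be streamed line by line. A minimal sketch follows; the `url` and `transcript` field names are assumptions for illustration, not ARCH's documented schema.

```python
import json

# Two stand-in JSONL lines, one JSON object per input record.
jsonl_text = (
    '{"url": "https://example.org/talk.mp4", "transcript": "Welcome everyone."}\n'
    '{"url": "https://example.org/interview.mp3", "transcript": "Thanks for having me."}\n'
)

# Parse each non-empty line into a dict.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(len(records))  # 2
```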

Text files dataset

The text files dataset lists locations, metadata, and extracted text contents for text-based web documents in the following formats: CSS, JSON, XML, plain text, JS, and HTML. This dataset is packaged as a CSV file with the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type (as provided by the web server and as detected by Apache Tika), md5 and sha1 checksum values, and content. πŸ“š Read the code.
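The md5 and sha1 columns let you confirm that a file's content arrived intact. The sketch below recomputes both digests with the standard `hashlib` module for a stand-in payload; the `verify` helper is a hypothetical name introduced here, not part of ARCH.

```python
import hashlib

# Stand-in file content, as it might appear in the dataset's content column.
content = b"body { color: black; }"
md5 = hashlib.md5(content).hexdigest()
sha1 = hashlib.sha1(content).hexdigest()

def verify(payload: bytes, expected_md5: str, expected_sha1: str) -> bool:
    """Return True only when both digests match the recorded values."""
    return (hashlib.md5(payload).hexdigest() == expected_md5
            and hashlib.sha1(payload).hexdigest() == expected_sha1)

print(verify(content, md5, sha1))          # True
print(verify(b"tampered", md5, sha1))      # False
```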

Text recognition

This dataset includes the text recognized and transcribed from images in a collection, including handwriting. It uses the TrOCR model to process text and perform optical character recognition (OCR). Output: one or more JSONL files comprising a JSON object for each input record. πŸ“₯ Download an example. πŸ“š Read the code.
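OCR output often includes records where no text was recognized, so a common first step is filtering those out. A minimal sketch, again assuming illustrative `url` and `text` field names rather than ARCH's documented schema:

```python
import json

# Stand-in OCR records, one JSON object per input image.
lines = [
    '{"url": "https://example.org/scan1.png", "text": "Dear committee,"}',
    '{"url": "https://example.org/blank.png", "text": ""}',
]

# Keep only records where the model actually recognized some text.
recognized = [r for r in map(json.loads, lines) if r["text"].strip()]
print(len(recognized))  # 1
```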

Example uses

  1. Queer Webcomics Archives Project - an Archives Unleashed Datathon project that explored the representation of queer identities within the Global Webcomics web archive collection. A variety of methods and visualizations were produced while analyzing text and domains datasets.

  2. Autism Discourse in the U.S.: An Exploratory Analysis - another Archives Unleashed datathon project that explored the discourse among autism bloggers in the Autism and Alzheimers web archive collection, using two primary modes of analysis: sentiment analysis (using NLTK) and network analysis (Gephi).

  3. #teamwildfyre - an Archives Unleashed datathon project that explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full text dataset files, with a focus on named entity recognition (NER) and geocoding.

  4. Creating Collection Growth Curves with AUT and Hypercane - these blog posts describe the process of using various tools to explore web archives collection growth curves. This can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."

[Image: Austism_text.png] Example text sentiment analysis from the Autism and Alzheimers web archive collection

Tool recommendations

The following software tools and libraries can be used to parse and analyze the text extracted from web archive collections.

To learn more and practice exploring ARCH text datasets, see our tutorials:

Natural language processing (NLP)

Named entity recognition

  • Stanford NER (Stanford Named Entity Recognizer) - Command line tool written in Java for labeling recognized word sequences as persons, organizations, or locations. This is the tool that ARCH uses to create named entity datasets, but it can also be applied directly to other text datasets.

Topic modeling

  • BERTopic - Richly featured Python library for topic modeling, built on transformer embeddings.

  • CorEx Topic Model - Python library for topic modeling, optimized for guided domain specificity through the use of "anchor words."

  • MALLET - Java-based command line package for topic modeling and other machine learning applications.