ARCH Text datasets

Karl Blumenthal

Updated January 15, 2026 15:24

Overview

ARCH text datasets allow the user to find, extract, and analyze the text contents of documents in a collection. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, word frequency, collocation, n-grams, topic modelling, geoparsing, and word differences.

On this page:

Datasets
Example uses
Tool recommendations

Datasets

Extracted text

Text extracted from collection documents in PDF, DOC, TXT, or other text-bearing file formats. Output: one JSONL file comprising a JSON object for each input record. Values include filename and path, MIME type, and extracted text. 📥 Download an example. 📚 Read the code.
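Because each line of the JSONL file is a standalone JSON object, the dataset can be streamed record by record without loading the whole file into memory. The following is a minimal sketch using only the Python standard library. The key names `filename`, `mime_type`, and `text` are assumptions made for illustration and may not match the keys in an actual ARCH file, so inspect a real record first.

```python
import io
import json


def iter_jsonl(lines):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {line_number} is not valid JSON: {err}") from err


# Hypothetical two-record sample; the key names are assumptions.
sample = io.StringIO(
    '{"filename": "report.pdf", "mime_type": "application/pdf", "text": "Annual report"}\n'
    '{"filename": "notes.txt", "mime_type": "text/plain", "text": "Meeting notes"}\n'
)

records = list(iter_jsonl(sample))
# Map each file name to its extracted text for later analysis.
text_by_file = {record["filename"]: record["text"] for record in records}
```

With a downloaded dataset, the same function can be given an open file handle (for example, one returned by `open(path, encoding="utf-8")`) in place of the in-memory sample.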

Named entities

Named entities datasets contain the persons, organizations, geographic locations, and dates found in text-bearing resources in a collection, organized in JSON format by source document. ARCH currently provides named entities in English or Chinese, drawn from web pages and from text transcribed from images and speech. 📚 Learn more and see examples.

Plain text of web pages

This dataset lists the location, technical metadata, and extracted full text of each HTML page or other text-bearing document within a web archive collection. 📥 Download an example. 📚 Read the code. The CSV file presents data in the following columns:

crawl_date

type: String
format: yyyyMMddHHmm
example: 201610140439
description: The timestamp representing when the document was collected from the live web.
learn more: Timestamp

last_modified_date

type: String
format: yyyyMMddHHmm
example: 201211041756
description: The timestamp, when available, reported by the live web server at collection time to represent when the document was last changed on the live web.
learn more: Last modified date

domain

type: String
format: sld.tld
example: archive.org
description: The host site serving the document on the live web at the time of collection.
learn more: Domain

url

type: String
format: protocol://hostname/directory/filename
example: https://archive.org/about/terms.php
description: The location of the document on the live web at the time of collection.
learn more: URL

mime_type_web_server

type: String
format: type/format
example: text/html
description: The format of the document as reported by the live web server at collection time.
learn more: MIME

mime_type_tika

type: String
format: type/format
example: text/html
description: The format of the document as reported by ARCH, running Apache Tika.
learn more: Apache Tika

language

type: String
format: ISO 639-1
example: en
description: The language of the document's body content as reported by ARCH, running Apache Tika.
learn more: Apache Tika

content

type: String
example: Hello, world!
description: The plain text body content of the HTML or other text-bearing web document, with markup removed.
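The columns described above can be read with Python's built-in `csv` module. The sketch below is illustrative rather than an official ARCH workflow: it parses the timestamps, keeps only rows tagged with a given language, and counts word frequencies in the `content` column. The sample rows and helper names are hypothetical, and the timestamp parser assumes that the 12-digit example values are laid out as year, month, day, hour, and minute.

```python
import csv
import io
import re
from collections import Counter
from datetime import datetime

# Hypothetical sample rows using the column names documented above.
SAMPLE_CSV = """crawl_date,last_modified_date,domain,url,mime_type_web_server,mime_type_tika,language,content
201610140439,201211041756,archive.org,https://archive.org/about/terms.php,text/html,text/html,en,"Hello, world! Hello again."
201610150101,,example.org,https://example.org/bonjour,text/html,text/html,fr,Bonjour le monde
"""


def parse_timestamp(value):
    """Parse a 12-digit timestamp such as 201610140439; return None if empty."""
    # Assumption: the final two digits are minutes (yyyyMMddHHmm).
    return datetime.strptime(value, "%Y%m%d%H%M") if value else None


def word_counts(rows, language="en"):
    """Count lowercase word tokens in the content of rows in one language."""
    counts = Counter()
    for row in rows:
        if row["language"] == language:
            counts.update(re.findall(r"\w+", row["content"].lower()))
    return counts


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
crawl_times = [parse_timestamp(row["crawl_date"]) for row in rows]
english_counts = word_counts(rows, language="en")
```

Real `content` values can be very long. If the `csv` module reports that a field exceeds its size limit, the limit can be raised with `csv.field_size_limit` before reading the file.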

Text files dataset

The text files dataset lists locations, metadata, and extracted text contents for text-based web documents in the following formats: CSS, JSON, XML, plain text, JS, and HTML. This dataset is packaged as a CSV file with the following columns: crawl_date, last_modified_date, url, filename, extension, MIME type (both as provided by the web server and as detected by Apache Tika), MD5 and SHA-1 checksum values, and content. 📚 Read the code.
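Because this dataset mixes several formats in one CSV file, a common first step is to see how the rows break down by file type. The sketch below groups rows by the `extension` column and reports how many files and how many characters of content each extension contributes. The sample rows are hypothetical and include only a subset of the columns, and the helper name is an assumption made for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows; only a subset of the dataset's columns is shown.
SAMPLE_CSV = """crawl_date,url,filename,extension,content
201610140439,https://example.org/site.css,site.css,css,body { margin: 0; }
201610140439,https://example.org/data.json,data.json,json,"{""ok"": true}"
201610140440,https://example.org/print.css,print.css,css,p { color: black; }
"""


def summarize_by_extension(rows):
    """Return {extension: (file_count, total_content_characters)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        entry = totals[row["extension"].lower()]
        entry[0] += 1                    # one more file with this extension
        entry[1] += len(row["content"])  # characters of extracted content
    return {ext: tuple(values) for ext, values in totals.items()}


summary = summarize_by_extension(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

The same grouping can be used to split the dataset into per-format subsets, for example to analyze only the JSON or only the plain text documents.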

Example uses

Queer Webcomics Archives Project - an Archives Unleashed Datathon project that explored the representation of queer identities within the Global Webcomics web archive collection. A variety of methods and visualizations were produced while analyzing text and domains datasets.
Autism Discourse in the U.S: An Exploratory Analysis - another Archives Unleashed datathon project that explored the discourse among autism bloggers in the Autism and Alzheimers web archive collection, using two primary modes of analysis: sentiment analysis (using NLTK) and network analysis (Gephi).
#teamwildfyre - an Archives Unleashed datathon project that explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full text dataset files, with a focus on named entity recognition (NER) and geocoding.
Creating Collection Growth Curves with AUT and Hypercane - these blog posts describe the process of using various tools to explore web archives collection growth curves. This can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."

Example text sentiment analysis from the Autism and Alzheimers web archive collection

Tool recommendations

The following software and tool libraries can be used to parse and analyze the text extracted from collections.

To learn more and practice exploring ARCH text datasets, see our tutorials:

Natural language processing (NLP)

AntConc - Desktop software for natural language processing (NLP).
nltk (Natural Language Toolkit) - Widely supported command line toolset for NLP written in the Python programming language.
spaCy - Command line toolset for NLP, named entity recognition, and topic modeling written in Python.
Voyant - Browser-based NLP platform with web-enabled analysis and visualization modules optimized for digital humanities applications.
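For a quick first pass before loading text into one of the tools above, word and n-gram frequencies can be computed with the Python standard library alone. The following is a minimal sketch. The function names and sample sentences are hypothetical and are not part of any tool listed here, and the tokenizer is deliberately simple (lowercase letters, digits, and apostrophes).

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngrams(texts, n=2, k=3):
    """Count n-grams within each text separately and return the k most common."""
    counts = Counter()
    for text in texts:
        # Counting per text keeps n-grams from spanning document boundaries.
        counts.update(ngrams(tokenize(text), n))
    return counts.most_common(k)


docs = [
    "The web archive preserves the web.",
    "A web archive collection grows over time.",
]
bigrams = top_ngrams(docs, n=2, k=1)
```

Setting `n=1` gives plain word frequencies, and larger values of `n` surface longer recurring phrases.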

Named entity recognition

NER (Stanford Named Entity Recognizer) - Command line tool written in Java for labeling recognized word sequences as persons, organizations, or locations. This is the tool that ARCH uses to create named entity datasets, but it can also be run directly on other text datasets.

Topic modeling

BERTopic - Richly featured command line toolset for topic modeling written in Python.
CorEx Topic Model - Command line tool for topic modeling written in Python and optimized for guided domain specificity through the use of "anchor words."
MALLET - Java-based command line package for topic modeling and other machine learning applications.

Arch-dataset-example_Text.csv (2 KB)
ARCH-dataset-example_speech.jsonl (20 KB)
ARCH-dataset-example_ocr.jsonl (404 Bytes)
ARCH-dataset-example_extracted-text.json (4 KB)