ARCH Image datasets

Karl Blumenthal

Updated August 05, 2025 23:47

Overview

ARCH image datasets contain information to find, describe, use, and/or transcribe image files in any collection.

On this page:

Datasets
Example uses
Tool recommendations

Datasets

Image file information

Locations and metadata for JPEG, PNG, GIF, and other image-formatted files in the collection. Output: one CSV with columns for crawl date, last modified date, URL, file name, file format extension, MIME type as reported by the web server and as detected by Apache Tika, and MD5 and SHA-1 hash values. 📥 Download an example. 📚 Read the code.
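As a rough illustration of working with this CSV, the sketch below counts the MIME types detected by Apache Tika and lists the URLs where the server-reported type disagrees. The column names (`url`, `mime_type_web_server`, `mime_type_tika`) and the sample rows are assumptions made for the example; check the header row of your own download before reusing it.

```python
import csv
import io
from collections import Counter


def summarize_image_files(csv_text):
    """Count detected MIME types and collect URLs whose two MIME columns differ.

    Column names are assumed for illustration, not taken from ARCH documentation.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    detected = Counter()
    mismatches = []
    for row in reader:
        detected[row["mime_type_tika"]] += 1
        if row["mime_type_web_server"] != row["mime_type_tika"]:
            mismatches.append(row["url"])
    return detected, mismatches


# Made-up sample rows standing in for a real dataset download.
sample = (
    "crawl_date,url,mime_type_web_server,mime_type_tika\n"
    "20200101,http://example.org/a.png,image/png,image/png\n"
    "20200101,http://example.org/b.jpg,image/jpeg,image/jpeg\n"
    "20200102,http://example.org/c.png,text/html,image/png\n"
)

detected, mismatches = summarize_image_files(sample)
print(detected.most_common())
print(mismatches)
```

For a real file, the same function could be given the contents of the downloaded CSV read with `open(path, newline="", encoding="utf-8").read()`.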

Image graph

Timestamp, location, and any original description for each image file in the collection. Output: one CSV with columns for crawl date, source page, image file URL, and alt text. 📥 Download an example. 📚 Read the code.
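One possible use of the image graph is to see which images are embedded on the most pages and how many rows lack alt text. The sketch below assumes column names `src` (source page), `image_url`, and `alt_text`; these names and the sample rows are hypothetical and may differ from the actual file header.

```python
import csv
import io
from collections import defaultdict


def pages_per_image(csv_text):
    """Return ({image_url: number of distinct source pages}, rows without alt text).

    Column names are assumed for illustration.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    pages = defaultdict(set)
    missing_alt = 0
    for row in reader:
        pages[row["image_url"]].add(row["src"])
        if not (row["alt_text"] or "").strip():
            missing_alt += 1
    counts = {url: len(sources) for url, sources in pages.items()}
    return counts, missing_alt


# Made-up sample rows standing in for a real dataset download.
sample_graph = (
    "crawl_date,src,image_url,alt_text\n"
    "20200101,http://example.org/,http://example.org/logo.png,Site logo\n"
    "20200101,http://example.org/about,http://example.org/logo.png,\n"
    "20200101,http://example.org/about,http://example.org/band.jpg,Band on stage\n"
)

page_counts, missing_alt = pages_per_image(sample_graph)
print(page_counts)
print(missing_alt)
```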

Text recognition

Text recognized and transcribed from images in a collection, including handwriting. Output: one JSONL file comprising a JSON object for each input record. Values include filename and path, MIME type, and transcription. 📥 Download an example. 📚 Read the code.
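Because the output is JSONL, each line can be parsed on its own as a JSON object. The sketch below searches the transcriptions for a keyword and returns the matching file names. The keys `filename` and `transcription` are assumed from the description above, and the sample records are invented; inspect one line of your own file to confirm the actual key names.

```python
import json


def search_transcriptions(lines, keyword):
    """Return filenames whose transcription contains the keyword (case-insensitive).

    The keys "filename" and "transcription" are assumptions for illustration.
    """
    needle = keyword.lower()
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        text = record.get("transcription") or ""
        if needle in text.lower():
            matches.append(record.get("filename"))
    return matches


# Made-up sample records standing in for a real dataset download.
sample_lines = [
    '{"filename": "flyer.jpg", "mime_type": "image/jpeg", "transcription": "ALL AGES SHOW 7PM"}',
    "",
    '{"filename": "logo.png", "mime_type": "image/png", "transcription": ""}',
]

print(search_transcriptions(sample_lines, "all ages"))
```

An open file object can be passed in place of `sample_lines`, since iterating over a file yields one line at a time.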

Example uses

Non-textual content in the DC Punk web archive (PDF) - This Archives Unleashed Datathon project explored non-textual elements, specifically audio and video objects. Project members used the data to download images and explore file type frequencies.

Example gallery projection of images found in the D.C. Punk (Web) Archive collection.

Tool recommendations

Extracting files

wget - Widely supported command-line tool written in C for downloading lists of files from the web. For our guide to extracting files from archival storage locations, see: Extracting files.
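Since wget can read a list of URLs from a text file, one way to prepare that list is to pull the unique URLs out of an image dataset CSV. The sketch below assumes a column named `url`; the column name and the sample rows are hypothetical.

```python
import csv
import io


def unique_urls(csv_text, column="url"):
    """Return the URLs in the given column, de-duplicated, in first-seen order.

    The default column name is an assumption for illustration.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()
    urls = []
    for row in reader:
        url = (row[column] or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample rows standing in for a real dataset download.
sample_files = (
    "crawl_date,url\n"
    "20200101,http://example.org/a.png\n"
    "20200102,http://example.org/a.png\n"
    "20200102,http://example.org/b.jpg\n"
)

url_list = unique_urls(sample_files)
print("\n".join(url_list))
```

Saving this output to a file such as `urls.txt` (one URL per line) would allow it to be passed to wget with `wget -i urls.txt`. Whether each URL still resolves depends on the source; the Extracting files guide covers retrieval from archival storage locations.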

Image analysis

Palladio

Palladio is a free, open-source suite of tools for representing tabular datasets as graphs, maps, and galleries. It is hosted on the web and maintained by the Humanities + Design research laboratory at Stanford University. See our sample ARCH data tutorials to practice short image dataset explorations using Palladio.

Juxta

Juxta is a shell script for generating image collages to embed in webpages. The Archives Unleashed project wrote a guide to building a Juxta collage with data from the Image graph dataset: Creating Juxta Collages with Web Archive Images.

Text analysis

See the tool recommendations for ARCH text datasets for tools that can be used on the text recognized in images in the Text recognition dataset.