Overview
ARCH image datasets contain information to find, describe, use, and/or transcribe image files in any collection.
Datasets
Image file information
Locations and metadata for files in JPEG, PNG, GIF, and other image formats in the collection. Output: one CSV with columns for crawl date, last modified date, URL, file name, file format extension, MIME type as reported by the web server and as detected by Apache Tika, and MD5 and SHA1 hash values. Download an example. Read the code.
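As a quick first look at this dataset, the CSV can be tallied by file format. The sketch below is a minimal illustration rather than ARCH's own code; the header name `extension` and the sample rows are assumptions, so check them against the header row of your downloaded file.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column="extension"):
    """Count rows per distinct value of `column` in an open CSV file.

    `column` defaults to "extension", an assumed header name.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up rows standing in for a downloaded image file information CSV.
sample = io.StringIO(
    "crawl_date,url,filename,extension\n"
    "20200101,http://example.org/a.jpg,a.jpg,jpg\n"
    "20200101,http://example.org/b.png,b.png,png\n"
    "20200102,http://example.org/c.jpg,c.jpg,jpg\n"
)

counts = count_by_column(sample)
for ext, n in counts.most_common():
    print(ext, n)
```

To run this on a real file, replace the in-memory sample with `open("your-file.csv", newline="", encoding="utf-8")`. The same function can tally any other column, for example one of the MIME type columns.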
Image graph
Timestamp, location, and any original description for each image file in the collection. Output: one CSV with columns for crawl date, source page, image file URL, and alt text. Download an example. Read the code.
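Because each row links a source page to an image, the CSV can be grouped by page to see which images each page embeds and which lack alt text. The following is a rough sketch under assumed header names (`src`, `image_url`, `alt_text`) and made-up rows; adjust the names to match your file.

```python
import csv
import io
from collections import defaultdict


def group_images(csv_file, page_col="src", image_col="image_url", alt_col="alt_text"):
    """Return (images per source page, image URLs with empty alt text).

    The three column names are assumptions, not confirmed ARCH headers.
    """
    by_page = defaultdict(set)
    missing_alt = set()
    for row in csv.DictReader(csv_file):
        by_page[row[page_col]].add(row[image_col])
        if not row[alt_col].strip():
            missing_alt.add(row[image_col])
    return by_page, missing_alt


# Made-up rows standing in for a downloaded image graph CSV.
sample = io.StringIO(
    "crawl_date,src,image_url,alt_text\n"
    "20200101,http://example.org/,http://example.org/logo.png,Site logo\n"
    "20200101,http://example.org/,http://example.org/band.jpg,\n"
    "20200102,http://example.org/shows,http://example.org/logo.png,Site logo\n"
)

by_page, missing_alt = group_images(sample)
for page, images in by_page.items():
    print(page, len(images))
print("Images without alt text:", sorted(missing_alt))
```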
Text recognition
Text recognized and transcribed from images in a collection, including handwriting. Output: one JSONL file comprising a JSON object for each input record. Values include filename and path, MIME type, and transcription. Download an example. Read the code.
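A JSONL file holds one JSON object per line, so it can be read line by line without loading the whole file into memory. Here is a minimal sketch; the keys `filename` and `transcription` and the sample records are hypothetical, so inspect one line of your file to find the real key names.

```python
import io
import json


def read_transcriptions(jsonl_file, name_key="filename", text_key="transcription"):
    """Yield (file name, transcription) pairs from an open JSONL file.

    Both key names are hypothetical placeholders for the dataset's real keys.
    """
    for line in jsonl_file:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        yield record.get(name_key, ""), record.get(text_key, "")


# Made-up records standing in for a downloaded text recognition file.
sample = io.StringIO(
    '{"filename": "flyer.jpg", "transcription": "ALL AGES SHOW"}\n'
    '{"filename": "logo.png", "transcription": ""}\n'
)

pairs = list(read_transcriptions(sample))
with_text = [name for name, text in pairs if text]
print(with_text)
```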
Example uses
- Non-textual content in the DC Punk web archive (PDF) - This Archives Unleashed Datathon project explored non-textual elements, specifically audio and video objects. Project members used data to download images and explore file type frequencies.
Example gallery projection of images found in the D.C. Punk (Web) Archive collection.
Tool recommendations
Extracting files
- wget - Widely supported command-line tool, written in C, for downloading lists of files from the web. For our guide to extracting files from archival storage locations, see: Extracting files.
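wget can read the URLs to download from a plain text file, one per line, through its `-i` (`--input-file`) option. One way to prepare such a list from an image dataset CSV is sketched below; the header name `url`, the sample rows, and the output file name `urls.txt` are assumptions for illustration.

```python
import csv
import io


def extract_urls(csv_file, column="url"):
    """Return the unique, non-empty values of `column`, in first-seen order.

    `column` defaults to "url", an assumed header name.
    """
    seen = set()
    urls = []
    for row in csv.DictReader(csv_file):
        url = row[column].strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def write_url_list(urls, out_file):
    """Write one URL per line, the format wget expects for -i."""
    for url in urls:
        out_file.write(url + "\n")


# Made-up rows standing in for a downloaded CSV.
sample = io.StringIO(
    "crawl_date,url\n"
    "20200101,http://example.org/a.jpg\n"
    "20200102,http://example.org/a.jpg\n"
    "20200102,http://example.org/b.png\n"
)

urls = extract_urls(sample)
listing = io.StringIO()
write_url_list(urls, listing)
print(listing.getvalue(), end="")
```

Writing to a real file instead, for example `open("urls.txt", "w", encoding="utf-8")`, produces a list that can be passed to wget as `wget -i urls.txt`. Note that the URLs in the dataset are the original web addresses, which may no longer be live.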
Image analysis
Palladio
Palladio is a free, open source suite of tools for representing tabular datasets as graphs, maps, and galleries. It is hosted on the web and maintained by the Humanities + Design research laboratory at Stanford University. See our sample ARCH data tutorials to practice short image dataset explorations using Palladio.
Juxta
Juxta is a shell script for generating image collages to embed in webpages. The Archives Unleashed project wrote a guide to building a Juxta collage with data from the Image graph derivative file: Creating Juxta Collages with Web Archive Images.
Text analysis
See the tool recommendations for ARCH text datasets for tools that can be used on the text recognized in images in the Text recognition dataset.