ARCH file format datasets

Overview

ARCH file format datasets contain the information needed to find, describe, and use the files contained in a web archive. Depending on the dataset, they describe audio, image, PDF, presentation, spreadsheet, video, or word processing files.

Datasets

Audio information

This dataset includes information about files from a web archive collection in audio formats like MP3, WAV, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.

Image information

This dataset includes information about files from a web archive collection in image formats like GIF, JPEG, PNG, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, width, height, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.

PDF file information

This dataset includes information about files from a web archive collection in the Portable Document Format (PDF). A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.

Presentation file information

This dataset includes information about files from a web archive collection created with presentation programs like PowerPoint, Keynote, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.

Spreadsheet file information

This dataset includes information about files from a web archive collection in spreadsheet formats like CSV, XLS, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📚 Read the code.

Video file information

This dataset includes information about files from a web archive collection in video formats like MP4, MOV, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.

Word processing file information

This dataset includes information about files from a web archive collection in word processing document formats like DOC, RTF, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
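
Because every dataset above carries url and md5 columns, one simple use of the checksum values is to spot identical files that were captured under more than one URL. The following is a minimal Python sketch of that idea, not part of ARCH itself. It assumes the CSV header row uses the column names url and md5 exactly as listed above; the function names and the sample rows are hypothetical and only serve as illustration.

```python
import csv
import io
from collections import defaultdict


def group_urls_by_md5(csv_file):
    """Map each md5 checksum to the list of URLs whose files share it."""
    groups = defaultdict(list)
    # Assumption: the header row names the columns "md5" and "url".
    for row in csv.DictReader(csv_file):
        groups[row["md5"]].append(row["url"])
    return groups


def find_duplicates(csv_file):
    """Keep only the checksums that appear under more than one URL."""
    groups = group_urls_by_md5(csv_file)
    return {md5: urls for md5, urls in groups.items() if len(urls) > 1}


# Hypothetical sample rows standing in for a downloaded dataset file.
sample = io.StringIO(
    "crawl_date,url,md5\n"
    "20130113,http://example.org/a.pdf,aaa\n"
    "20130114,http://example.org/copy-of-a.pdf,aaa\n"
    "20130115,http://example.org/b.pdf,bbb\n"
)

for md5, urls in find_duplicates(sample).items():
    print(md5, urls)
```

To run this against a real dataset, open the downloaded CSV file with open(path, newline="", encoding="utf-8") and pass the file object to find_duplicates in place of the sample.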

Example use

Extracting files

ARCH file format datasets do not contain the files themselves, but you can use a dataset to download the files to your preferred local or networked storage. Follow the instructions below to create a list of archival storage URLs and download the original files with a tool like wget.

Each item in a file format dataset includes two attributes that are necessary to find and download it from archival storage: crawl_date and url. These can be combined with a standard Wayback replay prefix to create an archival URL in the following format: prefix/crawl_date/url.

Begin by defining your preferred Wayback prefix based on the archival storage repository from which you intend to download:

  • An Archive-It collection: https://wayback.archive-it.org/{collection #}/
  • All Archive-It collections: https://wayback.archive-it.org/all/
  • The Internet Archive's Wayback Machine: https://web.archive.org/web/

Append the crawl_date value and a trailing slash to the end of your Wayback prefix, followed by the original url value.

For example, given the following data from the Life of Aaron Swartz web archive collection's PDF information dataset:

crawl_date   url
20130113     http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf

An archival URL to retrieve the original PDF document from the collection would be:

https://wayback.archive-it.org/3492/20130113/http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf
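
The same prefix/crawl_date/url pattern can be applied to every row of a dataset to produce a list of archival URLs. The following is a minimal Python sketch under a few assumptions: the CSV header row uses the column names crawl_date and url as listed above, and the function names build_archival_url and archival_urls are hypothetical helpers written for this illustration.

```python
import csv
import io

# Wayback prefix for the Archive-It collection used in the example above.
PREFIX = "https://wayback.archive-it.org/3492/"


def build_archival_url(prefix, crawl_date, url):
    """Join prefix, crawl_date, and url as prefix/crawl_date/url."""
    # Strip any trailing slash first so the result never contains "//"
    # between the prefix and the crawl date.
    return f"{prefix.rstrip('/')}/{crawl_date}/{url}"


def archival_urls(csv_file, prefix):
    """Yield one archival URL per row of a file format dataset."""
    # Assumption: the header row names the columns "crawl_date" and "url".
    for row in csv.DictReader(csv_file):
        yield build_archival_url(prefix, row["crawl_date"], row["url"])


# The single row from the example above, standing in for a full dataset.
sample = io.StringIO(
    "crawl_date,url\n"
    "20130113,http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf\n"
)

for archival_url in archival_urls(sample, PREFIX):
    print(archival_url)
```

Writing each printed URL on its own line of a text file gives a list that a download tool can read as input.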

Tool recommendations

Extracting files

  • wget - Widely supported command line tool written in C for downloading lists of files from the web.
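
As a rough illustration of how wget can be driven from a script, the Python sketch below assembles a wget command that reads a list of archival URLs from a text file and saves the downloads into a chosen directory. The file name urls.txt, the directory name files, and the helper names are hypothetical. The options used (--input-file, --directory-prefix, --wait, and --no-clobber) are standard GNU wget options, but the one-second wait is only an assumed courtesy setting, not a documented requirement of any archive.

```python
import shlex
import subprocess


def wget_command(url_list_path, output_dir, wait_seconds=1):
    """Build a wget command that downloads every URL listed in a file."""
    return [
        "wget",
        f"--input-file={url_list_path}",    # read URLs from this file, one per line
        f"--directory-prefix={output_dir}", # save downloaded files in this directory
        f"--wait={wait_seconds}",           # pause between requests
        "--no-clobber",                     # skip files that already exist locally
    ]


def run_wget(url_list_path, output_dir):
    """Run the command; requires wget to be installed and on the PATH."""
    subprocess.run(wget_command(url_list_path, output_dir), check=True)


# Print the command for review instead of running it right away.
print(shlex.join(wget_command("urls.txt", "files")))
```

Printing the command first makes it easy to check the options before calling run_wget on a long list of URLs.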

Text analysis

See the tool recommendations for ARCH text datasets for tools that can be used on the text extracted from files in PDF file information and Word processing file information datasets.

Optical character recognition (OCR)

Programming Historian's tutorial Working with batches of PDF files covers how to perform OCR and text extraction with Tesseract or Poppler and how to get an overview of large numbers of PDF documents using topic modelling.
