Overview
ARCH file format datasets contain the information needed to find, describe, and use the files contained in a web archive. They may include information about audio, image, PDF, presentation, spreadsheet, video, or word processing files.
Datasets
Audio information
This dataset includes information about files from a web archive collection in audio formats like MP3, WAV, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
Image information
This dataset includes information about files from a web archive collection in image formats like GIF, JPEG, PNG, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, width, height, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
PDF file information
This dataset includes information about files from a web archive collection in the Portable Document Format (PDF). A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
Presentation file information
This dataset includes information about files from a web archive collection created by presentation programs like PowerPoint, Keynote, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
Spreadsheet file information
This dataset includes information about files from a web archive collection in spreadsheet formats like CSV, XLS, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📚 Read the code.
Video file information
This dataset includes information about files from a web archive collection in video formats like MP4, MOV, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
Word processing file information
This dataset includes information about files from a web archive collection in word processing document formats like DOC, RTF, and more. A CSV file presents data in the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, and md5 and sha1 checksum values. 📥 Download an example. 📚 Read the code.
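Because every file format dataset shares the columns listed above, a few lines of Python are enough to get a first overview of one. The sketch below is illustrative and not part of ARCH: it assumes the CSV has a header row using the column names given above, and the function name summarize_dataset and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def summarize_dataset(csv_file):
    """Count files per extension and group URLs that share an md5 checksum."""
    extension_counts = Counter()
    urls_by_md5 = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Lowercase the extension so "PDF" and "pdf" are counted together.
        extension_counts[row["extension"].lower()] += 1
        urls_by_md5[row["md5"]].append(row["url"])
    # Keep only checksums seen more than once: byte-identical files.
    duplicates = {md5: urls for md5, urls in urls_by_md5.items() if len(urls) > 1}
    return extension_counts, duplicates


# Hypothetical sample rows; a real dataset has more columns and many more rows.
SAMPLE = """crawl_date,url,filename,extension,md5
20130113,http://example.org/a.pdf,a.pdf,pdf,aaa111
20130114,http://example.org/copy/a.pdf,a.pdf,pdf,aaa111
20130115,http://example.org/b.PDF,b.PDF,PDF,bbb222
"""

counts, duplicates = summarize_dataset(io.StringIO(SAMPLE))
print(counts.most_common())
print(duplicates)
```

To run this against a downloaded dataset, pass a file object opened with open(path, newline="", encoding="utf-8") in place of the in-memory sample.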
Example use
- Non-textual content in the DC Punk web archive - this Archives Unleashed Datathon project explored non-textual elements, specifically audio and video objects. Project members used data to download images and explore file type frequencies.
Example gallery projection of images found in the D.C. Punk (Web) Archive collection.
Extracting files
ARCH file format datasets do not contain the files themselves, but you can use a dataset to download the files to your preferred local or networked storage. Follow the instructions below to create a list of archival storage URLs and download the original files with a tool like wget.
Each item in a file format dataset includes two attributes that are necessary to find and download it from archival storage: crawl_date and url. These can be combined with a standard Wayback replay prefix to create an archival URL in the following format: prefix/crawl_date/url.
Begin by defining your preferred Wayback prefix based on the archival storage repository from which you intend to download:
- An Archive-It collection: https://wayback.archive-it.org/{collection #}/
- All Archive-It collections: https://wayback.archive-it.org/all/
- The Internet Archive's Wayback Machine: https://web.archive.org/web/
Append the crawl_date value and a trailing slash to the end of your Wayback prefix, followed by the original url value.
For example, given the following data from the Life of Aaron Swartz web archive collection's PDF information dataset:
crawl_date | url |
20130113 | http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf |
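An archival URL to retrieve the original PDF document from the collection would be:
https://wayback.archive-it.org/{collection #}/20130113/http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf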
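The same construction can be expressed in a few lines of Python. This is a minimal sketch, not ARCH code: the function name archival_url is hypothetical, and the Wayback Machine prefix is used here only because the collection number is not given in this example.

```python
def archival_url(prefix, crawl_date, url):
    """Join a Wayback prefix, a crawl_date, and an original url."""
    # Normalize the prefix so the result has exactly one slash after it.
    return f"{prefix.rstrip('/')}/{crawl_date}/{url}"


# One of the prefixes listed above; swap in an Archive-It prefix as needed.
WAYBACK_MACHINE = "https://web.archive.org/web/"

example = archival_url(
    WAYBACK_MACHINE,
    "20130113",
    "http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf",
)
print(example)
```

Stripping and re-adding the slash means the function behaves the same whether or not the prefix was written with its trailing slash.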
Tool recommendations
Extracting files
- wget - Widely supported command line tool written in C for downloading lists of files from the web.
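One way to prepare input for wget is to write every archival URL from a dataset to a plain text file, one per line, and pass that file to wget with its --input-file (-i) option. The following sketch assumes a CSV with a header row containing crawl_date and url; the function name write_url_list, the file names, and the in-memory sample are hypothetical.

```python
import csv
import io


def write_url_list(csv_file, prefix, out_file):
    """Write one archival URL per dataset row and return how many were written."""
    written = 0
    for row in csv.DictReader(csv_file):
        out_file.write(f"{prefix.rstrip('/')}/{row['crawl_date']}/{row['url']}\n")
        written += 1
    return written


# Hypothetical in-memory dataset; for real use, open("dataset.csv", newline="").
dataset = io.StringIO(
    "crawl_date,url\n"
    "20130113,http://www.wired.com/images_blogs/threatlevel/2012/09/swartzsuperseding.pdf\n"
)
url_list = io.StringIO()
count = write_url_list(dataset, "https://wayback.archive-it.org/all/", url_list)
print(count)
print(url_list.getvalue(), end="")

# With the list saved as archival_urls.txt, a download might look like:
#   wget --wait=1 --input-file=archival_urls.txt --directory-prefix=files
```

The --wait option pauses between requests, which is a courteous default when retrieving many files from archival storage.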
Text analysis
See the tool recommendations for ARCH text datasets for tools that can be used on the text extracted from files in PDF file information and Word processing file information datasets.
Optical character recognition (OCR)
Programming Historian's tutorial Working with batches of PDF files covers how to perform OCR and text extraction with Tesseract or Poppler and how to get an overview of large numbers of PDF documents using topic modelling.