How to download and open ARCH datasets

Overview

You can download your ARCH datasets to work with them locally from the command line, in desktop applications, or in web-based programs. Follow the instructions below to download each dataset from the Internet Archive's online storage and read its full contents.

When you are ready to explore the contents of datasets with additional analysis and visualization tools, see the tutorial guide: Sample ARCH datasets and how to explore them.

Instructions

On this page:

  1. Downloading files
  2. Opening files
  3. More help and information

Downloading files

Each dataset's detail page includes a Dataset(s) heading from which its full contents can be downloaded to local or network storage. Click the cloud icon at the right-hand side of the table to download each file:

ARCH-Datasets_Download-2.png

WAT, named entity, and LGA files

Web Archive Transformation (WAT), named entity, and Longitudinal Graph Analysis (LGA) datasets contain one file of JSON-formatted data per WARC file in the collection. Because a single dataset can therefore consist of many files, they must be downloaded in bulk using the Web Archiving Systems API (WASAPI) rather than individually from ARCH.

Each WAT, named entity, or LGA dataset detail page includes a URL to find its files and the instructions to download them in bulk from a command line interface:

ARCH-Datasets_Download-ARS.png
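To illustrate what the file-listing step produces, here is a minimal Python sketch that pulls download URLs out of one page of WASAPI results. It assumes the response follows the general WASAPI layout, in which a top-level `files` list holds entries that each carry a `locations` list. The sample response, filenames, and example.org URLs are invented for illustration; check them against the URL and instructions shown on your own dataset's detail page.

```python
import json


def file_urls(wasapi_page):
    """Return the first download location of every file listed in one page of results."""
    urls = []
    for entry in wasapi_page.get("files", []):
        locations = entry.get("locations") or []
        if locations:
            urls.append(locations[0])
    return urls


# Hypothetical, hand-written response; real responses come from the URL
# shown on the dataset's detail page and may contain more fields.
sample_page = json.loads("""
{
  "count": 2,
  "files": [
    {"filename": "part-00000.gz", "locations": ["https://example.org/files/part-00000.gz"]},
    {"filename": "part-00001.gz", "locations": ["https://example.org/files/part-00001.gz"]}
  ]
}
""")

if __name__ == "__main__":
    for url in file_urls(sample_page):
        print(url)
```

The resulting list of URLs is the kind of input a download tool such as wget would then work through.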

Dependencies

These instructions use your computer's built-in command line program curl to locate files, jq to compile a list of files to download, and wget to download them. Make sure that all three programs are installed on your computer before you download files from storage with WASAPI.
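One quick way to confirm that the three programs are available is to look for them on your PATH. The following is a small sketch using Python's standard library; the function name `missing_tools` is made up for this example.

```python
import shutil

# Tools named in the Dependencies section above.
REQUIRED_TOOLS = ("curl", "jq", "wget")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not installed or not on PATH:", ", ".join(missing))
    else:
        print("curl, jq, and wget are all available.")
```

Any program it reports as missing needs to be installed before running the bulk download instructions.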

Opening files

Each ARCH dataset file is compressed for storage with Gzip. Begin by uncompressing the file, then open and read its contents with one of the desktop or command line programs recommended below:

Desktop

Unzip all files using your computer's built-in archiving utility or a program such as The Unarchiver (Mac), Keka (Mac), 7-Zip (PC), or WinZip (PC).

Unzipped datasets in the CSV file format may be rendered by a spreadsheet program like Microsoft Excel, Google Sheets, or OpenOffice Calc.

To read the JSON-formatted data from WAT, named entity, and LGA files, use your computer's built-in text editor, such as TextEdit (Mac) or Notepad (PC), or a dedicated editor like Sublime Text.
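Large JSON files can be slow to open in a text editor, so it may help to read them record by record instead. The sketch below assumes each line of the file holds one JSON object, which this article does not state; if your files are laid out differently, the parsing step will need to change. The function name `iter_json_records` is hypothetical.

```python
import gzip
import json


def iter_json_records(path):
    """Yield one parsed JSON value per line; works on .gz or uncompressed files."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Assumption: lines that are not valid JSON are skipped silently.
                continue
```

For example, `next(iter_json_records("my-file.json.gz"))` would return the first record, where `my-file.json.gz` stands in for one of your downloaded files.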

Command line

Any file can be uncompressed in a Linux or Unix environment by running the gunzip command:

gunzip my-file.csv.gz
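Where the gunzip command is not available, the same step can be sketched with Python's standard library. Note one difference from gunzip's default behavior: this version leaves the original .gz file in place. The function name `gunzip_file` is made up for this example.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path):
    """Decompress gz_path into the same folder and return the new file's path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")
    out_path = gz_path.with_suffix("")  # my-file.csv.gz -> my-file.csv
    # Stream the data in chunks so large datasets do not have to fit in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path
```

Calling `gunzip_file("my-file.csv.gz")` writes `my-file.csv` alongside the compressed file.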

Uncompressed files in the CSV file format may be explored from the command line as DataFrames, similar to a spreadsheet, using a program like pandas, dask, or polars. For a demonstration of DataFrames using pandas, see the tutorial: Explore web archive data from the command line with Jupyter Notebooks.
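Before loading a large file into a DataFrame, it can be useful to glance at its header and first few rows. This is a minimal sketch using only Python's standard library rather than pandas, dask, or polars; it assumes the file is UTF-8 encoded, and the function name `preview_csv` is hypothetical.

```python
import csv
import gzip
from itertools import islice


def preview_csv(path, rows=5):
    """Return (header, first_rows) from a CSV file, compressed (.gz) or not."""
    opener = gzip.open if str(path).endswith(".gz") else open
    # Assumption: the file is UTF-8 encoded text.
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        return header, list(islice(reader, rows))


if __name__ == "__main__":
    # "my-file.csv" stands in for one of your uncompressed dataset files.
    header, first_rows = preview_csv("my-file.csv")
    print(header)
    for row in first_rows:
        print(row)
```

Because it reads only the first few rows, this check stays fast even on very large datasets.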

More help and information

🙋 Having trouble? Submit a support ticket here.

 
