Overview
You can download your ARCH datasets to work with them locally from the command line, on the desktop, or in web-based programs. Follow the instructions below to download each dataset from the Internet Archive's online storage and read its full contents.
When you are ready to explore the contents of datasets with additional analysis and visualization tools, see the tutorial guide: Sample ARCH datasets and how to explore them.
Instructions
Downloading files
Each dataset's detail page includes a Dataset(s) heading from which its full contents can be downloaded to local or network storage. Click the cloud icon at the right-hand side of the table to download each file.
WAT, named entity, and LGA files
Web Archive Transformation (WAT), named entity, and Longitudinal Graph Analysis (LGA) datasets contain one file of JSON formatted data per WARC file in the collection, so they must be downloaded in bulk using the Web Archiving Systems API (WASAPI) rather than individually from ARCH.
Each WAT, named entity, or LGA dataset detail page includes a URL for locating its files and instructions for downloading them in bulk from a command line interface.
Dependencies
These instructions use your computer's built-in curl command line program to locate files, jq to compile a list of files to download, and wget to download the files. Make sure that all three programs are installed on your computer before downloading files from storage with WASAPI.
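To illustrate the "compile a list of files" step, here is a minimal Python sketch of what jq does with a WASAPI listing. It assumes the response follows the general WASAPI layout, with a top-level `files` array whose entries each carry a `locations` list. The sample listing, filenames, and example.org URLs are hypothetical placeholders, not real ARCH output, so check the response shown on your own dataset's detail page before relying on these field names.

```python
import json

# Hypothetical WASAPI-style listing; the filenames and URLs are placeholders.
SAMPLE_LISTING = """
{
  "count": 2,
  "next": null,
  "files": [
    {"filename": "part-00000.wat.gz",
     "locations": ["https://example.org/files/part-00000.wat.gz"]},
    {"filename": "part-00001.wat.gz",
     "locations": ["https://example.org/files/part-00001.wat.gz"]}
  ]
}
"""


def file_urls(listing_json):
    """Return the first download location of each file in one listing page."""
    listing = json.loads(listing_json)
    return [
        entry["locations"][0]
        for entry in listing.get("files", [])
        if entry.get("locations")  # skip entries without any location
    ]


urls = file_urls(SAMPLE_LISTING)
for url in urls:
    print(url)
```

Saving the printed URLs to a text file, one per line, gives the kind of list that a bulk downloader such as wget can read.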
Opening files
Each ARCH dataset file is compressed for storage with Gzip. Begin by uncompressing the file, then open and read its contents with one of the desktop or command line programs recommended below:
Desktop
Unzip all files using your computer's built-in archiving utility or a program like The Unarchiver (Mac), Keka (Mac), 7-Zip (PC), or WinZip (PC).
Unzipped datasets in the CSV file format may be rendered by a spreadsheet program like Microsoft Excel, Google Sheets, or OpenOffice Calc.
To read the JSON-formatted data from WAT, named entity, and LGA files, use your computer's built-in text editor, such as TextEdit (Mac) or Notepad (PC), or a dedicated editor like Sublime Text.
Command line
Any file can be uncompressed in a Linux or Unix environment by running the gunzip command:
gunzip my-file.csv.gz
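Where gunzip is not available, the same step can be sketched with Python's standard library. The helper name `gunzip_file` and the demo file contents are hypothetical. Unlike gunzip, which removes the compressed file by default, this sketch keeps the original unless told otherwise.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_file(src, keep_original=True):
    """Uncompress `src` (ending in .gz) next to itself and return the new path."""
    src = Path(src)
    if src.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {src.name}")
    dest = src.with_suffix("")  # my-file.csv.gz -> my-file.csv
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as plain:
        shutil.copyfileobj(compressed, plain)  # streams, so large files are fine
    if not keep_original:
        src.unlink()
    return dest


# Demo on a small, made-up file in a temporary directory.
demo_dir = Path(tempfile.mkdtemp())
demo_gz = demo_dir / "my-file.csv.gz"
with gzip.open(demo_gz, "wb") as f:
    f.write(b"url,count\nexample.org,3\n")

demo_out = gunzip_file(demo_gz)
print(demo_out.name)
```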
Uncompressed files in the CSV file format may be explored as DataFrames, which are similar to spreadsheets, using a Python library like pandas, Dask, or Polars. For a demonstration of DataFrames using pandas, see the tutorial: Explore web archive data from the command line with Jupyter Notebooks.
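Before loading a large file into a DataFrame library, it can help to check its header and first few rows. The sketch below does this with Python's standard library only and works on both compressed and uncompressed files. The helper name `preview_csv` and the `domain,count` columns in the demo are hypothetical, since the actual columns depend on the dataset.

```python
import csv
import gzip
import itertools
import tempfile
from pathlib import Path


def preview_csv(path, n=5):
    """Return (header, first n rows) of a CSV file, gzipped or not."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no rows
        rows = list(itertools.islice(reader, n))
    return header, rows


# Demo on a small, made-up dataset in a temporary directory.
sample_path = Path(tempfile.mkdtemp()) / "sample.csv.gz"
with gzip.open(sample_path, "wb") as f:
    f.write(b"domain,count\nexample.org,3\nexample.com,1\n")

header, rows = preview_csv(sample_path, n=5)
print(header)
for row in rows:
    print(row)
```

Every value comes back as a string here. DataFrame libraries infer column types, which is one reason to move to them once the layout of the file is confirmed.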
More help and information
🙋 Having trouble? Submit a support ticket here.