Quick guide to using ARCH

Karl Blumenthal

Updated December 03, 2025 01:22

Introduction

ARCH (Archives Research Compute Hub) is a research and education service that helps users build, access, and analyze digital collections computationally at scale. ARCH is currently configured to produce research-ready datasets from collections in Archive.org, Archive-It, Vault, and the Wayback Machine. Follow the instructions below to find your collections, create and share new datasets, and get technical support when you need it.

ℹ️ Need access to ARCH? Start here to reach our team.

ARCH was made possible in part by funding from the Mellon Foundation and via a long-running collaboration with the Archives Unleashed Project of the University of Waterloo and York University. Developing ARCH to support more users, collections, and dataset types is made possible with funding from the Institute of Museum and Library Services (LG-254878-OLS-23).

On this page:

Accessing your ARCH account
Browsing your collections
Creating ARCH datasets
Advanced use and tutorials

Accessing your ARCH account

🔑 Log in to your ARCH account here: https://arch.archive-it.org/.

Your ARCH account provides quick access to a dashboard of your source collections and recently created datasets:

Browsing your collections

Click on the Collections link in the top navigation bar anytime to view a complete list of collections to which you have ARCH access:

The table includes the following information about each collection:

Name: Name of the collection and link to view its full details.
Type: Source of the collection--from Archive.org, Archive-It, or a custom creation.
Latest Dataset: The dataset created most recently from this collection and a link to view its full details.
Dataset Date: The date on which the collection's most recent dataset was created.
Size: The full data volume of the collection (not the latest dataset).

To create a new dataset at any time, select the checkbox next to any collection and then click the Generate Dataset button.

Collection details

Click on the name of any collection in your table to see the collection's full details and datasets:

Each collection includes an Overview summary with the number of documents or files in the collection, its full data volume, a link to view it publicly, and any description imported from Archive.org or Archive-It.

Web archive collections from Archive-It also include the following information:

Seeds: The current number of seeds in the Archive-It web archive collection.
Crawl date: The date on which data was most recently collected from the live web.
Access: Whether the collection is visible publicly on archive-it.org or private.

The Datasets table at the bottom of this page lists the following information about each dataset created from the collection:

Name: The name of the dataset type.
Category: The dataset type's category (Collection, Network, Text, Images, Speech, or File Formats).
State: The current status of the dataset's creation (Queued, In Process, or Finished).
Started: The date and time at which the dataset was most recently requested.
Finished: The date and time at which the dataset was most recently created successfully.
Files: The number of total files in the dataset output.

Creating ARCH datasets

To create a new dataset, start by clicking on the Datasets navigation link at the top of the screen or click on the Generate New Dataset button at the top of any dataset table:

Select the source collection from the drop-down menu, choose a dataset category from the navigation bar, and select the dataset from the menu on the left:

Dataset types

ARCH datasets are organized into six main categories. Learn more about each and the available datasets here: ARCH Datasets.

Collection - Provide a high level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
Network - Explore connections among documents in a web archive collection visually. Learn such things as: which websites have the most in-bound or out-bound links; which paths can be taken through the networks that connect pages; and which communities exist within the same link structure.
Text - Find, extract, and analyze the text contents of documents. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, collocation, n-grams, topic modeling, geoparsing, and more.
Images - Find, describe, and graph or transcribe images from any collection. These datasets may be used to harvest archived files, visualize them as a network, and/or analyze text transcribed from images by optical character recognition (OCR).
Speech - Find, transcribe, and record named entities from speech recognized in audio and video files throughout any collection. As transcriptions, these datasets may also be used for textual analyses and enhancing item-level description.
File formats - Find, describe, and use the files contained in a web archive collection. These datasets may be used to extract information about audio, PDF, presentation, spreadsheet, video, or word processing files.

You may toggle the Sample option to include only the first 100 records from the collection in your dataset.
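
As an illustration of the kind of question the Network category is meant to answer (which websites have the most in-bound or out-bound links), here is a minimal Python sketch. It assumes an edge-list CSV with the hypothetical columns source, target, and count; check the header of your own dataset file, since the real column names may differ. The inline sample data is invented for the example.

```python
import csv
import io
from collections import Counter

def link_counts(rows):
    """Return (out_links, in_links) Counters keyed by domain."""
    out_links, in_links = Counter(), Counter()
    for row in rows:
        weight = int(row["count"])  # hypothetical column: number of links
        out_links[row["source"]] += weight  # hypothetical column: linking site
        in_links[row["target"]] += weight   # hypothetical column: linked site
    return out_links, in_links

# Tiny inline stand-in for a downloaded edge-list file (invented data).
SAMPLE = """source,target,count
example.org,archive.org,3
example.org,example.com,1
example.com,archive.org,2
"""

out_links, in_links = link_counts(csv.DictReader(io.StringIO(SAMPLE)))
print(in_links.most_common(1))   # domain receiving the most links
print(out_links.most_common(1))  # domain sending the most links
```

To run this against a real file, replace the io.StringIO(SAMPLE) argument with a file opened using open(path, newline="", encoding="utf-8").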

Dataset details

Find, use, and share any complete ARCH dataset by clicking on its name in the account's or any collection's Datasets table:

With each dataset* detail view you may:

Download and explore

Find metadata about your dataset's name, size, length, creation date, and unique hash, and a link to download it to local storage:
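
Once a file is downloaded, the unique hash listed with the dataset metadata can be used to confirm that the local copy is intact. The sketch below is one way to do that in Python. This guide does not say which hash algorithm ARCH uses, so the algorithm is a required parameter here; the function names are hypothetical, and "sha256" appears only in the self-check at the end.

```python
import hashlib
import os
import tempfile

def file_digest(path, algorithm, chunk_size=1 << 20):
    """Hash a file in chunks so large dataset files need not fit in memory."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()

def matches_listed_hash(path, listed_hash, algorithm):
    """Compare a local file's digest with the hash copied from the detail view."""
    return file_digest(path, algorithm) == listed_hash.strip().lower()

# Self-check on a throwaway file; "sha256" is used here only for the demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"example dataset bytes")
    demo_path = tmp.name
expected = hashlib.sha256(b"example dataset bytes").hexdigest()
demo_ok = matches_listed_hash(demo_path, expected, "sha256")
os.remove(demo_path)
print(demo_ok)
```

For a real dataset, pass the path of the downloaded file, the hash string shown in ARCH, and the name of whichever algorithm that hash turns out to use.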

Jupyter notebooks

Alternatively, click on the "Open in Colab" link to explore web archive collection datasets in a Jupyter notebook hosted by Google Colab. These notebooks were developed for and by the Archives Unleashed project cohorts to explore command line tools for web archive collection data, including statistical, textual, and network analyses.

For an example, see the tutorial: Explore web archive data from the command line with Jupyter Notebooks.

Preview

See a visualization of a selection from your dataset (typically the first 100 rows):

Browse this preview of your full dataset contents in its native CSV or JSON format:
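
A similar preview can be reproduced locally from a downloaded file. The following is a minimal Python sketch that reads up to the first 100 records from CSV or JSON data. It assumes that JSON datasets hold one object per line (JSON Lines), which this guide does not confirm; the function name, column names, and sample values are hypothetical.

```python
import csv
import io
import json
from itertools import islice

def preview_records(handle, fmt, limit=100):
    """Return up to `limit` records from an open text file as dicts."""
    if fmt == "csv":
        records = csv.DictReader(handle)
    elif fmt == "jsonl":
        # Assumption: one JSON object per non-empty line.
        records = (json.loads(line) for line in handle if line.strip())
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return list(islice(records, limit))

# Invented inline data standing in for downloaded dataset files.
csv_rows = preview_records(
    io.StringIO("domain,count\nexample.org,12\nexample.com,7\n"), "csv", limit=1
)
json_rows = preview_records(
    io.StringIO('{"domain": "example.org", "count": 12}\n'), "jsonl"
)
print(csv_rows)
print(json_rows)
```

Because islice stops after the limit is reached, only the start of a large file is read rather than the whole dataset.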

Share a dataset with your team by selecting the team name(s) in the drop-down menu:

To share and cite your dataset publicly, you may add it to archive.org with an ARK identifier:

* Note that ARCH does not yet support hosted Jupyter notebooks for datasets in JSON format.

Advanced use

For more detailed instructions when you are ready, see:

How to create a custom ARCH collection - Narrow the focus of an Archive-It web archive collection and/or combine the contents of multiple collections to create datasets with contents more relevant to your research.
How to clean ARCH datasets - Filter or combine contents to create more specific ARCH datasets from a command line interface.
How to collect new web data with Archive-It - Users of the Internet Archive's web archiving service can create new collections and sync them for ARCH access and use anytime.
How to download and open ARCH dataset files - Reference these detailed instructions to download your ARCH datasets and work with them locally from the command line, desktop, or other web-based programs.
How to publish ARCH datasets to archive.org - Share your dataset on archive.org, add descriptive metadata, and cite it with a unique ARK (Archival Resource Key) identifier.
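
To give a sense of what filtering a dataset involves, here is a minimal Python sketch that keeps only the rows of a CSV matching a condition and writes them back out as CSV. The linked guide works from a command line interface, so treat this as an alternative illustration and not as the method it describes. The column names url and mime_type and the sample rows are hypothetical.

```python
import csv
import io

def filter_rows(rows, column, keep):
    """Yield only the rows whose `column` value satisfies `keep`."""
    for row in rows:
        if keep(row[column]):
            yield row

# Invented inline data standing in for a downloaded dataset file.
SAMPLE = """url,mime_type
https://example.org/a.pdf,application/pdf
https://example.org/index.html,text/html
https://example.com/b.pdf,application/pdf
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
pdf_rows = list(filter_rows(reader, "mime_type", lambda v: v == "application/pdf"))

# Write the filtered rows with the same header as the input.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(pdf_rows)
print(out.getvalue())
```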

Tutorials

Try it yourself: Sample ARCH datasets and how to explore them - See example ARCH datasets from Archive-It web archive collections and use free web-based, desktop, and command line tools to explore their contents.

Tools for working with ARCH datasets

Check out the recommendations for each category to find tools that you can use to explore, parse, analyze, and visualize ARCH dataset contents:

📬 Have a question for our team? Submit a support ticket here.