Quick guide to using ARCH

Introduction

ARCH (Archives Research Compute Hub) is a research and education service that helps users build, access, and analyze digital collections computationally at scale. ARCH is currently configured to produce research-ready datasets from Archive.org and Archive-It collections. Follow the instructions below to find your collections, create and share new datasets, and get technical support when you need it.

ℹ️ Need access to ARCH? Start here to reach our team.

ARCH was made possible in part by funding from the Mellon Foundation and by a long-running collaboration with the Archives Unleashed Project of the University of Waterloo and York University. Developing ARCH to support more users, their collections, and dataset types is made possible with funding from the Institute of Museum and Library Services (LG-254878-OLS-23).

On this page:

  1. Accessing your ARCH account
  2. Browsing your collections
  3. Creating ARCH datasets
  4. Advanced use and tutorials

Accessing your ARCH account

🔑 Log in to your ARCH account here: https://arch.archive-it.org/.

Your ARCH account provides quick access to a dashboard of your source collections and recently created datasets:

ARCH-Dashboard.png

Browsing your collections

Click on the Collections link in the top navigation bar at any time to view a complete list of the collections to which you have ARCH access:

ARCH-Collections.png

The table includes the following information about each collection:

  • Name: Name of the collection and link to view its full details.
  • Type: Source of the collection (Archive.org, Archive-It, or a custom creation).
  • Latest Dataset: The dataset created most recently from this collection and a link to view its full details.
  • Dataset Date: The date on which the collection's most recent dataset was created.
  • Size: The full data volume of the collection (not the latest dataset).

To create a new dataset at any time, select the checkbox next to any collection and then click the Generate Dataset button.

Collection details

Click on the name of any collection in your table to see the collection's full details and datasets:

ARCH_05-IA-collection.png

Each collection includes an Overview summary with the number of documents or files in the collection, its full data volume, a link to view it publicly, and any description imported from Archive.org or Archive-It.

Web archive collections from Archive-It also include the following information:

  • Seeds: The current number of seeds in the Archive-It web archive collection.
  • Crawl date: The date on which data was most recently collected from the live web.
  • Access: Whether the collection is visible publicly on archive-it.org or private.

The Datasets table at the bottom of this page lists the following information about each dataset created from the collection:

  • Name: The name of the dataset type.
  • Category: The dataset type's category (Collection, Network, Text, or File Formats).
  • State: The current status of the dataset's creation (Queued, In Process, or Finished).
  • Started: The date and time at which the dataset was most recently requested.
  • Finished: The date and time at which the dataset was most recently created successfully.
  • Files: The total number of files in the dataset output.

Creating ARCH datasets

To create a new dataset, start by clicking either the Datasets navigation link at the top of the screen or the Generate New Dataset button at the top of any dataset table:

ARCH1-01_Quick-guide-creating-datasets.png

Select the source collection from the drop-down menu, choose a dataset category from the navigation bar, and select the dataset from the menu on the left:

Dataset types

ARCH datasets are organized into four main categories. Learn more about each and the available datasets here: ARCH Datasets.

  1. Collection - Provide a high-level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
  2. Network - Explore connections among documents in a web archive collection visually. Learn such things as: which websites have the most in-bound or out-bound links; which paths can be taken through the networks that connect pages; and which communities exist within the same link structure.
  3. Text - Find, extract, and analyze the text contents of documents, including web pages, characters recognized in images, and speech transcribed from audio/video. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, collocation, n-grams, topic modeling, geoparsing, and more.
  4. File Formats - Find, describe, and use the files contained in a web archive collection. These datasets may be used to extract information about audio, image, PDF, presentation, spreadsheet, video, or word processing files.

You may toggle the Sample option to include only the first 100 records from the collection in your dataset.
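
As an illustration of the kind of question a Network dataset can answer (which websites have the most in-bound or out-bound links), here is a minimal Python sketch that tallies links from an edge list in CSV form. The column names "source" and "target" and the sample rows are assumptions made for this example, not the documented ARCH schema; check the header of your own file and adjust the names accordingly.

```python
import csv
import io
from collections import Counter

# Hypothetical edge list: the column names "source" and "target" and the
# rows below are assumptions for illustration, not the ARCH schema.
SAMPLE_EDGES = """source,target
example.org,archive.org
example.org,wikipedia.org
news.example.com,archive.org
archive.org,wikipedia.org
"""


def link_counts(csv_file):
    """Return (out_bound, in_bound) Counters keyed by domain."""
    out_bound, in_bound = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        out_bound[row["source"]] += 1  # one more link leaving this domain
        in_bound[row["target"]] += 1   # one more link arriving at this domain
    return out_bound, in_bound


out_bound, in_bound = link_counts(io.StringIO(SAMPLE_EDGES))
print("Most in-bound links:", in_bound.most_common(2))
print("Most out-bound links:", out_bound.most_common(1))
```

To run this against a downloaded file, pass an open file handle (for example, open("your-file.csv", newline="")) in place of the in-memory sample.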

Dataset details

Find, use, and share any complete ARCH dataset by clicking on its name in your account's Datasets table or in any collection's Datasets table:

ARCH-Datasets_Name.png

From each dataset's detail view* you may:

Download and explore

Find metadata for your dataset (its name, size, length, creation date, and unique hash) and a link to download it to local storage. Alternatively, load your dataset dynamically into an existing Jupyter Notebook hosted by Google Colab and explore its contents with popular command-line tools:

ARCH-Datasets_Download.png
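
After downloading a dataset, you may want to compare the local copy against the size, length, and hash shown in its detail view. The following is a minimal Python sketch of that check, under stated assumptions: this guide does not say which hash algorithm ARCH uses or whether "length" counts the header row, so the algorithm is left as a parameter (defaulting to SHA-256 purely as a placeholder) and the function simply counts every line. The helper name describe_file and the sample file contents are hypothetical.

```python
import hashlib
import os
import tempfile


def describe_file(path, algorithm="sha256"):
    """Return the size in bytes, line count, and hex digest of a local file.

    The default algorithm is an assumption; use whichever one matches the
    hash displayed for your dataset.
    """
    digest = hashlib.new(algorithm)
    lines = 0
    with open(path, "rb") as handle:
        for line in handle:  # read line by line so large files fit in memory
            digest.update(line)
            lines += 1
    return {
        "size": os.path.getsize(path),
        "lines": lines,
        "hash": digest.hexdigest(),
    }


# Stand-in for a downloaded dataset file (contents are made up).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"domain,count\nexample.org,3\narchive.org,5\n")
    demo_path = tmp.name

info = describe_file(demo_path)
print(info)
os.remove(demo_path)
```

If the values differ from those in the detail view, the download may be incomplete, or the assumptions above (algorithm, header counting) may not match how ARCH reports them.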

Preview

See a visualization of a selection from your dataset (typically the first 100 rows):

ARCH-Datasets_Preview.png

Browse this preview of your full dataset contents in its native CSV or JSON format:

ARCH1-05_Dataset-preview.png
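
A similar preview can be reproduced locally on a downloaded CSV file. The sketch below reads only the first 100 data rows without loading the whole file; the column names "url" and "mime_type" and the generated rows are assumptions for illustration, and the helper name preview_rows is hypothetical. It covers CSV only, since this guide does not describe the layout of the JSON outputs.

```python
import csv
import io
from itertools import islice


def preview_rows(csv_file, limit=100):
    """Return up to `limit` data rows (header excluded) as dictionaries."""
    return list(islice(csv.DictReader(csv_file), limit))


# Made-up sample with 250 data rows standing in for a downloaded dataset.
sample = io.StringIO(
    "url,mime_type\n"
    + "".join(f"https://example.org/{i},text/html\n" for i in range(250))
)

rows = preview_rows(sample)
print(len(rows), "rows previewed; first row:", rows[0])
```

Because islice stops after the requested number of rows, the rest of the file is never read, which keeps the preview fast even for large datasets.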

Share and publish

Share a dataset with your team by selecting the team name(s) in the drop-down menu:

To share and cite your dataset publicly, you may add it to archive.org with an ARK identifier:

ARCH-Datasets_Publish.png

* Note that ARCH does not yet support hosted Jupyter notebooks for datasets in JSON format.

Advanced use

For more detailed instructions when you are ready, see:

Tutorials

Tools for working with ARCH datasets

Check out the recommendations for each category to find tools that you can use to explore, parse, analyze, and visualize ARCH dataset contents:

📬 Have a question for our team? Submit a support ticket here.
