Quick guide to using ARCH

Introduction

ARCH (Archives Research Compute Hub) is a research and education service that helps users build, access, and analyze digital collections computationally at scale. ARCH is currently configured to produce research-ready datasets from Archive-It and custom web archive collections. Follow the instructions below to find your collections, create and share new datasets, and get technical support when you need it.

ℹ️ Need access to ARCH? Start here to reach our team.

ARCH was made possible in part by funding from the Mellon Foundation and via a long-running collaboration with the Archives Unleashed Project of the University of Waterloo and York University.

On this page:

  1. Accessing your ARCH account
  2. Browsing your web archive collections
  3. Creating ARCH datasets
  4. Advanced use and tutorials

Accessing your ARCH account

🔑 Log in to your ARCH account here: https://arch.archive-it.org/.

Your ARCH account provides quick access to a dashboard of your source web archive collections and recently created datasets:

ARCH-Dashboard.png

Browsing your web archive collections

Click the Collections link in the top navigation bar at any time to view a complete list of the web archive collections to which you have ARCH access:

ARCH-Collections.png

The table includes the following information about each collection:

  • Name: Name of the web archive collection and link to view its full details.
  • Public: Whether or not the collection is publicly visible on archive-it.org (Yes/No).
  • Latest Dataset: The dataset created most recently from this collection and a link to view its full details.
  • Dataset Date: The date on which the collection's most recent dataset was created.
  • Size: The full volume of the web archive collection (not the latest dataset).

To create a new dataset at any time, select the checkbox next to any collection and then click the Generate Dataset button.

Collection details

Click on the name of any web archive collection in your table to see the collection's full details and datasets:

ARCH-Collections_Detail.png

Each collection includes an Overview summary with the following information:

  • Seeds: The current number of seeds in the Archive-It web archive collection.
  • Crawl date: The date on which data was most recently collected from the live web.
  • Data: The current, full data volume of WARC files in this collection.
  • Access: Whether the collection is visible publicly on archive-it.org or private.
  • Public Collection Link: URL to view the collection on archive-it.org if it is public.

The Datasets table at the bottom of this page lists the following information about each dataset created from the collection:

  • Name: The name of the dataset type.
  • Category: The dataset type's category (Collection, Network, Text, or File Formats).
  • State: The current status of the dataset's creation (Queued, In Process, or Finished).
  • Started: The date and time at which the dataset was most recently requested.
  • Finished: The date and time at which the dataset was most recently created successfully.
  • Files: The number of total files in the dataset.

Creating ARCH datasets

To create a new dataset, start by clicking on the Datasets navigation link at the top of the screen or click on the Generate New Dataset button at the top of any dataset table:

ARCH1-01_Quick-guide-creating-datasets.png

Select the source collection from the drop-down menu, choose a dataset category from the navigation bar, and select the dataset from the menu on the left.

Dataset types

ARCH datasets are organized into four main categories. Learn more about each and the available datasets here: ARCH Datasets.

  1. Collection - Provide a high-level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
  2. Network - Explore connections among documents in a web archive visually. Learn such things as: which websites have the most in-bound or out-bound links; which paths can be taken through the networks that connect pages; and which communities exist within the same link structure.
  3. Text - Find, extract, and analyze the text contents of HTML pages and other documents. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, collocation, n-grams, topic modeling, geoparsing, and more.
  4. File Formats - Find, describe, and use the files contained in a web archive. These datasets may be used to extract information about audio, image, PDF, presentation, spreadsheet, video, or word processing files.
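As a rough illustration of the kind of question a Network dataset can answer (which websites have the most in-bound or out-bound links), here is a minimal Python sketch. It assumes a simple edge-list CSV with hypothetical column names (source, target, count) and made-up sample rows; check the actual columns of your own dataset before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample of a domain-level edge list. The column names
# ("source", "target", "count") are assumptions, not a documented schema.
SAMPLE_EDGES = """source,target,count
example.org,archive.org,12
example.org,wikipedia.org,3
news.example.com,archive.org,7
archive.org,wikipedia.org,1
"""


def link_totals(csv_text):
    """Return (outbound, inbound) Counters of link totals per domain."""
    outbound = Counter()
    inbound = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        weight = int(row["count"])
        outbound[row["source"]] += weight  # links leaving the source
        inbound[row["target"]] += weight   # links arriving at the target
    return outbound, inbound


outbound, inbound = link_totals(SAMPLE_EDGES)
print("Most linked-to:", inbound.most_common(1))
print("Most linking:", outbound.most_common(1))
```

For larger graphs, or for questions about paths and communities, a dedicated network analysis tool is likely a better fit than hand-rolled counting.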

You may toggle the Sample option to include only the first 100 records from the collection in your dataset.

Dataset details

Find, use, and share any complete ARCH dataset by clicking on its name in your account's Datasets table or in any collection's Datasets table:

ARCH-Datasets_Name.png

In each dataset's* detail view, you may:

Download and explore

Find metadata about your dataset's name, size, length, creation date, and unique hash, along with a link to download it to local storage. Alternatively, load your dataset dynamically into an existing Jupyter Notebook hosted by Google Colab and explore its contents with popular command-line tools:

ARCH-Datasets_Download.png
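If the hash shown in the detail view is a checksum of the file contents (this guide does not say which algorithm is used, so the sha256 default below is only an assumption), a downloaded copy could be compared against it along these lines. The helper names are hypothetical and the demonstration uses a throwaway file in place of a real download.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_hash(path, listed_hash, algorithm="sha256"):
    """Compare the local digest with the value copied from the detail view."""
    return file_digest(path, algorithm) == listed_hash.strip().lower()


# Demonstration with a throwaway file standing in for a downloaded dataset.
content = b"domain,count\nexample.org,3\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(content)
    demo_path = tmp.name

expected = hashlib.sha256(content).hexdigest()
demo_ok = matches_listed_hash(demo_path, expected)
os.remove(demo_path)
print("Download intact:", demo_ok)
```

Pass a different algorithm name (for example "md5") if the listed value has a different length than the digest computed here.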

Preview

See a visualization of a selection from your dataset (typically the first 100 rows):

ARCH-Datasets_Preview.png

Browse this preview of your full dataset contents in rows and columns:

ARCH1-05_Dataset-preview.png
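A similar first look can be taken at a downloaded file on your own machine. The sketch below reads the header and the first few rows of a CSV without loading the whole file; the sample data and column names are made up for illustration, and the function name is hypothetical.

```python
import csv
import io
from itertools import islice


def preview_rows(lines, limit=100):
    """Return the header and up to `limit` data rows from CSV lines."""
    reader = csv.reader(lines)
    header = next(reader, [])
    return header, list(islice(reader, limit))


# Hypothetical stand-in for an opened dataset file. Any iterable of text
# lines works, e.g. open(path, newline="") or gzip.open(path, "rt").
sample = io.StringIO(
    "domain,count\nexample.org,3\narchive.org,19\nwikipedia.org,4\n"
)
header, rows = preview_rows(sample, limit=2)
print(header)
for row in rows:
    print(row)
```

Because islice stops after the requested number of rows, this stays fast even on very large files.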

Share and publish

Share a dataset with your team by selecting the team name(s) in the drop-down menu.

To share and cite your dataset publicly, you may add it to archive.org with an ARK identifier:

ARCH-Datasets_Publish.png

* Note that three datasets do not yet include previews or Google Colab notebooks: web archive transformation (WAT), named entity, and longitudinal graph analysis (LGA).

Advanced use

For more detailed instructions when you are ready, see:

Tutorials

Tools for working with ARCH datasets

Check out the recommendations for each category to find tools that you can use to explore, parse, analyze, and visualize ARCH dataset contents.

📬 Have a question for our team? Submit a support ticket here.
