Introduction
ARCH (Archives Research Compute Hub) is a research and education service that helps users build, access, and analyze digital collections computationally at scale. ARCH is currently configured to produce research-ready datasets from Archive.org and Archive-It collections. Follow the instructions below to find your collections, create and share new datasets, and get technical support when you need it.
ℹ️ Need access to ARCH? Start here to reach our team.
ARCH was made possible in part by funding from the Mellon Foundation and via a long-running collaboration with the Archives Unleashed Project of the University of Waterloo and York University. Developing ARCH to support more users, their collections, and dataset types is made possible with funding from the Institute of Museum and Library Services (LG-254878-OLS-23).
On this page:
- Accessing your ARCH account
- Browsing your web archive collections
- Creating ARCH datasets
- Advanced use and tutorials
Accessing your ARCH account
🔑 Log in to your ARCH account here: https://arch.archive-it.org/
Your ARCH account provides quick access to a dashboard of your source collections and recently created datasets:
Browsing your collections
Click on the Collections link in the top navigation bar anytime to view a complete list of collections to which you have ARCH access:
The table includes the following information about each collection:
- Name: Name of the collection and link to view its full details.
- Type: Source of the collection--from Archive.org, Archive-It, or a custom creation.
- Latest Dataset: The dataset created most recently from this collection and a link to view its full details.
- Dataset Date: The date on which the collection's most recent dataset was created.
- Size: The full data volume of the collection (not the latest dataset).
Select the checkbox next to any collection, then click the Generate Dataset button to create a new dataset at any time.
Collection details
Click on the name of any collection in your table to see the collection's full details and datasets:
Each collection includes an Overview summary with the number of documents or files in the collection, its full data volume, a link to view it publicly, and any description imported from Archive.org or Archive-It.
Web archive collections from Archive-It also include the following information:
- Seeds: The current number of seeds in the Archive-It web archive collection.
- Crawl date: The date on which data was most recently collected from the live web.
- Access: Whether the collection is visible publicly on archive-it.org or private.
The Datasets table at the bottom of this page lists the following information about each dataset created from the collection:
- Name: The name of the dataset type.
- Category: The dataset type's category (Collection, Network, Text, or File Formats).
- State: The current status of the dataset's creation (Queued, In Process, or Finished).
- Started: The date and time at which the dataset was most recently requested.
- Finished: The date and time at which the dataset was most recently created successfully.
- Files: The total number of files in the dataset output.
Creating ARCH datasets
To create a new dataset, start by clicking on the Datasets navigation link at the top of the screen or click on the Generate New Dataset button at the top of any dataset table:
Select the source collection from the drop-down menu, choose a dataset category from the navigation bar, and select the dataset from the menu on the left:
Dataset types
ARCH datasets are organized into four main categories. Learn more about each and the available datasets here: ARCH Datasets.
- Collection - Provide a high-level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
- Network - Explore connections among documents in a web archive collection visually. Learn such things as: which websites have the most in-bound or out-bound links; which paths can be taken through the networks that connect pages; and which communities exist within the same link structure.
- Text - Find, extract, and analyze the text contents of documents, including web pages, characters recognized in images, and speech transcribed from audio/video. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, collocation, n-grams, topic modeling, geoparsing, and more.
- File formats - Find, describe, and use the files contained in a web archive collection. These datasets may be used to extract information about audio, image, PDF, presentation, spreadsheet, video, or word processing files.
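As a rough illustration of the kind of question a Network dataset can answer, the sketch below totals in-bound and out-bound links per website from a small edge list. The column names (`source`, `target`, `count`) and the sample rows are assumptions made for this example, not a confirmed ARCH file layout; adjust them to match the header of the file you download.

```python
import csv
import io
from collections import Counter

# Hypothetical edge list for illustration only; a real ARCH network
# file may use different column names and will be far larger.
SAMPLE_EDGES = """source,target,count
example.org,archive.org,3
example.org,example.com,1
example.com,archive.org,2
"""


def link_totals(csv_text, source_col="source", target_col="target",
                weight_col="count"):
    """Return (outbound, inbound) Counters of link totals per website."""
    outbound, inbound = Counter(), Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        weight = int(row[weight_col])
        outbound[row[source_col]] += weight  # links leaving the source
        inbound[row[target_col]] += weight   # links arriving at the target
    return outbound, inbound


outbound, inbound = link_totals(SAMPLE_EDGES)
print("Most in-bound links:", inbound.most_common(1))
print("Most out-bound links:", outbound.most_common(1))
```

For path and community analysis, a dedicated graph tool is usually a better fit than hand-written counting; see the tool recommendations listed at the end of this guide.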
You may toggle the Sample option to include only the first 100 records from the collection in your dataset.
Dataset details
Find, use, and share any complete ARCH dataset by clicking on its name in the account's or any collection's Datasets table:
From each dataset's* detail view, you may:
Download and explore
Find metadata about your dataset's name, size, length, creation date, and unique hash, and a link to download it to local storage. Alternatively, load your dataset dynamically into an existing Jupyter Notebook hosted by Google Colab and explore its contents with popular command line tools:
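After downloading a dataset to local storage, you may want to confirm that the file matches the hash shown in its metadata. The following is a minimal sketch of that check. This guide does not state which hash algorithm ARCH reports, so the `algorithm` parameter (defaulting here to `sha256`) is an assumption to adjust, and `file_digest` and `matches_expected` are hypothetical helper names.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory.

    The default algorithm is an assumption; use whichever algorithm
    corresponds to the hash displayed for your dataset.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with the hash copied from the metadata."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches_expected("my-dataset.csv", "<hash copied from the detail view>")` returns `True` only when the local file is byte-for-byte identical to the file that produced that hash; the file name here is a placeholder.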
Preview
See a visualization of a selection from your dataset (typically the first 100 rows):
Browse this preview of your full dataset contents in its native CSV or JSON format:
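A similar preview can be produced locally once a CSV dataset has been downloaded. The sketch below reads only the first rows of a file rather than loading all of it. The column names and sample values are invented for illustration, and `preview_rows` is a hypothetical helper, not part of ARCH.

```python
import csv
import io
from itertools import islice


def preview_rows(handle, limit=100):
    """Return up to `limit` rows from an open CSV file as dictionaries."""
    return list(islice(csv.DictReader(handle), limit))


# Hypothetical stand-in for a downloaded file; with a real dataset, use
# open("your-dataset.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("domain,count\nexample.org,12\nexample.com,7\n")
for row in preview_rows(sample, limit=1):
    print(row)
```

Because `islice` stops after `limit` rows, the rest of a large file is never read, which keeps the preview fast even for datasets of several gigabytes.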
Share and publish
Share a dataset with your team by selecting the team name(s) in the drop-down menu:
To share and cite your dataset publicly, you may add it to archive.org with an ARK identifier:
* Note that ARCH does not yet support hosted Jupyter notebooks for datasets in JSON format.
Advanced use
For more detailed instructions when you are ready, see:
- How to create a custom ARCH collection - Focus your web archive collections and/or combine the contents from multiple collections to create datasets with more relevant contents for your research.
- How to clean ARCH datasets - Filter or combine contents to create more specific ARCH datasets from a command line interface.
- How to collect new web data with Archive-It - Users of the Internet Archive's web archiving service can create new collections and sync them for ARCH access and use anytime.
- How to download and open ARCH dataset files - Reference these detailed instructions to download your ARCH datasets and work with them locally from the command line, desktop, or other web-based programs.
- How to publish ARCH datasets to archive.org - Share your dataset on archive.org, add descriptive metadata, and cite it with a unique ARK (Archival Resource Key) identifier.
Tutorials
- Try it yourself: Sample ARCH datasets and how to explore them - See example ARCH datasets from Archive-It web archive collections and use free web-based, desktop, and command line tools to explore their contents.
Tools for working with ARCH datasets
Check out the recommendations for each category to find tools that you can use to explore, parse, analyze, and visualize ARCH dataset contents:
- ARCH Network datasets/Tool recommendations
- ARCH Text datasets/Tool recommendations
- ARCH File format datasets/Tool recommendations
📬 Have a question for our team? Submit a support ticket here.