Introduction
ARCH (Archives Research Compute Hub) is a research and education service that helps users build, access, and analyze digital collections computationally at scale. ARCH is currently configured to produce research-ready datasets from Archive-It and custom web archive collections. Follow the instructions below to find your collections, create and share new datasets, and get technical support when you need it.
ℹ️ Need access to ARCH? Start here to reach our team.
ARCH was made possible in part by funding from the Mellon Foundation and via a long-running collaboration with the Archives Unleashed Project of the University of Waterloo and York University.
On this page:
- Accessing your ARCH account
- Browsing your web archive collections
- Creating ARCH datasets
- Advanced use and tutorials
Accessing your ARCH account
🔑 Log in to your ARCH account here: https://arch.archive-it.org/
Your ARCH account provides quick access to a dashboard of your source web archive collections and recently created datasets:
Browsing your web archive collections
Click on the Collections link in the top navigation bar anytime to view a complete list of web archive collections to which you have ARCH access:
The table includes the following information about each collection:
- Name: Name of the web archive collection and link to view its full details.
- Public: Whether or not the collection is visible publicly on archive-it.org (Yes/No)
- Latest Dataset: The dataset created most recently from this collection and a link to view its full details.
- Dataset Date: The date on which the collection's most recent dataset was created.
- Size: The full volume of the web archive collection (not the latest dataset).
To create a new dataset at any time, select the checkbox next to any collection and then click the Generate Dataset button.
Collection details
Click on the name of any web archive collection in your table to see the collection's full details and datasets:
Each collection includes an Overview summary with the following information:
- Seeds: The current number of seeds in the Archive-It web archive collection.
- Crawl date: The date on which data was most recently collected from the live web.
- Data: The current, full data volume of WARC files in this collection.
- Access: Whether the collection is visible publicly on archive-it.org or private.
- Public Collection Link: URL to view the collection on archive-it.org if it is public.
The Datasets table at the bottom of this page lists the following information about each dataset created from the collection:
- Name: The name of the dataset type.
- Category: The dataset type's category (Collection, Network, Text, or File Formats).
- State: The current status of the dataset's creation (Queued, In Process, or Finished).
- Started: The date and time at which the dataset was most recently requested.
- Finished: The date and time at which the dataset was most recently created successfully.
- Files: The number of total files in the dataset.
Creating ARCH datasets
To create a new dataset, click on the Datasets link in the top navigation bar or on the Generate New Dataset button at the top of any dataset table:
Select the source collection from the drop-down menu, choose a dataset category from the navigation bar, and select the dataset from the menu on the left:
Dataset types
ARCH datasets are organized into four main categories. Learn more about each and the available datasets here: ARCH Datasets.
- Collection - Provide a high level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
- Network - Explore connections among documents in a web archive visually. Learn such things as: which websites have the most in-bound or out-bound links; which paths can be taken through the networks that connect pages; and which communities exist within the same link structure.
- Text - Find, extract, and analyze the text contents of HTML pages and other documents. These files can be used alongside a number of different text analysis methods and techniques including sentiment analysis, named entity recognition, collocation, n-grams, topic modeling, geoparsing, and more.
- File formats - Find, describe, and use the files contained in a web archive. These datasets may be used to extract information about audio, image, PDF, presentation, spreadsheet, video, or word processing files.
You may toggle the Sample option to include only the first 100 records from the collection in your dataset.
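As one illustration of the Network category described above, the sketch below counts out-bound and in-bound links per website from a list of (source, target) pairs. The edge list and the `link_counts` helper are hypothetical and only for illustration; actual ARCH network files may use different columns and formats.

```python
from collections import Counter


def link_counts(edges):
    """Count out-bound and in-bound links per site from (source, target) pairs."""
    outbound = Counter()
    inbound = Counter()
    for source, target in edges:
        outbound[source] += 1  # the source site links out
        inbound[target] += 1   # the target site is linked to
    return outbound, inbound


# Hypothetical edge list, not taken from a real ARCH dataset.
edges = [
    ("example.org", "archive.org"),
    ("example.org", "wikipedia.org"),
    ("wikipedia.org", "archive.org"),
]

outbound, inbound = link_counts(edges)
print(outbound.most_common(1))  # site with the most out-bound links
print(inbound.most_common(1))   # site with the most in-bound links
```

With real data, the pairs would be read from the downloaded network dataset rather than written by hand.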
Dataset details
Find, use, and share any complete ARCH dataset by clicking on its name in the account's or any collection's Datasets table:
With each dataset* detail view you may:
Download and explore
Find metadata about your dataset's name, size, length, creation date, and unique hash, and a link to download it to local storage. Alternatively, load your dataset dynamically into an existing Jupyter Notebook hosted by Google Colab and explore its contents with popular command line tools:
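After downloading a dataset, you may want to confirm that the local copy is intact. The sketch below assumes that the hash shown in the dataset metadata is a checksum of the downloaded file, and that you know which algorithm produced it; this page does not state the algorithm, so the `sha256` default here is only a placeholder. The helper names `file_digest` and `matches_listed_hash` are hypothetical.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use.

    The algorithm is an assumption: replace "sha256" with whichever
    algorithm matches the hash shown for your dataset.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_listed_hash(path, listed_hash, algorithm="sha256"):
    """Compare a local file's digest with the hash copied from the dataset page."""
    return file_digest(path, algorithm) == listed_hash.strip().lower()
```

For example, `matches_listed_hash("my-dataset.csv", "<hash copied from the dataset page>")` would return `True` only when the two values agree; the file name here is a placeholder.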
Preview
See a visualization of a selection from your dataset (typically the first 100 rows):
Browse this preview of your full dataset contents in rows and columns:
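A similar preview can be produced locally once a dataset has been downloaded. This minimal sketch assumes the dataset is a UTF-8 CSV file with a header row, which may not hold for every ARCH dataset type; the `preview_rows` helper is hypothetical.

```python
import csv
from itertools import islice


def preview_rows(path, limit=100):
    """Return up to `limit` rows from a CSV file as dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # islice stops after `limit` rows, so large files are not read in full.
        return list(islice(reader, limit))
```

Calling `preview_rows("my-dataset.csv")` on a placeholder file name would return the first 100 rows, mirroring the size of the preview described above.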
Share and publish
Share a dataset with your team by selecting the team name(s) in the drop-down menu:
To share and cite your dataset publicly, you may add it to archive.org with an ARK identifier:
* Note that three datasets do not yet include previews or Google Colab notebooks: WAT, named entity, and LGA.
Advanced use
When you are ready for more detailed instructions, see:
- How to create a custom ARCH collection - Focus your collections and/or combine the contents from multiple collections to create datasets with more relevant contents for your research.
- How to clean ARCH datasets - Filter or combine contents to create more specific ARCH datasets from a command line interface.
- How to collect new web data with Archive-It - Users of the Internet Archive's web archiving service can create new collections and sync them for ARCH access and use anytime.
- How to download and open ARCH dataset files - Reference these detailed instructions to download your ARCH datasets and work with them locally from the command line, desktop, or other web-based programs.
- How to publish ARCH datasets to archive.org - Share your dataset on archive.org, add descriptive metadata, and cite it with a unique ARK (Archival Resource Key) identifier.
Tutorials
- Try it yourself: Sample ARCH datasets and how to explore them - See example ARCH datasets from Archive-It web archive collections and use free web-based, desktop, and command line tools to explore their contents.
Tools for working with ARCH datasets
Check out the recommendations for each category to find tools that you can use to explore, parse, analyze, and visualize ARCH dataset contents:
- ARCH Network datasets/Tool recommendations
- ARCH Text datasets/Tool recommendations
- ARCH File format datasets/Tool recommendations
📬 Have a question for our team? Submit a support ticket here.