Try it yourself: Sample ARCH datasets and how to explore them

Karl Blumenthal

Updated April 07, 2025 17:57

Introduction

This is a guide to samples and tutorials that demonstrate the exploratory computational analysis of web archives using datasets from the Archives Research Compute Hub (ARCH). You do not need an ARCH account to try these tutorials. Download the sample datasets and follow the instructions below to explore and visualize web archive collections as networks, text, and file repositories.

Materials

📥 Download the ARCH-workshop.zip archive (<30MB uncompressed) to find all of the data needed to complete the following tutorials and try your own analyses.

The samples for these tutorials were derived from the Archive-It web archive collection:

Art Galleries - Art gallery, exhibition, and dealer websites archived by the Collaborative ART Archive (CARTA) web archiving consortium.

All data were collected by institutional partners, preserved by the Internet Archive’s Archive-It web archiving service, and derived into datasets for analysis with the ARCH platform for research services.

Tutorials

In this section:

Visualize connections between web domains with RAWGraphs
Graph a network of web domains with Palladio
Graph a network of web domains with Gephi
Graph a network of web domains with Gephi Lite
Browse images from a web archive with Palladio
Mine text from a web archive with Voyant
Explore web archive data from the command line with Jupyter Notebooks

Visualize connections between web domains with RAWGraphs

How might we begin to understand the relationships between sources of data in a web archive? Unlike titles on a shelf, web resources are organized and interdependent in ways that can be hidden from the reader. Get a sense of scale and contents by plotting the domains in a web archive collection as sources and targets. Try it now >>>
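Before plotting, it can help to see what "sources and targets" look like as data. The following is a minimal sketch, not the tutorial's own method: it assumes a domain graph CSV with `crawl_date`, `source`, `target`, and `count` columns, and uses made-up sample rows in place of the real file. Check the headers in your downloaded sample and adjust the names if they differ.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows standing in for a domain graph file.
# The column names (crawl_date, source, target, count) are an assumption;
# confirm them against the header row of the file you downloaded.
SAMPLE_CSV = """crawl_date,source,target,count
20230101,gallery-a.example,museum-b.example,4
20230101,gallery-a.example,dealer-c.example,2
20230201,gallery-a.example,museum-b.example,3
20230201,dealer-c.example,museum-b.example,1
"""


def aggregate_edges(csv_text):
    """Sum link counts for each (source, target) pair across crawl dates."""
    edges = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        edges[(row["source"], row["target"])] += int(row["count"])
    return edges


edges = aggregate_edges(SAMPLE_CSV)

# List the heaviest connections first.
for (source, target), weight in edges.most_common():
    print(f"{source} -> {target}: {weight}")
```

To run this against a real file, you could replace `io.StringIO(csv_text)` with an open file handle. The summed weights are the same kind of values that a visualization tool would use to size the connections between domains.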

Graph a network of web domains with Palladio

Map a collection of domains and trace their intersections in a web archive collection as a network graph. Try it now >>>

Graph a network of web domains with Gephi

Gephi is a popular desktop platform for creating network graph visualizations. Use it to plot and scale a collection of domains from a web archive collection. Try it now >>>

Graph a network of web domains with Gephi Lite

Use Gephi Lite, a limited version of the popular desktop visualization program Gephi, to visualize domains using only a web browser. Try it now >>>

Browse images from a web archive with Palladio

Web archives contain myriad forms of expression beyond text. Aggregate their media to enable more instinctive and intuitive access, outside of the search box. Try it now >>>

Mine text from a web archive with Voyant

Web archives can be read at the collection level in order to surface the broader themes, topics, people, and places that they include or share. We can use or adapt natural language processing tools to read these texts from a distance, just as those tools already read books, journals, and other corpora. Try it now >>>
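As a rough illustration of what reading "from a distance" means, the sketch below counts the most frequent terms in a short passage. It is not how Voyant works internally; the sample text and the tiny stop word list are invented for the example, and a real analysis would use the text extracted from the collection and a fuller stop word list.

```python
import re
from collections import Counter

# Hypothetical stand-in for text extracted from archived pages.
SAMPLE_TEXT = """
The gallery opens a new exhibition of paintings this spring.
The exhibition features paintings and sculpture from the gallery collection.
"""

# A tiny illustrative stop word list, not a complete one.
STOP_WORDS = {"the", "a", "of", "and", "from", "this"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)


freqs = term_frequencies(SAMPLE_TEXT)

# Show the three most frequent terms and their counts.
print(freqs.most_common(3))
```

Simple counts like these are a starting point for spotting recurring themes across many pages before moving on to richer tools.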

Explore web archive data from the command line with Jupyter Notebooks

Browser-based tools like the above help to examine and visualize relatively small samples of data. Analyzing full datasets from web archive collections at scale can require more computing power, command line tools, and custom code refinements. Jupyter Notebooks provide an opportunity to demonstrate and even modify these manual processes quickly and easily. Try it now >>>