This is a guide to samples and tutorials that demonstrate the exploratory computational analysis of web archives using datasets from the Archives Research Compute Hub (ARCH). You do not need an ARCH account to try these tutorials. Download the sample datasets and follow the instructions below to explore and visualize web archive collections as networks, text, and file repositories.
📥 Download the ARCH-workshop.zip archive (~30MB uncompressed) to find all of the data needed to complete the following tutorials and try your own analyses.
The samples for these tutorials were derived from the Archive-It web archive collection:
- Art Galleries - Art gallery, exhibition, and dealer websites archived by the Collaborative ART Archive (CARTA) web archiving consortium.
All data were collected by institutional partners, preserved by the Internet Archive’s Archive-It web archiving service, and derived into datasets for analysis with the ARCH platform for research services.
In this section:
- Visualize connections between web domains with RAWGraphs
- Graph a network of web domains with Palladio
- Graph a network of web domains with Gephi
- Browse images from a web archive with Palladio
- Mine text from a web archive with Voyant
- Explore web archive data from the command line with Jupyter Notebooks
Visualize connections between web domains with RAWGraphs
How might we begin to understand the relationships between sources of data in a web archive? Unlike titles on a shelf, web resources are organized and interdependent in ways that can be hidden from the reader. Get a sense of scale and contents by plotting the domains in a web archive collection as sources and targets. Try it now >>>
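As a rough illustration of what "sources and targets" means in practice, the sketch below sums link counts for each source-target pair in a small domain graph. The column names (`crawl_date`, `source`, `target`, `count`) and the sample rows are assumptions made for illustration only; check the header of the CSV in the workshop download before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The column names are an assumption about the
# domain graph file, not a documented guarantee.
SAMPLE_CSV = """crawl_date,source,target,count
20210101,gallery-a.example,museum-b.example,12
20210201,gallery-a.example,museum-b.example,3
20210101,gallery-a.example,dealer-c.example,5
20210101,dealer-c.example,museum-b.example,7
"""


def aggregate_edges(csv_file):
    """Sum link counts for each (source, target) pair across crawl dates."""
    totals = Counter()
    for row in csv.DictReader(csv_file):
        totals[(row["source"], row["target"])] += int(row["count"])
    return totals


edges = aggregate_edges(io.StringIO(SAMPLE_CSV))

# List the heaviest connections first.
for (source, target), weight in edges.most_common():
    print(f"{source} -> {target}: {weight}")
```

To run this against a real file, pass an open file handle (for example `open("domain-graph.csv", newline="")`, a hypothetical filename) in place of the `io.StringIO` wrapper.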
Graph a network of web domains with Palladio
Map a collection of domains and trace their intersections in a web archive collection as a network graph. Try it now >>>
Graph a network of web domains with Gephi
Gephi is a popular desktop platform for creating network graph visualizations. Use it to plot and scale a collection of domains from a web archive collection. Try it now >>>
Browse images from a web archive with Palladio
Web archives contain myriad forms of expression beyond text. Aggregate their media to make browsing more instinctive and intuitive than a search box allows. Try it now >>>
Mine text from a web archive with Voyant
Web archives can be read at the collection level in order to surface the broader themes, topics, people, and places that they include or share. We can use or adapt natural language processing tools to read these texts from a distance, as they already do for books, journals, and other corpora. Try it now >>>
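One simple form of reading from a distance is counting which terms recur across many pages. The sketch below is a minimal, standard-library illustration of that idea and is not how Voyant works internally; the sample page texts, the stopword list, and the `top_terms` helper are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real analyses use much longer ones.
STOPWORDS = {"the", "and", "of", "a", "in", "to", "at", "is"}


def top_terms(texts, n=3):
    """Return the n most frequent non-stopword terms across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts.most_common(n)


# Hypothetical page texts standing in for extracted web archive text.
pages = [
    "The gallery opens a new exhibition of paintings.",
    "Paintings and sculpture at the gallery.",
    "A dealer lists paintings in the exhibition catalog.",
]

print(top_terms(pages))
```

Even a count this crude hints at the themes a collection shares, which is the kind of signal the text mining tutorial explores with richer tools.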
Explore web archive data from the command line with Jupyter Notebooks
Browser-based tools like the above help to examine and visualize relatively small samples of data. Analyzing full datasets from web archive collections at scale can require more computing power, command line tools, and custom code refinements. Jupyter Notebooks provide an opportunity to demonstrate and even modify these manual processes quickly and easily. Try it now >>>