Introduction
Browser-based tools like those included in the above tutorials can help you examine and visualize relatively small samples of data. Analyzing full ARCH datasets from web archive collections at scale can require more computing power, command line tools, and custom code.
Jupyter notebooks make it easy to demonstrate, and even modify, these workflows step by step. Follow the instructions below to practice using these notebooks to run popular Python libraries for text analysis and visualization, named entity recognition, and sentiment analysis.
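As a taste of the kind of text analysis the notebooks walk through, here is a minimal, hypothetical sketch that tokenizes a short passage and tabulates word frequencies with pandas. The sample string is a stand-in for one page of extracted text; the real notebooks load the full plain-text dataset file instead.

```python
import re
from collections import Counter

import pandas as pd

# Stand-in for one document from an ARCH plain-text dataset.
sample_text = (
    "The gallery hosts rotating exhibits of contemporary art. "
    "Gallery hours and exhibit schedules are posted online."
)

# Lowercase the text and split on non-letter characters, dropping empties.
tokens = [t for t in re.split(r"[^a-z]+", sample_text.lower()) if t]

# Tabulate counts in a DataFrame, most frequent token first.
freq = pd.DataFrame(Counter(tokens).most_common(), columns=["token", "count"])
print(freq.head())
```

The same counting pattern scales from one page to an entire collection, and the resulting frequency table is the usual input for a word cloud.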
Used in this tutorial:
- Dataset: Plain text of web pages from the Art Galleries web archive collection, Baltimore City detailed coverage
- Tools: pandas, nltk, spaCy, vaderSentiment, and word_cloud
- Time to complete: ~15-25 minutes
Thanks to Nick Ruest and the Archives Unleashed Project for authoring this notebook and the others that are automatically linked from each ARCH user's dataset detail view for guided exploration.
Instructions
Click the Open in Colab button to view and execute the sample code step by step, or preview the results in the completed version of the notebook below: