Introduction
Browser-based tools like those included in the above tutorials can help you examine and visualize relatively small samples of data. Analyzing full ARCH datasets from web archive collections at scale can require more computing power, command line tools, and custom code.
Jupyter notebooks make it easy to demonstrate, and even modify, these workflows step by step. Follow the instructions below to practice using these notebooks to run popular Python libraries for text analysis and visualization, named entity recognition, and sentiment analysis.
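As a taste of the kind of text analysis the notebooks walk through, here is a minimal, hypothetical sketch that tokenizes a short passage and tabulates word frequencies with pandas. The sample string is a stand-in for one page of extracted text; the real notebooks load the full plain-text dataset file instead.

```python
import re
from collections import Counter

import pandas as pd

# Stand-in for one document from an ARCH plain-text dataset.
sample_text = (
    "The gallery hosts rotating exhibits of contemporary art. "
    "Gallery hours and exhibit schedules are posted online."
)

# Lowercase the text and split on non-letter characters, dropping empties.
tokens = [t for t in re.split(r"[^a-z]+", sample_text.lower()) if t]

# Tabulate counts in a DataFrame, most frequent token first.
freq = pd.DataFrame(Counter(tokens).most_common(), columns=["token", "count"])
print(freq.head())
```

The same counting pattern scales from one page to an entire collection, and the resulting frequency table is the usual input for a word cloud.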
Used in this tutorial:
- Dataset: Plain text of web pages from the Art Galleries web archive collection, Baltimore City detailed coverage
- Tools: pandas, nltk, spaCy, vaderSentiment, and word_cloud
- Time to complete: ~15-25 minutes
Thanks to Nick Ruest and the Archives Unleashed Project for authoring this notebook and the others that are automatically linked from each ARCH user's dataset detail view for guided exploration.
Instructions
Click the Open in Colab button to view and execute the sample code step by step, or preview the results in the completed version of the notebook below: