ARCH Network datasets

Overview

ARCH network datasets let you visually explore the connections among documents in a web archive. They can show, for example, which websites have the most in-bound or out-bound links, which paths can be taken through the networks that connect pages, and which communities exist within the same link structure.

These files can be loaded into network analysis programs like Gephi or NodeXL, or parsed and analyzed with R- or Python-based software packages.

Datasets

Networks are composed of "edges" (the hyperlinks between pages) and "nodes" (the webpages, images, or domains). ARCH provides these data in four standard packages:

Domain graph

The ARCH domain graph dataset counts links between domains in a web archive collection over time. A CSV file presents data in the following columns: crawl date, source, target, and count. 📥 Download an example. 📚 Read the code.
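As a minimal sketch of how in-bound and out-bound link totals could be computed from a domain graph file, the Python below sums the count column per source and per target domain. The header names (`crawl_date`, `source`, `target`, `count`) and the sample rows are assumptions made for illustration; check them against the header of your own file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The header names are assumed, not confirmed by ARCH.
SAMPLE_CSV = """crawl_date,source,target,count
20200101,example.org,archive.org,12
20200101,example.org,example.com,3
20200201,example.com,archive.org,5
"""

def link_totals(csv_file):
    """Return (outbound, inbound) Counters of total link counts per domain."""
    outbound, inbound = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        n = int(row["count"])
        outbound[row["source"]] += n  # links leaving the source domain
        inbound[row["target"]] += n   # links arriving at the target domain
    return outbound, inbound

outbound, inbound = link_totals(io.StringIO(SAMPLE_CSV))
print("Most linked-to domain:", inbound.most_common(1))
print("Most linking domain:", outbound.most_common(1))
```

To run this against a downloaded file, pass an open file handle in place of the `io.StringIO` object, for example `open("domain-graph.csv", newline="", encoding="utf-8")`, where the filename is a placeholder.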

Image graph

The ARCH image graph dataset records each unique image referenced in a web archive collection, its URL, when it was collected, the page on which it appeared, and any alt text associated with it. A CSV file presents data in the following columns: crawl date, source, url, and alt_text. 📥 Download an example. 📚 Read the code.
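One way to work with the image graph is to group rows by image URL and see which pages reference each image and which alt text accompanies it. The sketch below assumes the header names `crawl_date`, `source`, `url`, and `alt_text`, and uses made-up sample rows.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The header names are assumed, not confirmed by ARCH.
SAMPLE_CSV = """crawl_date,source,url,alt_text
20200101,http://example.org/,http://example.org/logo.png,Site logo
20200101,http://example.org/about,http://example.org/logo.png,Logo
20200102,http://example.org/news,http://example.org/photo.jpg,
"""

def group_by_image(csv_file):
    """Map each image URL to the pages that reference it and its alt texts."""
    pages = defaultdict(set)
    alts = defaultdict(set)
    for row in csv.DictReader(csv_file):
        pages[row["url"]].add(row["source"])
        if row["alt_text"]:  # skip rows with no alt text
            alts[row["url"]].add(row["alt_text"])
    return pages, alts

pages, alts = group_by_image(io.StringIO(SAMPLE_CSV))
for image_url, page_set in pages.items():
    print(image_url, "appears on", len(page_set), "page(s)")
```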

Web graph

The ARCH web graph dataset documents all links between documents in a web archive collection over time and any descriptive anchor text associated with them. A CSV file presents data in the following columns: crawl date, source, target, and anchor_text. 📥 Download an example. 📚 Read the code.
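Because the web graph lists page-to-page links, it can be used to trace a path of links from one document to another. The sketch below builds an adjacency map from the source and target columns and runs a breadth-first search over it. The header names and sample rows are assumptions for illustration, and the search ignores crawl dates.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical sample rows. The header names are assumed, not confirmed by ARCH.
SAMPLE_CSV = """crawl_date,source,target,anchor_text
20200101,http://a.example/,http://b.example/,B site
20200101,http://b.example/,http://c.example/,C site
20200101,http://a.example/,http://d.example/,D site
"""

def build_adjacency(csv_file):
    """Map each source URL to the set of URLs it links to."""
    adjacency = defaultdict(set)
    for row in csv.DictReader(csv_file):
        adjacency[row["source"]].add(row["target"])
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search. Returns a list of URLs, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

adjacency = build_adjacency(io.StringIO(SAMPLE_CSV))
print(shortest_path(adjacency, "http://a.example/", "http://c.example/"))
```

Links are directed, so a path from one page to another does not imply a path back.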

Longitudinal graph analysis (LGA)

The Longitudinal Graph Analysis (LGA) dataset contains a complete list of timestamped links between URLs throughout a web archive collection. It is packaged as a compressed .tgz file containing two kinds of compressed .gz files of JSON data: an ID-Map file of unique IDs and SURTs for all URLs and an ID-Graph file of links by their source URL IDs, timestamps, and the target URL IDs. 📚 Learn more and see an example.
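The two LGA file types are designed to be joined: IDs in the ID-Graph file are looked up in the ID-Map file to recover the SURTs. The sketch below shows that join under several assumptions that are not confirmed here: that each .gz file holds one JSON object per line, and that the fields are named `id`, `surt_url`, `timestamp`, and `outlink_ids`. Adjust the names to match the records in your own download.

```python
import gzip
import io
import json

def read_json_lines(gz_bytes):
    """Yield one dict per non-empty line of gzip-compressed JSON-lines data."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

def resolve_links(id_map_gz, id_graph_gz):
    """Yield (timestamp, source SURT, target SURT) for every link."""
    # Field names below are hypothetical; verify them against your files.
    surts = {rec["id"]: rec["surt_url"] for rec in read_json_lines(id_map_gz)}
    for rec in read_json_lines(id_graph_gz):
        for target_id in rec["outlink_ids"]:
            yield rec["timestamp"], surts[rec["id"]], surts[target_id]

# Made-up sample data, compressed in memory to stand in for the real files.
id_map_gz = gzip.compress(
    b'{"id": 1, "surt_url": "org,example)/"}\n'
    b'{"id": 2, "surt_url": "org,archive)/"}\n'
)
id_graph_gz = gzip.compress(
    b'{"id": 1, "timestamp": "20200101000000", "outlink_ids": [2]}\n'
)

links = list(resolve_links(id_map_gz, id_graph_gz))
print(links)
```

For a real download, the bytes of each .gz member can be read out of the .tgz archive with Python's standard `tarfile` module before being passed to `resolve_links`.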

Example uses

  1. Network Analysis of the UK Government Web Archives. A team of researchers at a National Archives workshop asked the research question, “how is the government web linked together at different points in time, and how might this have changed over the last 10 years?” This blog post explores their findings and the process of using Gephi for analysis.

  2. Contemporary Composers Web Archives. An Archives Unleashed Datathon project team used the domain graph dataset and Gephi to explore connection strength, community, and dominant nodes among websites.

    Figure (Composers_network.png): Example domain graph visualization from the Contemporary Composers web archive

Tool recommendations

Gephi

Gephi is established open-source software for visualizing and exploring graphs and networks. With the datasets from ARCH, a researcher can conduct link and social network analyses. The Gephi project offers a variety of guides to its features.
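Gephi's spreadsheet importer reads edge lists from CSV files and recognizes columns named Source, Target, and Weight. As a hedged sketch of one way to prepare a domain graph for it, the Python below sums link counts across crawl dates and writes an edge list with those headers. The input header names (`crawl_date`, `source`, `target`, `count`) and the sample rows are assumptions; this is not an official ARCH or Gephi workflow.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows. The header names are assumed, not confirmed by ARCH.
SAMPLE_CSV = """crawl_date,source,target,count
20200101,example.org,archive.org,12
20200201,example.org,archive.org,8
20200201,example.com,archive.org,5
"""

def to_gephi_edges(csv_file, out_file):
    """Write a Source,Target,Weight edge list; return the number of edges."""
    weights = Counter()
    for row in csv.DictReader(csv_file):
        # Collapse crawl dates by summing counts for each domain pair.
        weights[(row["source"], row["target"])] += int(row["count"])
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return len(weights)

edges_out = io.StringIO()
edge_count = to_gephi_edges(io.StringIO(SAMPLE_CSV), edges_out)
print(edges_out.getvalue())
```

Summing across crawl dates discards the time dimension. To compare periods, filter rows by crawl date first and write one edge list per period.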

Juxta

Juxta is a shell script for generating image collages to embed in webpages. The Archives Unleashed project wrote a guide to building a Juxta collage with data from the Image Graph derivative file: Creating Juxta Collages with Web Archive Images.

Tutorials

See our sample ARCH data tutorials to practice with short network dataset explorations using RAWGraphs, Palladio, and Gephi.
