ARCH collection datasets provide a high-level overview of web archive collections. These files enable exploration of domain-related information and patterns. They may be especially useful in conjunction with other ARCH datasets.
Domain frequency
The ARCH domain frequency dataset is a CSV file with the following columns: domain and count. It records the number of unique documents collected from each domain throughout a web archive collection. 📥 Download an example. 📚 Read the code.
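As a minimal sketch of how such a file might be explored, the Python snippet below reads a CSV with `domain` and `count` columns and lists the most frequently collected domains. The function name `top_domains` and the sample rows are made up for illustration and are not part of ARCH. The sketch also assumes the file has a header row naming the two columns.

```python
import csv
import io


def top_domains(csv_file, n=10):
    """Return the n (domain, count) pairs with the highest counts.

    Assumes csv_file is an open text file (or file-like object) whose
    header row names the columns "domain" and "count".
    """
    reader = csv.DictReader(csv_file)
    rows = [(row["domain"], int(row["count"])) for row in reader]
    # Sort by count, largest first.
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Made-up sample data standing in for a downloaded domain frequency file.
sample = io.StringIO(
    "domain,count\n"
    "example.org,120\n"
    "example.com,45\n"
    "example.net,300\n"
)

for domain, count in top_domains(sample, n=2):
    print(f"{domain}: {count}")
# prints:
# example.net: 300
# example.org: 120
```

For a downloaded file, the same function could be called with `open("domain-frequency.csv", newline="", encoding="utf-8")` in place of the in-memory sample, where the filename is a placeholder.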
A "Web Archive Transformation" (WAT) file contains metadata extracted from WARC-Info and WARC record headers and HTML headers, meta tags, and anchor tags, stored in JSON format. 📚 Learn more and see an example.
Example uses
- #teamwildfyre: this Archives Unleashed datathon project used domain frequency files to identify and understand the origins of the data within a web archive collection.
Example domain count visualization from the British Columbia Wildfires 2021 web archive