Overview
Web Archive Transformation (WAT) files enable a variety of methods of data analysis for studying web archives in aggregate and across time, including text mining, study of provenance and capture information, comparison of linking and embed characteristics, and other ways to understand collections in their entirety. They enable easier analysis by providing these key metadata in JavaScript Object Notation (JSON) format, extracted from the much larger and differently structured original W/ARC files. WATs are generally 5%-20% of the data volume of their corresponding WARCs, making them easier to share, use, and store.
Example uses
Mapping a collection using WATs and geolocation
In this example, a geo-IP map was created for the Archive-It collection Latin American Government Documents Archive (LAGDA). A total of 33,806 unique IP addresses were extracted from the WAT files generated for the collection over a period of 9 years. The addresses were geocoded with MaxMind and plotted as a time-based visualization with CartoDB. The visualization provides insight into the geographic dispersion of the servers on which this topical collection's content was hosted.
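The IP extraction step can be sketched as follows. This is a minimal illustration, not the script used for the LAGDA map: it assumes each WAT JSON payload has already been read into a string, and the function names `extract_ip` and `unique_ips` are hypothetical. The `ARC-Header-Metadata` / `IP-Address` path matches the sample record in the Technical details section; the `WARC-Header-Metadata` / `WARC-IP-Address` path for WARC-derived records is an assumption that should be checked against your own files.

```python
import json


def extract_ip(wat_payload):
    """Return the server IP recorded in one WAT JSON payload, or None."""
    envelope = json.loads(wat_payload).get("Envelope", {})
    # ARC-derived records store the address as shown in the sample record.
    arc = envelope.get("ARC-Header-Metadata", {})
    # Assumption: WARC-derived records use these key names instead.
    warc = envelope.get("WARC-Header-Metadata", {})
    return arc.get("IP-Address") or warc.get("WARC-IP-Address")


def unique_ips(payloads):
    """Collect the distinct IP addresses found across many WAT payloads."""
    return {ip for ip in map(extract_ip, payloads) if ip}
```

The resulting set could then be passed to a geocoding service of your choice.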
Analyzing term frequencies in metatext
For this example, terms in the metadata headers of web pages in the Hydraulic Fracking in New York State collection were plotted by frequency over time. Top terms and associated timestamps were culled from the metatags and capture date information in WAT files. Term frequencies over time were then visualized as a streamgraph using the D3 JavaScript library. The visualization shows how the use of certain terms rose and fell over time within the collection.
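A rough sketch of the counting step is shown below. It is not the code behind the streamgraph: the function name `meta_terms_by_year`, the choice of the `keywords` and `description` metatags, and the simple letters-only tokenizer are all assumptions made for illustration. It reads the capture date from `ARC-Header-Metadata`, as in the sample record in the Technical details section, so WARC-derived records would need a different date key.

```python
import json
import re
from collections import Counter, defaultdict


def meta_terms_by_year(payloads, meta_names=("keywords", "description")):
    """Count metatag terms per capture year across WAT JSON payloads."""
    counts = defaultdict(Counter)
    for payload in payloads:
        envelope = json.loads(payload).get("Envelope", {})
        # ARC dates look like "20061114053448"; the first four digits are the year.
        year = envelope.get("ARC-Header-Metadata", {}).get("Date", "")[:4]
        metas = (
            envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {})
            .get("Head", {})
            .get("Metas", [])
        )
        for meta in metas:
            if meta.get("name", "").lower() in meta_names:
                # Naive tokenizer: lowercase runs of ASCII letters only.
                terms = re.findall(r"[a-z]+", meta.get("content", "").lower())
                counts[year].update(terms)
    return counts
```

Calling `counts["2006"].most_common(10)` on the result would list the ten most frequent terms for that year, which is the kind of table a streamgraph can be drawn from.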
Exploring the Canadian Political Interest Group and Political Parties Web Sphere
This example visualizes the link network within the Canadian Political Interest Groups collection, extracted from WAT files and plotted with Gephi. To learn more about how this visual analysis was created, see the video demonstration by historian and web archivist Ian Milligan.
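One way to derive a host-to-host link network from WAT data is sketched below. It is an illustration under stated assumptions, not the method used in the demonstration: the function name `host_edges` is hypothetical, only anchor links (entries whose `path` starts with `A@`) are counted, and the source URL is read from `ARC-Header-Metadata`, as in the sample record in the Technical details section. Relative URLs are resolved against the page URL with the standard library's `urljoin`.

```python
import json
from collections import Counter
from urllib.parse import urljoin, urlsplit


def host_edges(payload):
    """Return a Counter of (source_host, target_host) pairs for one WAT payload."""
    envelope = json.loads(payload).get("Envelope", {})
    source = envelope.get("ARC-Header-Metadata", {}).get("Target-URI", "")
    links = (
        envelope.get("Payload-Metadata", {})
        .get("HTTP-Response-Metadata", {})
        .get("HTML-Metadata", {})
        .get("Links", [])
    )
    src_host = urlsplit(source).hostname
    edges = Counter()
    for link in links:
        # Keep hyperlinks only; skip images, forms, and other embeds.
        if not link.get("path", "").startswith("A@"):
            continue
        target = urljoin(source, link.get("url", ""))
        dst_host = urlsplit(target).hostname
        if src_host and dst_host:
            edges[(src_host, dst_host)] += 1
    return edges
```

Summing these counters over all records and writing the pairs to a CSV file with source, target, and weight columns gives an edge list that graph tools can import.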
Technical details
WAT files are generated via a Hadoop-based processing pipeline that uses Apache Pig, Java, and Python scripts. Downloaded WAT datasets map one-to-one to W/ARCs and, like them, are packed as individually compressed records concatenated into a single file.
Each WAT record has a brief header that identifies its corresponding URL via "WARC-Target-URI", its corresponding W/ARC file via "WARC-Refers-To", and other mapping information. For example, the record below describes one archived webpage:
WARC/1.0
WARC-Type: metadata
WARC-Target-URI: http://wwwc.house.gov/smbiz/press/108th/2003/030604aNew.asp
WARC-Date: 2006-11-14T05:34:48Z
WARC-Record-ID: <urn:uuid:f60b604c-5b4f-4a82-859a-9bddf97f834e>
WARC-Refers-To: <urn:arc:web_con035-20061130122647-00842-crawling021.us.archive.org.arc:798>
Content-Type: application/json
Content-Length: 4254
{
  "Envelope": {
    "Format": "ARC",
    "ARC-Header-Metadata": {
      "Date": "20061114053448",
      "Content-Length": "27455",
      "Content-Type": "text\/html",
      "Target-URI": "http:\/\/wwwc.house.gov\/smbiz\/press\/108th\/2003\/030604aNew.asp",
      "IP-Address": "143.228.146.10"
    },
    "ARC-Header-Length": "106",
    "Payload-Metadata": {
      "Trailing-Slop-Length": "1",
      "Actual-Content-Type": "application\/http; msgtype=response",
      "HTTP-Response-Metadata": {
        "Headers": {
          "Date": "Tue, 14 Nov 2006 05:34:48 GMT",
          "Content-Length": "27206",
          "Expires": "Tue, 14 Nov 2006 05:24:48 GMT",
          "Connection": "close",
          "Content-Type": "text\/html",
          "Server": "U.S. House of Representatives",
          "X-Powered-By": "ASP.NET",
          "Cache-Control": "private"
        },
        "Headers-Length": "249",
        "Entity-Length": "27206",
        "Entity-Trailing-Slop-Bytes": "0",
        "Response-Message": {
          "Status": "200",
          "Version": "HTTP\/1.1",
          "Reason": "OK"
        },
        "HTML-Metadata": {
          "Links": [
            {
              "text": "Oversight Plan",
              "path": "A@\/href",
              "url": "..\/..\/..\/oversightPlan\/oversight_plan.asp"
            },
            {
              "text": "Special Projects",
              "path": "A@\/href",
              "url": "..\/..\/..\/specialProjects\/special_projects_for_108th_congress.asp"
            },
            {
              "text": "Committee Rules",
              "path": "A@\/href",
              "url": "..\/..\/..\/committeeRules\/committee_rules.asp"
            },
            {
              "text": "Chairmans Biography",
              "path": "A@\/href",
              "url": "..\/..\/..\/chairmansBiography\/chairmansBiography.asp"
            },
            {
              "text": "Committee Members",
              "path": "A@\/href",
              "url": "..\/..\/..\/committeeMembers\/committeeMembers.asp"
            },
            {
              "text": "Budget Views and Estimates",
              "path": "A@\/href",
              "url": "..\/..\/..\/budgetViewsAndEstimates\/budgetViewsAndEstimates.asp"
            },
            {
              "path": "IMG@\/src",
              "url": "..\/..\/..\/images\/smallerHeader.jpg"
            },
            {
              "text": "Home",
              "path": "A@\/href",
              "url": "..\/..\/..\/default.asp"
            },
            {
              "text": "About The Committee",
              "path": "A@\/href",
              "url": "..\/..\/..\/aboutTheCommittee.asp"
            },
            {
              "text": "Press Releases",
              "path": "A@\/href",
              "url": "..\/..\/asp_display_all_press_releases.asp?year=2006"
            },
            {
              "text": "Resources",
              "path": "A@\/href",
              "url": "..\/..\/..\/resources\/asp_display_resources.asp"
            },
            {
              "text": "Calendar of Events",
              "path": "A@\/href",
              "url": "..\/..\/..\/calendarOfEvents\/asp_calendar_of_upcoming_events.asp"
            },
            {
              "text": "Hearings",
              "path": "A@\/href",
              "url": "..\/..\/..\/hearings\/databaseDrivenHearingsSystem\/displayHearings.asp?congress=109"
            },
            {
              "text": "Subcommittees",
              "path": "A@\/href",
              "url": "..\/..\/..\/subcommittees\/subcommittees_main.asp"
            },
            {
              "text": "Small Business Facts",
              "path": "A@\/href",
              "url": "..\/..\/..\/smallBusinessFacts\/smallBusinessFacts.asp"
            },
            {
              "text": "Newsletters",
              "path": "A@\/href",
              "url": "..\/..\/..\/newsletters\/asp_display_newsletters.asp?year=2006"
            },
            {
              "text": "Legislation",
              "path": "A@\/href",
              "url": "..\/..\/..\/legislation\/legislation_for_109th_congress.asp"
            },
            {
              "text": "Contact & Location Details",
              "path": "A@\/href",
              "url": "..\/..\/..\/contactDetails\/contactDetails.asp"
            },
            {
              "text": "Search The Site",
              "path": "A@\/href",
              "url": "..\/..\/..\/search_the_website\/search_the_website.asp"
            },
            {
              "text": "Minority Site",
              "path": "A@\/href",
              "url": "http:\/\/www.house.gov\/smbiz\/democrats\/"
            },
            {
              "text": "Printer Friendly Version",
              "path": "A@\/href",
              "url": "..\\..\\..\\PFV\\030604a.asp"
            },
            {
              "path": "FORM@\/action",
              "method": "post",
              "url": "\/smbiz\/search_the_website\/search_results.asp"
            },
            {
              "text": "No hearings scheduled",
              "path": "A@\/href",
              "url": "\/smbiz\/calendarOfEvents\/asp_calendar_event_detail.asp?eventId=113"
            }
          ],
          "Head": {
            "Link": [
              {
                "path": "LINK@\/href",
                "rel": "stylesheet",
                "type": "text\/css",
                "url": "..\/..\/..\/css\/styles.css"
              }
            ],
            "Metas": [
              {
                "content": "text\/html; charset=iso-8859-1",
                "http-equiv": "Content-Type"
              },
              {
                "content": "US",
                "name": "DC.Coverage.Spatial"
              },
              {
                "content": "United States (C,V)",
                "name": "DC.Coverage.Spatial"
              },
              {
                "content": "United States. Congress. House of Representatives. Small Business Committee",
                "name": "DC.Creator"
              },
              {
                "content": "Small Business Committee, United States House of Representatives",
                "http-equiv": "Owner"
              },
              {
                "content": "United States Government work under 17 USC secs. 105, 403",
                "name": "DC.Rights"
              }
            ],
            "Title": "House of Representatives > Small Business Committee > Press Releases For 2003"
          }
        },
        "Entity-Digest": "sha1:XRAFRBPLOTVUUKE4ZB6CKOQR5V2JUAUK"
      },
      "Block-Digest": "sha1:ZQYVZSRFFTEYEZO4UCLGFCDPPQR62Z4U",
      "Actual-Content-Length": "27455"
    }
  },
  "Container": {
    "Compressed": true,
    "Gzip-Metadata": {
      "Footer-Length": "8",
      "Deflate-Length": "6381",
      "Header-Length": "10",
      "Inflated-CRC": "1808622068",
      "Inflated-Length": "27562"
    },
    "Offset": "798",
    "Filename": "NARA-109TH-CONGRESS-2006-20061114053449-00505-crawling021.us.archive.org.arc.gz"
  }
}
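Records like the one above can be read back with a short script. The following is a minimal sketch rather than an official reader: the function name `iter_wat_records` is hypothetical, and it assumes the file is gzip-compressed, that each record starts with a version line such as `WARC/1.0`, that the header block ends with a blank line, and that `Content-Length` gives the size of the body in bytes. Python's `gzip` module reads concatenated compressed members as one continuous stream, so the per-record compression needs no special handling here.

```python
import gzip
import json


def iter_wat_records(path):
    """Yield (headers, body_bytes) for each record in a gzipped WAT file."""
    with gzip.open(path, "rb") as stream:
        while True:
            line = stream.readline()
            if not line:
                return  # end of file
            if not line.strip():
                continue  # blank lines between records
            # `line` is the version line (for example b"WARC/1.0").
            headers = {}
            for raw in iter(stream.readline, b""):
                raw = raw.strip()
                if not raw:
                    break  # blank line marks the end of the header block
                name, _, value = raw.decode("utf-8").partition(":")
                headers[name.strip()] = value.strip()
            body = stream.read(int(headers.get("Content-Length", 0)))
            yield headers, body


def iter_json_payloads(path):
    """Yield (target_uri, parsed_json) for records that carry a JSON body."""
    for headers, body in iter_wat_records(path):
        if headers.get("Content-Type", "").startswith("application/json"):
            yield headers.get("WARC-Target-URI"), json.loads(body)
```

For larger jobs, an established WARC-reading library is likely a more robust choice than hand-rolled parsing.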