ARCH Text datasets

Overview

ARCH text datasets allow the user to find, extract, and analyze the text contents of documents in a collection. These files can be used alongside a number of different text analysis methods and techniques, including sentiment analysis, named entity recognition, word frequency, collocation, n-grams, topic modeling, geoparsing, and word differences.
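As a rough illustration of two of the methods named above (word frequency and n-grams), the following sketch counts words and adjacent word pairs in a short string using only the Python standard library. The sample sentence and the helper names `tokenize` and `ngrams` are made up for this example and are not part of ARCH. The simple tokenizer suits space-delimited languages such as English; Chinese text would need a word segmenter instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Made-up sample text; in practice this would be extracted document text.
sample = "Web archives preserve the web. Web archives support research."

tokens = tokenize(sample)
word_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))

print(word_counts.most_common(3))
print(bigram_counts.most_common(1))
```

The same two counters can be fed the text of any single record or the concatenated text of a whole collection.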

Datasets

Extracted text

Text extracted from collection documents in PDF, DOC, TXT, or other text-bearing file formats. Output: one JSONL file comprising a JSON object for each input record. Values include filename and path, MIME type, and extracted text. 📥 Download an example. 📚 Read the code.
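Because the output is JSONL (one JSON object per line), it can be read line by line without loading the whole file into memory. The sketch below shows one way to do that and to check which keys the records actually contain. The key names in the sample (`filename`, `mime_type`, `text`) are assumptions made for illustration, so inspect a downloaded file for the real ones.

```python
import io
import json
from collections import Counter

def iter_jsonl(lines):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def key_counts(records):
    """Count how often each top-level key appears across the records."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

# Hypothetical sample; the real key names in an ARCH file may differ.
sample = io.StringIO(
    '{"filename": "a.pdf", "mime_type": "application/pdf", "text": "Hello"}\n'
    '{"filename": "b.txt", "mime_type": "text/plain", "text": "World"}\n'
)

records = list(iter_jsonl(sample))
print(key_counts(records))
```

To run this against a downloaded dataset, replace the in-memory sample with a file opened in text mode with UTF-8 encoding.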

Named entities

Named entities datasets contain the persons, organizations, geographic locations, and dates found in the text-bearing resources of a collection, organized in JSON format by source document. ARCH currently provides named entities in English or Chinese, drawn from web pages and from text transcribed from images and speech. 📚 Learn more and see examples.
 

Plain text of web pages

📥 Download an example. 📚 Read the code. This dataset lists the location, technical metadata, and extracted full text of each HTML page or other text-bearing web document in a web archive collection. The CSV file presents data in the following columns:

crawl_date

  • type: String
  • format: yyyyMMddHHmm
  • example: 201610140439
  • description: The timestamp representing when the document was collected from the live web.
  • learn more: Timestamp

last_modified_date

  • type: String
  • format: yyyyMMddHHmm
  • example: 201211041756
  • description: The timestamp, when available, reported by the live web server at collection time to represent when the document was last changed on the live web. 
  • learn more: Last modified date

domain

  • type: String
  • format: sld.tld
  • example: archive.org
  • description: The host site serving the document on the live web at the time of collection.
  • learn more: Domain

url

  • type: String
  • format: protocol://hostname/directory/filename
  • example: https://archive.org/about/terms.php
  • description: The location of the document on the live web at the time of collection.
  • learn more: URL

mime_type_web_server

  • type: String
  • format: type/format
  • example: text/html
  • description: The format of the document as reported by the live web server at collection time.
  • learn more: MIME

mime_type_tika

  • type: String
  • format: type/format
  • example: text/html
  • description: The format of the document as reported by ARCH, running Apache Tika.
  • learn more: Apache Tika

language

  • type: String
  • format: ISO 639-1
  • example: en
  • description: The language of the document's body content as reported by ARCH, running Apache Tika.
  • learn more: Apache Tika

content

  • type: String
  • example: Hello, world!
  • description: The plain text body content of the HTML or other text-bearing web document, with markup removed.
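The columns described above can be read with Python's standard csv module. The following is a minimal sketch, assuming the CSV has a header row that uses these column names. The first sample row reuses the example values listed above, and the second row is invented for illustration. Only the leading eight digits of crawl_date (year, month, day) are parsed here.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Full page text can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)

# Two illustrative rows; the second row is invented for this sketch.
SAMPLE_CSV = (
    "crawl_date,last_modified_date,domain,url,"
    "mime_type_web_server,mime_type_tika,language,content\n"
    "201610140439,201211041756,archive.org,"
    "https://archive.org/about/terms.php,text/html,text/html,en,"
    '"Hello, world!"\n'
    "201610150812,,example.org,https://example.org/,"
    "text/html,text/html,fr,Bonjour\n"
)

def load_rows(handle):
    """Read every CSV row into a dict keyed by column name."""
    return list(csv.DictReader(handle))

def crawl_day(row):
    """Parse the leading yyyyMMdd digits of crawl_date into a date."""
    return datetime.strptime(row["crawl_date"][:8], "%Y%m%d").date()

rows = load_rows(io.StringIO(SAMPLE_CSV))
english = [row for row in rows if row["language"] == "en"]
pages_per_domain = Counter(row["domain"] for row in rows)

print(len(english), "English page(s)")
print(pages_per_domain.most_common())
print(crawl_day(rows[0]))
```

For a downloaded file, pass a handle opened with `newline=""` and UTF-8 encoding to `load_rows` in place of the in-memory sample. For very large collections, iterating over `csv.DictReader` directly avoids holding every row in memory.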

Text files dataset 

The text files dataset lists locations, metadata, and extracted text contents for text-based web documents in the following formats: CSS, JSON, XML, plain text, JS, and HTML. This dataset is packaged as a CSV file with the following columns: crawl_date, last_modified_date, url, filename, extension, mime_type as provided by the web server and as detected by Apache Tika, md5 and sha1 checksum values, and content. 📚 Read the code.
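One possible use of these columns is to count documents per file extension and to group URLs that share a checksum, which presumably indicates identical payloads captured at different locations. The sketch below assumes the header names `mime_type_web_server`, `mime_type_tika`, `md5`, and `sha1`, which the description above does not spell out. All sample rows and checksum values are invented placeholders.

```python
import csv
import io
from collections import Counter, defaultdict

# Invented rows; header names and checksum values are placeholders.
SAMPLE_CSV = (
    "crawl_date,last_modified_date,url,filename,extension,"
    "mime_type_web_server,mime_type_tika,md5,sha1,content\n"
    "201610140439,,https://example.org/style.css,style.css,css,"
    "text/css,text/css,md5-aaa,sha1-aaa,body { margin: 0 }\n"
    "201610140440,,https://example.org/old/style.css,style.css,css,"
    "text/css,text/css,md5-aaa,sha1-aaa,body { margin: 0 }\n"
    "201610140441,,https://example.org/data.json,data.json,json,"
    "application/json,application/json,md5-bbb,sha1-bbb,{}\n"
)

def count_by_extension(rows):
    """Count how many documents carry each file extension."""
    return Counter(row["extension"] for row in rows)

def duplicate_groups(rows):
    """Map each md5 value seen more than once to the URLs that share it."""
    urls_by_md5 = defaultdict(list)
    for row in rows:
        urls_by_md5[row["md5"]].append(row["url"])
    return {md5: urls for md5, urls in urls_by_md5.items() if len(urls) > 1}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
extension_counts = count_by_extension(rows)
duplicates = duplicate_groups(rows)

print(extension_counts)
print(duplicates)
```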

Example uses
 

  1. Queer Webcomics Archives Project - an Archives Unleashed Datathon project that explored the representation of queer identities within the Global Webcomics web archive collection. A variety of methods and visualizations were produced while analyzing text and domains datasets.

     
  2. Autism Discourse in the U.S: An Exploratory Analysis - another Archives Unleashed datathon project that explored the discourse among autism bloggers in the Autism and Alzheimers web archive collection, using two primary modes of analysis: sentiment analysis (using NLTK) and network analysis (Gephi).

     
  3. #teamwildfyre - an Archives Unleashed datathon project that explored the impact and severity of forest fires, as well as how information spreads and is broadcast by media outlets. A variety of methods were used to analyze and visualize the full text dataset files, with a focus on named entity recognition (NER) and geocoding.

     
  4. Creating Collection Growth Curves with AUT and Hypercane - these blog posts describe the process of using various tools to explore web archives collection growth curves. This can ultimately be "used to gain a better understanding of seed curation and the crawling behavior."

Example text sentiment analysis from the Autism and Alzheimers web archive collection

 

Tool recommendations

The following software tools and libraries can be used to parse and analyze the text extracted from collections.

To learn more and practice exploring ARCH text datasets, see our tutorials:

Natural language processing (NLP)

Named entity recognition

  • Stanford NER (Stanford Named Entity Recognizer) - Command line tool written in Java for labeling recognized word sequences as persons, organizations, or locations. This is the tool that ARCH uses to create named entity datasets, but it may also be used directly on other text datasets.
     

Topic modeling

  • BERTopic - Richly featured Python library for topic modeling.

     
  • CorEx Topic Model - Python package for topic modeling, optimized for guided, domain-specific topics through the use of "anchor words."

     
  • MALLET - Java-based command line package for topic modeling and other machine learning applications.