Tutorial: How to mine text from a web archive collection with Voyant

Introduction

Web archives can be read at the scale of an entire collection in order to surface the broader themes, topics, people, and places that their contents include or share. We can use or adapt natural language processing (NLP) tools to read these texts from a distance, as researchers already do with books, journals, and other corpora. This tutorial uses Voyant, a popular platform of NLP tools, to analyze and visualize the written contents of several websites collected around a common theme.

Instructions

In this section:

  1. Get to know your data
  2. Visualize term frequencies, relationships, and context
  3. Interpret the results

Get to know your data

  1. Locate the web-pages.csv file in the ARCH workshop archive and open it with your preferred spreadsheet program (Excel, Calc, Numbers, Sheets, etc.).

  2. Take note of the seven attributes included in each “Plain text of web pages” dataset from ARCH. Each row in the spreadsheet describes one web page that was served in a text-based format (HTML, XML, etc.) when it was collected for the archive:
    1. crawl_date: a timestamp representing when each page was collected.

    2. domain: the web domain on which the page and its text originally appeared.

    3. url: the location of the page on the “live” web at the time it was collected.

    4. mime_type_web_server: the file format of the web page as specified by its server at the time it was collected.

    5. mime_type_tika: the file format of the web page as determined by the Apache Tika toolkit during the dataset derivation job.

    6. language: the language of the text content on the web page.

    7. content: the full text content of each web page.

  3. For this tutorial we will focus on the text derived from the eight seeds in the Art Galleries collection contributed by the Maryland Institute College of Art, covering the city of Baltimore. Locate the web-pages_Baltimore folder in the workshop archive to find the content values above in text files organized by domain. Open one of the text files with your preferred text editor (Notepad, Wordpad, etc.) to see the page contents extracted as a single corpus of text.
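
If you would rather inspect the dataset with a script than a spreadsheet, the steps above can be sketched in Python. This is a minimal sketch, not part of ARCH or Voyant: it assumes the CSV has a header row naming the seven attributes listed above, and the function names and file paths are hypothetical. It groups the content values by domain and writes one text file per domain, similar in spirit to the prepared workshop folder.

```python
import csv
from collections import defaultdict
from pathlib import Path

# The seven attributes described in the tutorial (assumed to be the CSV header).
EXPECTED_COLUMNS = [
    "crawl_date", "domain", "url",
    "mime_type_web_server", "mime_type_tika",
    "language", "content",
]


def group_content_by_domain(csv_path):
    """Return {domain: [content, ...]} read from a web-pages CSV file."""
    # Full page text can exceed the csv module's default field size limit.
    csv.field_size_limit(2**31 - 1)
    pages = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        for row in reader:
            pages[row["domain"]].append(row["content"])
    return dict(pages)


def write_domain_files(pages, out_dir):
    """Write one plain-text file per domain, with pages separated by blank lines."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for domain, texts in pages.items():
        (out / f"{domain}.txt").write_text("\n\n".join(texts), encoding="utf-8")
```

For example, write_domain_files(group_content_by_domain("web-pages.csv"), "web-pages_by_domain") would produce a folder of per-domain text files that can be uploaded to Voyant in the next section. Both paths here are placeholders; adjust them to wherever you unpacked the workshop archive.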

Visualize term frequencies, relationships, and context

  1. Open Voyant in your preferred web browser here: https://voyant-tools.org/

  2. Click the “Upload” button at the bottom-left corner of the central “Add Texts” pane, then select and open all of the text files in the folder. (It might take a minute to load all of this text fully, so do not navigate away from the page while the loading icon is active.)



  3. Get to know the dashboard. Voyant offers dozens of ways to view and parse text. By default, the dashboard shows (clockwise from top-left):
    1. a word cloud,
    2. an in-line text reader,
    3. a trendline graph for terms,
    4. the context of words immediately preceding and succeeding those terms, and 
    5. a statistical summary of the vocabulary used throughout the corpus.

  4. Hover over the top-right corner of any pane to find the options to export it, change the view, or edit processing options such as stop words.
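
To build some intuition for what the word cloud, summary, and context panes are counting, here is a minimal Python sketch of term frequencies with stop words and of a keyword-in-context listing. It only approximates the idea and is not how Voyant itself is implemented: the tokenizer is deliberately simple, the stop-word list is a tiny illustrative one, and the function names are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real stop-word lists are much longer.
STOP_WORDS = {"the", "and", "of", "a", "an", "in", "to", "is", "for", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def term_frequencies(text, stop_words=STOP_WORDS):
    """Count each token that is not a stop word."""
    return Counter(tok for tok in tokenize(text) if tok not in stop_words)


def keyword_in_context(text, keyword, width=3):
    """Return (left, keyword, right) tuples with up to `width` words on each side."""
    tokens = tokenize(text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows
```

Calling term_frequencies(text).most_common(10) on the contents of one of the gallery text files gives a rough counterpart to the largest words in the cloud, and adding words to STOP_WORDS has a similar effect to editing the stop-word list in a pane's options.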

Interpret the results

  1. Voyant offers several tools to study word collocation (words found in a row) and correlation (words found nearby) in a corpus. Click the “Links” button at the top of the top-left pane in order to replace the word cloud with a visualization of the most frequent term pairings in the corpus and the resulting links among them.
    1. Click on the line connecting the terms new and york. The “Trends” pane at the top-right corner of the screen will then chart how frequently these terms collocate in each of the archived sites.

    2. Which Baltimore gallery might have the closest relationship to the New York scene? Click on the bubble atop their bar in the chart in order to browse the instances of “New York” in their original context at the bottom-right.

    3. Let’s return to our video works. Click the “Clear” button with the trash bin icon in order to clear the existing terms and links from your view of the “Links” pane and start fresh.

    4. Type “video” into the search bar and select the video (74) suggestion. Which other words collocate most frequently with video?

    5. Click on the line connecting the terms video and installation in order to visualize this collocation. Which galleries mention video installations most frequently in the group? How do the results differ for video and art, or for video and works?
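
The questions above can also be explored outside of Voyant. The following Python sketch counts the words that appear within a few positions of a keyword, and counts how often one term immediately follows another in each document. It is a rough approximation under stated assumptions, not the algorithm behind the “Links” or “Trends” panes: the window size, the short stop-word list, and the function names are all illustrative choices.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real stop-word lists are much longer.
STOP_WORDS = {"the", "and", "of", "a", "an", "in", "to", "is", "for", "on", "at", "with"}


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def collocates(tokens, keyword, window=5):
    """Count non-stop words found within `window` tokens of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.update(n for n in neighbors if n not in STOP_WORDS and n != keyword)
    return counts


def phrase_counts_by_document(documents, first, second):
    """For each named document, count how often `first` is directly followed by `second`."""
    result = {}
    for name, text in documents.items():
        tokens = tokenize(text)
        result[name] = sum(
            1 for a, b in zip(tokens, tokens[1:]) if a == first and b == second
        )
    return result
```

Given a dictionary that maps each gallery's domain to the text of its file, phrase_counts_by_document(documents, "new", "york") gives a per-gallery count comparable to the bars in the “Trends” chart, and collocates(tokenize(text), "video").most_common(10) lists candidate partners for video within one gallery's text.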