Tutorial: How to mine text from a web archive collection with Voyant

Introduction

Web archives can be read at the scale of an entire collection in order to surface the broader themes, topics, people, and places that their sites include or share. We can use or adapt natural language processing (NLP) tools to read these texts from a distance, much as is already done with books, journals, and other corpora. This tutorial uses the popular Voyant platform of NLP tools to analyze and visualize the written contents of several websites collected around a common theme.

Instructions

In this section:

  1. Get to know your data
  2. Visualize term frequencies, relationships, and context
  3. Interpret the results

Get to know your data

  1. Locate the web-pages.csv file in the ARCH workshop archive and open it with your preferred spreadsheet program (Excel, Calc, Numbers, Sheets, etc.).

  2. Take note of the seven attributes included in each “Plain text of web pages” dataset from ARCH. Each row in the spreadsheet represents the characteristics of a web page formatted to represent text (as HTML, XML, etc.) when it was collected for the archive:
    1. crawl_date: a timestamp representing when each page was collected.

    2. domain: the web domain on which the page and its text appeared originally.

    3. url: the location of the page on the “live” web at the time it was collected.

    4. mime_type_web_server: the file format of the web page as specified by its server at the time it was collected.

    5. mime_type_tika: the file format of the web page as determined during its dataset derivation job by the Apache Tika toolkit.

    6. language: the language of the text content on the web page.

    7. content: the full text content of each web page.
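
If you would rather inspect the file programmatically than in a spreadsheet, the following is a minimal sketch using only the Python standard library. It assumes the header row uses exactly the seven column names listed above; the function name summarize_pages and the sample rows are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter

# The seven column names listed above, in spreadsheet order.
EXPECTED_COLUMNS = [
    "crawl_date", "domain", "url", "mime_type_web_server",
    "mime_type_tika", "language", "content",
]

def summarize_pages(csv_file):
    """Check the header, then count rows per domain and per language."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    domains, languages = Counter(), Counter()
    for row in reader:
        domains[row["domain"]] += 1
        languages[row["language"]] += 1
    return domains, languages

# Tiny in-memory stand-in for the real file (hypothetical rows).
SAMPLE = io.StringIO(
    "crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content\n"
    "20200101,example-gallery.org,http://example-gallery.org/,text/html,text/html,en,Video installation in New York\n"
    "20200102,example-gallery.org,http://example-gallery.org/a,text/html,text/html,en,New works on view\n"
    "20200103,example-museum.org,http://example-museum.org/,text/html,text/html,fr,Exposition\n"
)
domains, languages = summarize_pages(SAMPLE)
print(domains.most_common())
print(languages.most_common())
```

To run this against the real dataset, pass a file opened with open(path, newline="", encoding="utf-8") in place of SAMPLE. Pages with very long text can exceed the csv module's default field size limit; calling csv.field_size_limit with a larger value before reading is one way around that.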

Visualize term frequencies, relationships, and context

  1. Open Voyant in your preferred web browser here: https://voyant-tools.org/

  2. Click on the “Options” toggle icon at the top-right corner of the central “Add Texts” pane to configure your upload:
    1. Expand the “Tables” heading to configure the corpus as tabular data.
    2. From the “Documents” drop-down menu, select the option to extract text “from cells in each row.”
    3. Enter the number 7 into the “Content:” field to specify the column with the corpus text.
    4. Enter the number 2 into the “Group by column:” field to organize the text by domain.
    5. Click “OK” to complete the configuration.

  3. Get to know the dashboard. Voyant offers dozens of ways to view and parse text. By default, these include (clockwise from top-left):
    1. a word cloud,
    2. an in-line text reader,
    3. a trendline graph for terms,
    4. the context of words immediately preceding and succeeding those terms, and 
    5. a statistical summary of the vocabulary used throughout the corpus.

  4. Hover over the top-right corner of any pane to find options to export it, change its view, or edit processing options such as stop words.
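
To make the table configuration above more concrete, here is a rough sketch of what reading text from column 7 and grouping it by column 2 amounts to, followed by a simple term count of the kind a word cloud summarizes. It is an illustration only, not Voyant's implementation: the function names, the sample rows, and the very short stop-word list are all hypothetical, and the sketch assumes the first row of the file is a header.

```python
import csv
import io
import re
from collections import Counter, defaultdict

CONTENT_COLUMN = 7  # 1-based, as entered in the "Content:" field
GROUP_COLUMN = 2    # 1-based, as entered in the "Group by column:" field

# A deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "at"}

def group_text_by_column(csv_file, content_col=CONTENT_COLUMN, group_col=GROUP_COLUMN):
    """Join the content cells of all rows that share a group value."""
    reader = csv.reader(csv_file)
    next(reader)  # assumption: the first row is a header, so skip it
    grouped = defaultdict(list)
    for row in reader:
        grouped[row[group_col - 1]].append(row[content_col - 1])
    return {key: " ".join(texts) for key, texts in grouped.items()}

def term_frequencies(text):
    """Count lowercase word tokens, ignoring the stop words above."""
    words = re.findall(r"\w+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)

# Hypothetical rows standing in for the real dataset.
SAMPLE = io.StringIO(
    "crawl_date,domain,url,mime_type_web_server,mime_type_tika,language,content\n"
    "20200101,gallery-a.example,http://gallery-a.example/,text/html,text/html,en,Video installation in New York and new video works\n"
    "20200102,gallery-a.example,http://gallery-a.example/b,text/html,text/html,en,Video screening\n"
    "20200103,gallery-b.example,http://gallery-b.example/,text/html,text/html,en,Paintings in New York\n"
)
documents = group_text_by_column(SAMPLE)
for domain, text in documents.items():
    print(domain, term_frequencies(text).most_common(3))
```

Each key of documents corresponds to one domain, which is why grouping by column 2 lets you compare sites with one another rather than individual pages.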

Interpret the results

  1. Voyant offers several tools to study word collocation (words found in a row) and correlation (words found nearby) in a corpus. Click the “Links” button at the top of the top-left pane to replace the word cloud with a visualization of the most frequent term pairings in the corpus and the links among them.
    1. Click on the line connecting the terms “new” and “york” to visualize how frequently these terms collocate among the archived sites in the “Trends” pane at the top-right corner of the screen.

    2. Which gallery might have the closest relationship to the New York scene? Click on the bubble atop its bar in the chart to browse the instances of “New York” in their original context at the bottom-right.

    3. Let’s return to our video works. Click the “Clear” button with the trash bin icon in order to clear the existing terms and links from your view of the “Links” pane and start fresh.

    4. Type “video” into the search bar and select the “video” suggestion. Which other words collocate most frequently with “video”?

    5. Click on the line connecting the terms “video” and “installation” to visualize this collocation. Which galleries mention video installations most frequently in the group? How do the results differ for “video” and “exhibitions” or “works”?
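
For a sense of what counting collocates involves, the sketch below tallies the words that appear within a few words of a keyword. It is a simplified illustration, not Voyant's algorithm: the function name collocates, the window size, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def collocates(text, keyword, window=5):
    """Count the words that appear within `window` words of `keyword`."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        start = max(0, i - window)
        # Words just before and just after this occurrence of the keyword.
        neighbours = words[start:i] + words[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts

# Hypothetical text standing in for one gallery's pages.
SAMPLE_TEXT = (
    "A new video installation opens in New York. "
    "The video installation runs alongside earlier video works."
)
video_neighbours = collocates(SAMPLE_TEXT, "video", window=2)
print(video_neighbours.most_common(3))
```

In this sample, “installation” is the most frequent neighbour of “video”. Widening the window picks up looser associations at the cost of more noise, which is one reason results can shift when context settings change.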