ARCH incorporates terms and concepts from text and data mining, digital humanities, computer science, the web, and web archiving. Use the alphabetical list below to find short definitions and learn more about the language used to describe ARCH datasets, processes, accounts, and analyses.
Please always let our team know when we can help to explain or contextualize aspects of the ARCH service platform further and add to the list of terms.
On this page:
- Alternative text
- Anchor text
- Apache TIKA
- ARCH (Archives Research Compute Hub)
- Archive Research Services
- Archives Unleashed
- Collocation
- Command line
- Count
- CSV
- Dataset
- Domain
- Geocoding
- Geoparsing
- Gephi
- grep
- Hash
- HTTP status code
- Last modified date
- MIME type
- n-gram
- Named entity recognition (NER)
- Natural language processing (NLP)
- nltk (Natural Language Toolkit)
- Node
- Optical character recognition (OCR)
- OpenRefine
- pandas
- Python
- R
- Scope
- Seed
- Sentiment analysis
- SURT prefix
- Timestamp
- Topic modeling
- Voyant
- WARC
- Web document
Alternative text, or "alt text," is a short written description of non-text content (images, multimedia, etc.) on a web page. Alternative text inserted as an attribute in an HTML document contextualizes the purpose of multimedia content for people using screen readers and browsers that block images, and aids search engine optimization. Alt text extracted from images in a web archive collection may be found in the ARCH Image graph dataset. Learn more.
Anchor text is the visible, clickable text of an HTML hyperlink, also known as a "link label" or "link text." It contextualizes the purpose of a link for people using screen readers and for search engine optimization. Anchor text extracted from HTML web pages in a web archive may be found in the ARCH Web graph dataset. Learn more.
Apache TIKA is content type detection and extraction software. The TIKA framework parses text from over a thousand different file types, making it useful for content analysis. ARCH additionally uses Apache TIKA to extract MIME metadata for several of its standard datasets. Learn more.
ARCH (Archives Research Compute Hub) is a research and teaching service provided by the Internet Archive that helps users easily build, access, analyze, publish, and preserve web archive datasets at scale. It was developed in collaboration with the Archives Unleashed project.
Archive Research Services (ARS)
Archive Research Services, or "ARS," refers to a legacy Internet Archive service that provided three research datasets derived from Archive-It web archive collections, all of which ARCH still supports: Web Archive Transformation (WAT), Web Archive Named Entity (WANE), and Longitudinal Graph Analysis (LGA) datasets.
The Archives Unleashed Project was a collaboration among partners at the University of Waterloo, York University, and Internet Archive to enhance web archive research data access and usability through research, education, and technology development. Integration of the Archives Unleashed toolkit of cloud services to derive and deliver research datasets from Archive-It collections into the ARCH platform was supported by a grant from the Andrew W. Mellon Foundation. Learn more.
Collocation refers to a series of words or terms that co-occur and become established through repeated context-dependent use. ARCH text datasets support extracting plain text from web resources for this kind of analysis with natural language processing tools like Voyant. Learn more.
The command line is shorthand for a user interface that functions by typing commands at prompts instead of browsing windows with a mouse in order to interact with a computer system's files, directories, and programs. Programming languages like R or Python, software like pandas or nltk, and more utilities can be used from the command line in order to explore and use ARCH datasets. Learn more.
The count is a numerical summary of a given attribute in an ARCH dataset, for instance the number of times that a domain appears throughout a web archive collection in the Domain frequency dataset or the number of links between two domains in a Domain graph dataset.
"Comma-separated values," or CSV, is a digital file format, represented by filename extension .csv. Most ARCH datasets can be extracted in this format, presenting data attributes and values as plain text, which can be rendered by a desktop spreadsheet program like Excel or a command line tool like pandas. Learn more.
An ARCH dataset is a file or files of data built from web archive collection metadata, provenance information, entities, links, and/or other key elements. All current ARCH dataset types are described here: ARCH Datasets.
The domain is the text-based label that identifies the host of a web resource in its URL. It includes the type of server or sub-domain (ex. www), the host name (ex. archive-it), and the top-level domain (ex. .org, .com, .edu, etc.). ARCH Network datasets specify and enumerate the links between source and target domains in a web archive collection. The Domain frequency dataset contains a summary count of each domain present in a collection. Learn more.
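The parts described above can be pulled out of a URL with Python's standard library. This is a sketch for simple three-part hosts only; real-world hosts can have more or fewer labels, so a general-purpose parser would need more care:

```python
from urllib.parse import urlparse

# Split a URL's host into subdomain, host name, and top-level domain.
url = "https://www.archive-it.org/collections/123"
host = urlparse(url).netloc          # "www.archive-it.org"
subdomain, name, tld = host.split(".")
print(subdomain, name, tld)
```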
Geocoding is the process of linking a text-based description of a location (such as an address, coordinates, or the proper name of a place) to a location on the Earth’s surface. ARCH text datasets and the Web Archive Transformation (WAT) dataset can enable this kind of analysis from web archive collection data. Learn more.
Geoparsing is the process of linking free-text descriptions of places, such as colloquial or relative directions, to geographic identifiers like coordinates, addresses, etc. While geocoding analyzes structured location references, geoparsing handles ambiguous references using special software or services. Learn more.
Gephi is a desktop software program used to make, explore, and understand network graphs. Users can interact with graphed data to reveal patterns and isolate anomalies. It can be used to visualize ARCH network datasets in the CSV format. Learn more and practice visualizing ARCH datasets with Gephi yourself here: Tutorial: Graph a network of web domains with Gephi.
grep is a command-line utility for searching plain-text datasets for lines that match a regular expression. It can be used to extract a more manageable sample of data when working with large datasets in the CSV format. Learn more and find examples in our data cleaning recommendations.
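grep itself runs in a shell, but the same line filtering can be reproduced in Python with the standard library's re module. The pattern and rows below are illustrative:

```python
import re

# Keep only CSV lines that match a pattern, the way
# `grep 'archive' dataset.csv` would at the command line.
lines = [
    "archive.org,1500",
    "example.com,42",
    "archive-it.org,981",
]
pattern = re.compile(r"archive")
matches = [line for line in lines if pattern.search(line)]
print(matches)
```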
A cryptographic hash function, or a "hash" for short, is an algorithm that maps data of an arbitrary size (referred to as the 'message') to a bit array of a fixed size (called the 'hash value' or 'message digest'). Hashes are useful for many information security, authentication, and data indexing applications. ARCH uses the MD5 and SHA1 hashes to include unique checksum values in its file format datasets. Learn more.
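Python's standard hashlib module computes both checksums that ARCH includes in its file format datasets:

```python
import hashlib

# Compute MD5 and SHA1 digests of the same message.
data = b"hello"
md5 = hashlib.md5(data).hexdigest()
sha1 = hashlib.sha1(data).hexdigest()
print(md5)
print(sha1)
```

Any change to the input bytes, however small, produces a completely different digest, which is what makes hashes useful as checksums.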
HTTP status code
Hypertext Transfer Protocol (HTTP) response status codes (or simply "status codes") are standard server responses to client requests for web documents, collected at crawl time and preserved in the WARC files that constitute web archive collections. Common status codes include 200 ("OK"), 302 ("Found"), and 404 ("Not Found"). ARCH supports filtering custom collections by status code, for instance to analyze only the successful responses in a collection. Learn more.
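The status-code filter described above looks like this in plain Python, using hypothetical records with illustrative field names:

```python
# Hypothetical records from a web archive dataset.
records = [
    {"url": "https://example.com/", "status_code": 200},
    {"url": "https://example.com/gone", "status_code": 404},
    {"url": "https://example.com/moved", "status_code": 302},
]

# Keep only successful responses, as when a custom collection
# is limited to status code 200.
successful = [r for r in records if r["status_code"] == 200]
print(len(successful))
```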
Last modified date
The "last modified date" is a timestamp provided by a web document's original server at the time that it is archived and written into the WARC file, indicating when the document was last updated on the live web. It can be used to determine the age of a digital file more specifically than the date and time at which it was collected. Learn more.
"Multipurpose Internet Mail Extensions"--abbreviated frequently as MIME, MIME Type, mimetype, or media type--is an internet standard for detecting and characterizing digital file formats. ARCH text and file format datasets include MIME types reported by original host servers at the time of collection and/or detected later by Apache TIKA. Learn more.
An n-gram is a set of co-occurring words, symbols, or tokens in a contiguous sequence, where n represents the number of collocated items. N-grams are useful for natural language processing (NLP) and text mining. You can generate them for text analysis methods from ARCH text datasets with NLP software platforms like Voyant or command line tools like nltk. Learn more.
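nltk provides ready-made helpers for this (e.g. nltk.util.ngrams), but the idea fits in a few lines of plain Python:

```python
def ngrams(tokens, n):
    """Return the contiguous n-item sequences in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the internet archive preserves the web".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```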
Named entity recognition (NER)
Named entity recognition, or "NER," is an information extraction methodology that classifies text strings into predefined categories, such as personal names, organizations, and geographic locations. ARCH users may extract these named entities from web archive collections directly via the WANE dataset or else apply NER tools manually to other ARCH text datasets. Learn more.
Natural language processing (NLP)
Natural language processing, or "NLP," refers to a broad set of research methodologies that use computation to parse, analyze, and represent the text of natural human languages at aggregate scales in order to understand themes, topics, or entities in a text corpus. ARCH text datasets and the text derived from the PDF information dataset are optimal for applying NLP concepts and tools. Learn more.
nltk (Natural Language Toolkit)
Natural Language Toolkit--frequently abbreviated as NLTK or nltk--is a widely used suite of Python libraries and programs for natural language processing. It is useful for interacting from the command line with ARCH text datasets and the text derived from the PDF information dataset. Learn more.
"Node" is a concept from network visualization that denotes points of reference that transact with or relate to one another in a way that may be graphed in spatial terms. The web pages, domains, documents, and images in ARCH network datasets may serve as the nodes related by their hyperlinks to one another in a graphed network.
Optical character recognition (OCR)
Optical character recognition, or "OCR," is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text. OCR software may be applied to ARCH PDF information datasets in order to extract text for natural language processing. Learn more.
OpenRefine is an open source desktop application used primarily for data cleaning. This tool can help to simplify tasks required to work with or transform messy data in the CSV file format. Learn more.
pandas is an open source data analysis and manipulation software library, built with the Python programming language. pandas can be used to interact with ARCH datasets from the command line and to create more manageable samples of data when working with especially large datasets. Learn more.
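A minimal sketch of sampling a dataset with pandas; the tiny inline DataFrame stands in for a large CSV you would normally load with pd.read_csv:

```python
import pandas as pd

# Stand-in for a large ARCH CSV dataset; column names are illustrative.
df = pd.DataFrame({
    "domain": ["archive.org", "example.com", "archive-it.org"],
    "count": [1500, 42, 981],
})

# Take a reproducible random sample of rows. For very large files,
# pd.read_csv's chunksize parameter reads the data in pieces instead.
sample = df.sample(n=2, random_state=0)
print(len(sample))
```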
Python is a computer programming language. It is widely used to build websites and software and to conduct data analysis. Python-supported software services can parse ARCH datasets to facilitate research. Learn more.
R is a computer programming language. It is widely used to conduct data analysis and to build software. R-supported software services can parse ARCH datasets to facilitate research. Learn more.
Scope is a web archiving concept that describes the predefined extent to which a crawler collects documents from the web. Web crawlers limit their scopes automatically and/or web archivists can apply "scoping rules" manually in order to control what a crawler does or does not collect. Knowing the scope is therefore critical to understanding a custodian's intent in creating a web archive collection. Learn more.
In the context of Archive-It, "seed" refers to the URL (for a website, a specific directory, or a specific document) that acts as 1) the starting place for a web crawl, and 2) an access point for end users to reach archived web documents within a collection. Seeds and the scoping rules applied to crawls will impact the precision and recall of documents available for archival or research purposes. Learn more.
Sentiment analysis refers to a collection of natural language processing methods that parse and apply quantitative values to subjective information from a collection of text, identifying opinions, appraisals, emotions, or attitudes towards a specific topic. Learn more.
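Real sentiment analysis relies on trained models or large curated lexicons; the toy lexicon below only illustrates the basic scoring idea of mapping words to quantitative values and aggregating them:

```python
# Toy sentiment lexicon; real tools ship far larger, weighted vocabularies.
LEXICON = {"great": 1, "good": 1, "bad": -1, "terrible": -1}

def score(text):
    """Sum the lexicon values of the words in a text (0 means neutral)."""
    return sum(LEXICON.get(word, 0) for word in text.lower().split())

print(score("a great and good archive"))  # positive
print(score("a terrible outage"))         # negative
```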
Sort-friendly URI Reordering Transform, or "SURT," is a left-to-right, lowercase representation of a URL from a web archive collection that better matches the natural hierarchy of domain names. SURTs in web archive collections maintained by the Internet Archive read as: tld,domain)/path?query, with any subdomains appended after the domain. For example, the Internet Archive's homepage (archive.org) has a SURT of org,archive)/ and this glossary page has the SURT com,zendesk,arch-webservices)/hc/en-us/articles/14410683244948. ARCH supports the use of SURT prefixes to create custom collections by matching all URLs that begin with a specified SURT value. For example, a SURT prefix that matches all articles in this Help Center, including this page, would read as: com,zendesk,arch-webservices)/hc/en-us/articles. Learn more.
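A rough sketch of the transform in Python, reversing the host labels and appending the path. Canonical SURT generation handles more edge cases (ports, www-stripping, query normalization), so treat this as an illustration only:

```python
from urllib.parse import urlparse

def surt_prefix(url):
    """Sketch of the SURT transform: reverse the dot-separated host
    labels, join them with commas, and append ')' plus the path."""
    parsed = urlparse(url.lower())
    host = ",".join(reversed(parsed.netloc.split(".")))
    return f"{host}){parsed.path}"

print(surt_prefix("https://archive.org/"))
print(surt_prefix("https://arch-webservices.zendesk.com/hc/en-us/articles"))
```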
In web archiving, the timestamp is the date and time at which a document was collected from the web and written into a WARC file, in the yyyyMMddHHmmss format. All ARCH datasets include full or partial timestamps (expressed as "crawl_date") for each document, domain, or link that they describe. Learn more.
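Assuming the full 14-digit yyyyMMddHHmmss form, a timestamp parses directly with Python's standard datetime module:

```python
from datetime import datetime

# Parse a 14-digit web archive timestamp (yyyyMMddHHmmss).
crawl_date = "20230115083000"
parsed = datetime.strptime(crawl_date, "%Y%m%d%H%M%S")
print(parsed.isoformat())  # -> 2023-01-15T08:30:00
```

Partial timestamps (e.g. year-month only) would need a shorter format string to match.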
Topic modeling is an array of natural language processing methods, used to model statistically and discover the semantic topics or structures that occur in a body of text. ARCH text datasets and the text derived from the PDF information dataset are optimal for applying topic modeling concepts and tools. Learn more.
Voyant is a free text analysis platform and web application, useful for quickly and easily visualizing data and exporting visualizations. Learn more and practice visualizing ARCH datasets with Voyant yourself here: Tutorial: How to mine text from a web archive collection with Voyant.
A Web ARChive, or "WARC," file is a container of web documents in the standard format of that name. Rendering tools can read and represent the contents of WARC files in order to facilitate end user browsing of web archive collections. ARCH parses and extracts information from WARC files to create all of its derivative datasets. Learn more.
A web document, or document, is an individual file in a web archive collection. It may be an HTML web page, a downloadable PDF, an embedded image, or any other discrete file that may be retrieved individually. Each unique document in a web archive collection has its own record in a WARC file that may be parsed and extracted for inclusion in an ARCH dataset.