Overview
You may create a custom collection to narrow or combine the scopes of your original web archive collections before deriving ARCH datasets. This is an especially useful pre-processing step if your collections include a lot of source material that is irrelevant or superfluous to your research needs, or if your research materials are distributed among several different source web archive collections.
Instructions
To create a custom collection, begin by clicking the Create Custom Collection link at the top-right corner of the Collections pane on your ARCH dashboard.
Follow the instructions on the Custom Collection Builder page to select one or more source collections and define the properties of the original materials that you wish to include in your new custom collection:
SURT Prefix(es)
Sort-friendly URI Reordering Transform, or "SURT," is a left-to-right, lowercase representation of a URL from a web archive collection that better matches the natural hierarchy of domain names. SURTs in web archive collections maintained by the Internet Archive read as: tld,domain)/path?query.
For example, the Internet Archive's homepage (archive.org) has a SURT of org,archive)/ and this help page matches the SURT com,zendesk,arch-webservices)/hc/en-us/articles/16107865758228.
ARCH supports the use of SURT prefixes to create custom collections by matching all URLs that begin with a specified SURT value. For example, a SURT prefix to match all articles in this Help Center, including this page, would read as: com,zendesk,arch-webservices)/hc/en-us/articles. Learn more.
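The prefix matching described above can be sketched in a few lines of Python. This is a simplified illustration, not ARCH's own implementation: the function names to_surt and matches_prefix are hypothetical, and real SURT canonicalization applies further normalization rules that are omitted here.

```python
from urllib.parse import urlsplit


def to_surt(url: str) -> str:
    """Build a simplified SURT: reversed host labels, ')', then path and query.

    Hypothetical helper for illustration only; it skips the extra
    canonicalization steps that production web archive tools apply.
    """
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = ",".join(reversed(host.split(".")))
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{labels}){path}{query}".lower()


def matches_prefix(url: str, prefixes: list[str]) -> bool:
    """Return True if the URL's SURT begins with any of the given SURT prefixes."""
    surt = to_surt(url)
    return any(surt.startswith(prefix.lower()) for prefix in prefixes)


articles = ["com,zendesk,arch-webservices)/hc/en-us/articles"]
print(to_surt("https://arch-webservices.zendesk.com/hc/en-us/articles/16107865758228"))
print(matches_prefix("https://arch-webservices.zendesk.com/hc/en-us/articles/16107865758228", articles))
print(matches_prefix("https://archive.org/about", articles))
```

Because a SURT puts the most general part of the host first, a plain string prefix test is enough to select a whole domain, a subdomain, or a path beneath it.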
Crawl Date
In web archiving, each document is assigned a timestamp, in the yyyyMMddHHmmss format, when it is collected from the live web and written into a WARC file. ARCH supports filtering the contents of source collections by the earliest (start) and/or latest (end) timestamps to include in a new custom collection, so that you can scope your collection to your preferred time span.
You may use a complete timestamp like 20230615012135 to mark the end of your time span down to the second, or a partial timestamp like 20151014 or 201308 to include all timestamps from a given minute, day, month, or year. Learn more.
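As a rough sketch of how partial timestamps can bound a time span, the Python below pads a partial start with zeros and a partial end with nines before comparing 14-digit strings. The padding rule and the function name in_crawl_range are assumptions made for illustration; ARCH's internal handling may differ.

```python
from typing import Optional


def in_crawl_range(timestamp: str, start: Optional[str] = None, end: Optional[str] = None) -> bool:
    """Check a 14-digit yyyyMMddHHmmss timestamp against optional partial bounds.

    Assumption for illustration: a partial start is padded with '0' (earliest
    moment in that period) and a partial end with '9' (latest moment), so plain
    string comparison covers the whole minute, day, month, or year.
    """
    if start is not None and timestamp < start.ljust(14, "0"):
        return False
    if end is not None and timestamp > end.ljust(14, "9"):
        return False
    return True


# An end bound of 20151014 keeps everything captured on that day and earlier.
print(in_crawl_range("20151014235959", end="20151014"))
print(in_crawl_range("20151015000000", end="20151014"))
# A start bound of 201308 keeps everything from August 2013 onward.
print(in_crawl_range("20130731235959", start="201308"))
```

Fixed-width numeric timestamps sort the same way as strings and as dates, which is why no date parsing is needed in this sketch.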
HTTP Status Code
Hypertext Transfer Protocol (HTTP) response status codes are standard server responses to client requests for web documents, collected at crawl time and preserved in the WARC files that constitute web archive collections. Common status codes include 200 ("OK"), 302 ("Found"), and 404 ("Not Found"). ARCH supports filtering custom collections by status code, for instance to analyze only the successful responses in a collection. Learn more.
MIME Type
A "Multipurpose Internet Mail Extensions" type, or "MIME Type," is an internet standard for identifying and characterizing digital file formats in the form type/subtype, for example text/html or image/jpeg. ARCH supports filtering source collection materials by MIME Type in order to create a new custom collection of just the materials in your preferred file format(s). For a list of types commonly found in web archive collections, see: MIME Types.
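To make the status code and MIME Type filters concrete, here is a minimal Python sketch that applies both to a few made-up capture records. The Record class, the keep function, and the sample data are all hypothetical; they only illustrate the filtering idea and do not reflect ARCH's data model.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Record:
    """Hypothetical capture record holding only the fields used for filtering."""
    url: str
    status: int
    mime: str


def keep(record: Record, statuses: Optional[Set[int]] = None,
         mime_types: Optional[Set[str]] = None) -> bool:
    """Return True if the record passes every filter that was supplied."""
    if statuses is not None and record.status not in statuses:
        return False
    if mime_types is not None:
        # Drop parameters such as "; charset=utf-8" before comparing.
        mime = record.mime.split(";")[0].strip().lower()
        if mime not in mime_types:
            return False
    return True


records = [
    Record("https://example.org/", 200, "text/html; charset=utf-8"),
    Record("https://example.org/old", 302, "text/html"),
    Record("https://example.org/logo.png", 200, "image/png"),
    Record("https://example.org/missing", 404, "text/html"),
]

# Keep only successful HTML responses.
selected = [r.url for r in records if keep(r, statuses={200}, mime_types={"text/html"})]
print(selected)
```

Leaving a filter unset (None) means it is not applied, mirroring how each property in a custom collection is optional.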