How to collect new web data with Archive-It

Overview

You may create your own web archive collections and generate ARCH datasets by collecting data with the web archiving software service, Archive-It.

â„šī¸ Archive-It is a subscription service available to any ARCH user. Learn more about service options and get started with your own account: Archive-It Products & Services.

Use the guide below to create your first web archive collection and utilize Archive-It's User Guide and FAQs to learn much more about tools and features.

In this guide:

  1. Archive-It overview
  2. Accessing your Archive-It account
  3. Creating a collection
  4. Adding websites and pages to your collection
  5. Running a web crawl
  6. Reviewing your crawl results
  7. Analyzing your collection with ARCH

Archive-It overview

Watch the video below for a tour of the Archive-It software suite in under 10 minutes. The video helps you get acquainted with your Archive-It account, points out where key features are located, and demonstrates why you might want to use them.

(Note that this video refers to the legacy "ARS" suite of collection data features that other Archive-It partners may still use; as an ARCH user, you have access to the full and superseding ARCH platform.)

 

Accessing your Archive-It account

Follow the directions in your invitation email from Archive-It to create your Archive-It account's password. You may then access your Archive-It account anytime here: https://partner.archive-it.org/login

More resources when you need them:

📄 How to set up and administer your account

❓ Forgot password?

 

Creating a collection

Once logged in, you can create a new collection by clicking on the Create a Collection button and assigning your new collection a name:

ait4arch_01_Create-collection.gif

You may create as many collections, and add as many sites and pages to each, as you wish. The only limitation for ARCH users is a cap of 256 GB of total data archived across your collections during the pilot program period. You can track this data in real time with the subscription graphic on the left-hand side of the home screen shown above. (Follow the instructions below for conducting test crawls to make the most efficient use of your data budget.)
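If it helps to picture the budget math, here is a small illustrative Python snippet; the collection names and sizes in it are made-up example values, and the subscription graphic in your account remains the authoritative source:

    # Illustrative only: estimate how much of the 256 GB pilot data budget remains.
    # Collection names and sizes below are hypothetical example values.
    BUDGET_GB = 256

    collections_gb = {
        "City Government Sites": 40.5,
        "Local News": 87.2,
        "Community Organizations": 12.8,
    }

    used_gb = sum(collections_gb.values())
    print(f"Archived so far: {used_gb:.1f} GB")
    print(f"Remaining in budget: {BUDGET_GB - used_gb:.1f} GB")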

More resources when you need them:

📄 How to create and manage a collection

📄 How to monitor your data budget

 

Adding websites and pages to your collection

Click the Add Seeds button on your new collection's "Overview" page to add your first websites, feeds, and/or pages:

ait4arch_02_Add-seeds.gif

Each "seed" in your collection directs Archive-It's automated web crawling technology to a URL on the live web, where it can find the material that you want to collect and archive. This may be the home page of a website, a user's social media profile, a single news article or Wikipedia entry, or more.

Each seed can have its own settings or share settings with other seeds, such as public/private accessibility and a recurring crawl schedule. Type or paste one seed URL per line in the "Add Seeds" dialog and select the following from the drop-down menus (an illustrative example of this format follows the list):

  1. Access: Whether the seeds' archives should be visible to the general public or only to you.
  2. Frequency: When any seeds should be archived automatically.
  3. Seed type: Select the "Standard" seed type for most websites with multiple pages or directories. Select the "One Page" option for individual web pages.
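Below is a small, hypothetical Python sketch that checks a pasted seed list for the expected "one URL per line" format. The rule it uses to suggest "Standard" versus "One Page" is our own rough illustration of the distinction described above, not an Archive-It rule:

    # Illustrative sketch: sanity-check seed URLs before pasting them into the
    # "Add Seeds" dialog (one URL per line). Example URLs are hypothetical.
    from urllib.parse import urlparse

    seed_text = """\
    https://example.org/
    https://example.org/news/2023/annual-report.html
    https://social.example.com/some-user
    """

    for line in seed_text.splitlines():
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            print(f"Check this line; it does not look like a full URL: {url!r}")
            continue
        # Rough, unofficial heuristic: a file-like final path segment suggests a
        # single page; a site root or directory suggests a "Standard" seed.
        last_segment = parsed.path.rsplit("/", 1)[-1]
        seed_type = "One Page" if "." in last_segment else "Standard"
        print(f"{url} -> suggested seed type: {seed_type}")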

Setting your "scope"

Because web crawlers could potentially follow links forever if left to their own devices, you can apply some rules to limit how much they collect from each seed in your collection.

After applying one of the seed types above, you can tell the crawler more specifically what is "in scope" through how you format your seed URL. The crawler reads the URL string and scopes in anything that includes that same full string plus additional location information to the right. For instance, a seed of https://example.org/blog/ scopes in https://example.org/blog/2023/post-title but not https://example.org/about.

When you need to, you can also apply more specific scoping rules to seeds or to your entire collection. Click on the "Collection Scope" tab or on any specific seed, followed by the "Seed Scope" tab, to select rule options from the drop-down menu:

ait4arch_03_Scoping.gif

For instance, the rule above enables the crawler to archive the entire MLA website except for the MLA Style Handbook section. You may also expand the scope to include more URLs, impose a data limit, or ignore a robots.txt directive.
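To make the idea of prefix scoping plus a block rule concrete, here is a hedged Python sketch of the decision described above. It is a conceptual illustration only, not Archive-It's actual crawler logic, and the URLs are hypothetical stand-ins for the MLA example:

    # Conceptual model of seed-prefix scoping plus a "block URLs containing..." rule.
    # Not Archive-It's implementation; URLs are hypothetical examples.
    SEED = "https://www.example.org/"          # everything under this prefix is in scope
    BLOCK_IF_CONTAINS = ["/style-handbook/"]   # e.g. exclude one section of the site

    def in_scope(url: str) -> bool:
        if not url.startswith(SEED):
            return False                        # outside the seed's prefix
        return not any(fragment in url for fragment in BLOCK_IF_CONTAINS)

    for url in [
        "https://www.example.org/about/",
        "https://www.example.org/style-handbook/citations",
        "https://other.example.net/page",
    ]:
        print(url, "->", "in scope" if in_scope(url) else "out of scope")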

More resources when you need them:

📄 How to select your "seed" URLs and seed types

📄 How Archive-It crawlers determine scope

📺 Pre-crawl scoping

📄 Scoping guidance for specific types of sites

 

Running a web crawl

When you are ready to collect and archive them, select one or more of the seeds in your list and click the Run Crawl button. Configure your crawl job in the dialog box and click Crawl:

ait4arch_04_Crawling.gif

If this is the first time collecting a seed, we recommend selecting the option above to run a "Test Crawl." This option enables you to review the results of your crawl and either save or discard them before using any of your account's data budget. 

Other best practices for first-time crawl configurations:

  • Select ≤ 10 seeds at a time if possible, to maximize efficiency and ease post-crawl review.
  • Leave the data and document limits blank in order to run your crawl for as long as needed.
  • Select a time limit of 7 days (the crawl will end, and you will be notified, if it collects everything in its scope before reaching this limit).

Once launched, you can monitor the progress of your crawl under the "Crawls" > "Current Crawls" tabs. You will receive an email notification when the crawling process completes, including a link to a full report on everything collected. Follow the instructions below to review the results and determine whether any changes need to be made to collect your seed(s) more successfully.

If your crawl is a "One-Time" or recurring crawl (rather than a test crawl), then your data is stored at this point and can be analyzed with ARCH.

If and when you wish to collect and archive your seeds on a recurring basis, navigate to the "Crawls" > "Scheduled Crawls" tabs and click the "Schedule Crawl" button. You may start crawling immediately or select a specific date and time to begin the recurring process:

ait4arch_05_Scheduling.gif

More resources when you need them:

📺 How and why to run a test crawl

📄 How to schedule recurring crawls

 

Reviewing your crawl results

Once a crawl job completes, you can review it for completeness. This is recommended especially for test crawls and first permanent crawls before any scheduled recurring crawls commence. Follow the directions below to glean the most important information from your crawls' reports, identify any issues, and make interventions where they are helpful.

Reading crawl reports

Follow the link in your email notification or click on the unique ID number under the "Crawls" tab in order to view any crawl job's complete report. There is much more information contained in these reports than you should need at any one time, so you can follow these steps to ensure that the crawl ran properly:

  1. Review the information in your crawl's "Overview" tab in order to make sure that the crawl completed successfully (see "Status") and that the volume of material archived meets your expectations. Errant or obstructed crawls may require that you modify the scope of your collection in order to target the crawler more accurately.

    AIT4ARCH_06_Report-overview.png

    If the crawl status indicates that the crawl ended due to a time, document, or data limit, then you can extend your crawl or run a longer test crawl.

    Archive-It de-duplicates data as it crawls, so you can expect to see a difference between the "total" and "new" data collected each time, especially as crawls recur.

  2. Review the information in your crawl's Seeds report, particularly each "Seed Status," in order to make sure that all of your seeds were crawled successfully, or to see whether any robots.txt exclusions or other errors prevented seeds from being crawled.

    ait4arch_07_Seed-report.png

  3. Review the information in your crawl's Hosts report in order to determine whether any valuable hosts were blocked from crawling by robots.txt exclusions or deemed outside the scope of your crawl. "Queued" documents that do not point to a crawler trap may indicate that the time, document, or data limit on the crawl should be extended in order to capture missing elements.

    AIT4ARCH_09_Hosts-report.png

Saving or deleting test crawls

You will have 60 days after the completion of any test crawl to decide whether to save its contents permanently or delete them. After 60 days, Archive-It deletes them automatically. Make your selection with the banner options at the top of your test crawl's "Overview" tab:

ait4arch_12_Save-test.png

Like the One-Time and scheduled crawls that are stored in your account automatically, test crawls can take up to 24 hours to process completely and replay in Wayback mode after you save them. Once fully stored, this crawl data can also be analyzed with ARCH.

More resources when you need them:

📺 Getting the most from your post crawl reports

📄 What should I check first in my post crawl reports? 

📺 Understanding your Hosts Report

📄 Modify scope and run patch crawls from your report

 

Analyzing your collection with ARCH

Once your data is stored in your Archive-It account, you may explore, analyze, and extract your new web archive collection data with ARCH.

More resources when you need them:

📄 ARCH Help Center

💁 Submit an ARCH support request
