Release notes: ARCH 2.1

Karl Blumenthal

Updated January 21, 2025 14:13

Overview

The Internet Archive is excited to release version 2.1 of the Archives Research Compute Hub (ARCH). This version introduces new datasets, collection types, and open source code for developers, as well as bug fixes and minor UI improvements. Browse the summaries below to learn more about each update and plans for future releases.

In this release:

Datasets
Collections
Source code
Bug fixes
Minor improvements
More information

Datasets

Added speech recognition dataset
Added text recognition dataset
Added named entity extraction to speech and text recognition datasets
Added in-app previews for JSON-formatted datasets
Updated named entities datasets to concatenate output into a single downloadable file
Updated dataset file naming convention to include unique collection and dataset ID numbers

Collections

Added support for archive.org collections

Screenshots of the same collection on archive.org (left) and in ARCH (right)

Source code

Opened GitHub repository for ARCH’s web client, Keystone

Bug fixes

Fixed an issue that obstructed datasets from combined custom collections
Fixed instructions to download datasets with the Web Archiving Systems API (WASAPI)

Minor improvements

Added email alerts for dataset errors
Added hover context to Google Colab integrations
Added detail images for collections from Archive-It and archive.org

More information

For the latest information about features that are planned, in research, or in development, see the ARCH development roadmap.

This release was made possible in part by the generous support of the Institute of Museum and Library Services (LG-254878-OLS-23).

Want to learn more?

📬 Be the first to know: Subscribe to ARCH updates.

🗓 Reserve a timeslot now: ARCH Office Hours.