Overview
The Internet Archive is excited to release version 2.1 of the Archives Research Compute Hub (ARCH). This version introduces new datasets, collection types, and open source code for developers, as well as bug fixes and minor UI improvements. Browse the summaries below to learn more about each update and plans for future releases.
In this release:
Datasets
- Added speech recognition dataset
- Added text recognition dataset
- Added named entity extraction to speech and text recognition datasets
- Added in-app previews for JSON-formatted datasets
- Updated named entities datasets to concatenate output into a single downloadable file
- Updated dataset file naming convention to include unique collection and dataset ID numbers
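For readers who download a JSON-formatted dataset and want a quick look at it outside the in-app preview, the following is a minimal Python sketch. It assumes the downloaded file is line-delimited JSON (one record per line); that layout, the function name `preview_records`, and the `limit` parameter are illustrative assumptions, not part of ARCH itself.

```python
import json
from itertools import islice
from pathlib import Path


def preview_records(path, limit=5):
    """Return up to `limit` parsed records from a line-delimited JSON file.

    Assumption: one JSON object per line. This is a hypothetical helper,
    not an ARCH API.
    """
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        # Skip blank lines so stray newlines do not raise a parse error.
        non_empty = (line for line in handle if line.strip())
        for line in islice(non_empty, limit):
            records.append(json.loads(line))
    return records
```

Calling `preview_records("my-dataset.jsonl", limit=3)` on a local download (the filename here is a placeholder) would return the first three records as Python dictionaries without loading the whole file into memory.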
Collections
Screenshots of the same collection on archive.org (left) and in ARCH (right)
Source code
- Opened GitHub repository for ARCH’s web client, Keystone
Bug fixes
- Fixed an issue that obstructed datasets from combined custom collections
- Fixed instructions to download datasets with the Web Archiving Systems API (WASAPI)
Minor improvements
- Added email alerts for dataset errors
- Added hover context to Google Colab integrations
- Added detail images for collections from Archive-It and archive.org
More information
For the latest information about features that are planned, in research, or in development, see the ARCH development roadmap.
This release was made possible in part by the generous support of the Institute of Museum and Library Services (LG-254878-OLS-23).
Want to learn more?
📬 Be the first to know: Subscribe to ARCH updates.
🗓 Reserve a timeslot now: ARCH Office Hours.