Overview
ARCH named entities datasets contain the people, places, organizations, and dates from the text of a web archive collection, organized by originating URL and timestamp. They enable researchers to find and analyze the named entities in a collection over time. Extracting named entities from a web archive collection also enhances opportunities for discovery and sharing through linked data.
ARCH currently supports named entity extraction in English or Chinese using the Stanford Named Entity Recognizer (NER), with more models and languages planned.
Example uses
Named entities in the Human Rights collection
For the example above, named entity data was derived from one month (October 2014) of crawls in Columbia University's Human Rights collection, representing more than 300GB of web data.
Top entities in the Ferguson collection
Named entities were extracted from four months of crawls in the Internet Archive's collaborative collection of URLs related to events in Ferguson, MO. The top ten person names, among over 650,000 total in the collection, highlighted both expected and unexpected results.
Top entities in the Chicago Architecture Biennial collection
Web archivist Karl Blumenthal used a named entity dataset to determine the most discussed designers among press and social media coverage of the first Chicago Architecture Biennial. In this blog post, he describes how the results were achieved and what counter-narratives they offer to contemporaneous press coverage.
Technical details
A named entity dataset encodes entities from any textual document in a collection with a 200 HTTP response code. Records are organized by URL and include the CDX/C attributes from each source in addition to its named entities. Source code for named entity extraction is available here in the ARCH GitHub repository.
An example English language record corresponding to this archived web page looks like this:
{"record":{"redirectUrl":"-","timestamp":"20240201150017","digest":"sha1:smohlwyiu2ja5tpwscj2sibq6t4wi4la","originalUrl":"https://queerchinauk.com/","surtUrl":"com,queerchinauk)/","mime":"text/html","meta":"-","status":200},"payload":{"string":{"html":{"body":{"text":{"entities":{"persons":["Jamie Chi","Anna Shvets"],"organizations":["Queer China UK Zine Project Queer China UK","National Gallery","Transnational Chinese Queer Leadership Program","Museum of Transology"],"locations":["London","Taiwan","Hongkong","Mainland China","UK","West End"],"dates":["2022","2021","2023"]}}}}}}}
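Reading a record like the one above might look roughly like the following Python sketch. It assumes the dataset is available as one JSON record per line and that every record uses the nesting shown above; the function names, the make_line helper, and the second sample record (example.org) are hypothetical and exist only for illustration. The first sample is abbreviated from the record above.

```python
import json
from collections import Counter
from datetime import datetime


def read_record(line):
    """Parse one JSON record line into its URL, capture time, and entities."""
    rec = json.loads(line)
    meta = rec["record"]
    # Entities sit under the nested payload path shown in the example record.
    entities = rec["payload"]["string"]["html"]["body"]["text"]["entities"]
    return {
        "url": meta["originalUrl"],
        # Timestamps are 14 digits: YYYYMMDDhhmmss.
        "captured": datetime.strptime(meta["timestamp"], "%Y%m%d%H%M%S"),
        "entities": entities,
    }


def top_entities(lines, kind="persons", n=10):
    """Count how many records mention each entity of the given kind."""
    counts = Counter()
    for line in lines:
        if line.strip():
            counts.update(read_record(line)["entities"].get(kind, []))
    return counts.most_common(n)


def make_line(url, timestamp, entities):
    """Hypothetical helper: build a record line in the layout shown above."""
    return json.dumps({
        "record": {"originalUrl": url, "timestamp": timestamp, "status": 200},
        "payload": {"string": {"html": {"body": {"text": {"entities": entities}}}}},
    })


lines = [
    # Abbreviated from the example record above.
    make_line("https://queerchinauk.com/", "20240201150017",
              {"persons": ["Jamie Chi", "Anna Shvets"],
               "locations": ["London", "Taiwan", "UK"]}),
    # Invented record, for demonstration only.
    make_line("https://example.org/", "20240202000000",
              {"persons": [], "locations": ["London"]}),
]

print(top_entities(lines, kind="locations", n=1))
```

Because each record lists an entity once per captured document, these counts reflect the number of documents mentioning an entity rather than the number of mentions within the text.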
Languages
To generate a dataset in your preferred language (currently Chinese or English), select it from the drop-down menu under Language.
An example Chinese language record corresponding to this archived web page looks like this:
{"record":{"redirectUrl":"-","timestamp":"20200509121841","digest":"sha1:lwqase64e4xrt3anqkovuo62stupxabv","originalUrl":"http://www.guomedia.org/2020/05/blog-post_86.html","surtUrl":"org,guomedia)/2020/05/blog-post_86.html","mime":"text/html","meta":"-","status":200},"payload":{"string":{"html":{"body":{"text":{"entities":{"persons":["杜特尔特","副国安","金正恩","刘冰案","陈破空",",","图巴鲁","萧铭","孙大骆","傅政华","习近平","马克龙","崔天凯","李洪志","邓炳强","石涛","王毅","邓丽君","麦燕庭","莫里森","彭培奥","陈","亨特","孙力军","孙力","福雷斯特","黎智英","福雷斯特早","普京","乔良","(","赖建平","孙杨","林郑促立","郭"],"organizations":["中共","欧盟","司法部长","海外版 香港放宽防疫禁令 港澳办","法轮功","白宫","中国外交部军控司","中共海军","中国日报审查欧盟","英议会保守党","路透社","孔子学院","世卫组织","中共解放军","中共军方","国安内部","央视","美国白宫","国会","中共公安部","欧盟驻华代表团","白宫审查华为","美军","福雷斯特","卫生部"],"locations":["欧美","欧洲","中东","亚太","欧","中南海","东海","台海"],"dates":[]}}}}}}}
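The record above includes a few entries that are only punctuation, such as "," and "(" in the persons list. One possible way to drop such entries before analysis is sketched below. This is an assumption about a reasonable post-processing step, not part of ARCH itself, and the function names are hypothetical.

```python
import unicodedata


def is_noise(entity):
    """True if the string contains no letter or digit characters at all."""
    # Unicode general categories starting with "L" are letters (including
    # Chinese characters) and those starting with "N" are numbers.
    return not any(unicodedata.category(ch)[0] in ("L", "N") for ch in entity)


def clean_entities(entities):
    """Strip whitespace, drop punctuation-only entries, and remove duplicates."""
    cleaned = {}
    for kind, values in entities.items():
        seen = set()
        kept = []
        for value in values:
            value = value.strip()
            if is_noise(value) or value in seen:
                continue
            seen.add(value)
            kept.append(value)
        cleaned[kind] = kept
    return cleaned


# A few values taken from the persons list in the record above.
sample = {"persons": ["习近平", ",", "(", "陈"], "dates": []}
print(clean_entities(sample))
```

This filter removes only entries with no letters or digits. Partial or merged names, such as a surname on its own, are left in place because deciding what to do with them requires judgment about the collection.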
History
The Named Entities dataset was introduced in 2014 along with the Web Archive Transformation (WAT) and Longitudinal Graph Analysis (LGA) datasets that comprised the legacy Archive Research Services (ARS). Prior to June 2024, ARCH also produced English language Named Entity datasets in the WANE format, with a ".wane" file extension.
The WANE dataset encoded entities from any textual document in a collection with a 200 HTTP response code. Records were organized by URL and each included a timestamp and unique checksum value for the source in addition to its named entities. Source code for creating WANE files is still available here in the ARCH GitHub repository.
An example English language WANE record corresponding to this archived web page looks like this:
{"url":"http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93","timestamp":"20141019212346","named_entities":{"locations":["North County","America","St. Louis County St. Louis County Police St. Louis County","St. Louis","WordPress.com","Middle East"],"organizations":["Dissonant Winston Smith Dissonant Winston Smith Menu Skip","Twitter Facebook Google","Google","Facebook","Wal-Mart","CNN","Bearcats"],"persons":["Stell","Tom Jackson","Smith","Pamela Fillingim","Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson","Ferguson","Rob Crawford","Kley","Erin Miller","darren wilson","Mike","Daniel Garrelts","Darren Wilson","Rath","Ellis Wyatt","Nick","Wilson","Mike Browns","Trayvon","Jane Jacoby","Kley Potter","Mike Brown","Michael","Michael Brown","Angela","Pablo","Jon Stewart","George Zimmerman Jamilah Nasheed KTVI","mike brown","Heather","Pamela fillingim","pamela fillingim","Susan"]},"digest":"sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX"}
WANE files mapped one-to-one to the original .arc and/or .warc files, so a 0-byte WANE file was possible when the corresponding W/ARC file had no recognizable named entities.
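The WANE record shown above is flatter than the current format: the URL, timestamp, and digest sit at the top level and the entities are under "named_entities". The Python sketch below reads that layout and merges names that differ only by letter case, such as "Darren Wilson" and "darren wilson" in the example. It assumes one JSON record per line, skips empty input so that a 0-byte file yields no records, and uses hypothetical function names. The sample is abbreviated from the record above.

```python
import json


def read_wane_record(line):
    """Parse one WANE record line into a flat dictionary."""
    rec = json.loads(line)
    return {
        "url": rec["url"],
        "timestamp": rec["timestamp"],
        "digest": rec["digest"],
        "entities": rec.get("named_entities", {}),
    }


def read_wane_lines(lines):
    """Yield parsed records, skipping blank lines (a 0-byte file yields none)."""
    for line in lines:
        if line.strip():
            yield read_wane_record(line)


def merge_case_variants(names):
    """Return unique names compared case-insensitively, in first-seen order."""
    seen = set()
    merged = []
    for name in names:
        key = name.casefold()
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


# Abbreviated from the example WANE record above.
wane_line = json.dumps({
    "url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/"
           "?like_comment=79&_wpnonce=0fc57aa499&replytocom=93",
    "timestamp": "20141019212346",
    "named_entities": {
        "persons": ["Darren Wilson", "darren wilson", "Mike Brown", "mike brown"],
    },
    "digest": "sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX",
})

for record in read_wane_lines([wane_line, ""]):
    print(record["timestamp"], merge_case_variants(record["entities"]["persons"]))
```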