How to clean ARCH datasets

Karl Blumenthal

Updated April 22, 2025 15:29

Overview

You can filter or combine the contents of ARCH datasets to create more specific datasets. Follow the instructions below to clean sample data with command line tools, then adapt the instructions and recipes to your own ARCH datasets.

ℹ️ To combine and/or filter source data before generating datasets, see: How to create a custom ARCH collection.

On this page:

Prerequisites
Deduplicating data with AWK
Filtering data with Grep
Combining data files
More resources

Prerequisites

Command line

These instructions require a beginner’s understanding of the command line interface. To get started, read: An Introduction To Using The Command Line Interface To Work With Files And Directories for Mac (PDF) or Windows (PDF).

Data

Download the ARCH tutorial sample materials to follow these instructions. You may adapt them to clean your datasets.

Dependencies

Windows OS users must install:

Deduplicating data with AWK

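AWK is a programming language for text processing and extraction. It can be used to extract or remove rows from an ARCH dataset that contain specified values. For common deduplication applications, AWK can identify unique values or patterns of values in CSV columns. Note that the -F, option used below splits each row on every comma, so these commands assume that no field before the column you deduplicate on contains a comma of its own.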

Examples

Remove duplicate rows from a Plain text of webpages dataset based on URL column value:

awk -F, '!seen[$4]++' input.csv > output.csv

Remove duplicate rows from a Domain graph dataset based on source and target value pairing:

awk -F, '!seen[$2,$3]++' input.csv > output.csv
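
To see how the seen array behaves, here is a minimal sketch on a made-up domain graph file. The file name sample-domain-graph.csv, the column layout (crawl date, source, target, count), and the rows themselves are assumptions for illustration, not real ARCH output.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample: columns are crawl_date,source,target,count.
cat > sample-domain-graph.csv <<'EOF'
20200101,example.org,instagram.com,4
20200102,example.org,instagram.com,7
20200102,example.org,twitter.com,2
20200103,museum.example,instagram.com,1
EOF

# seen[$2,$3] is 0 the first time a source/target pair appears, so
# !seen[$2,$3]++ is true only for that first row, and awk prints it.
awk -F, '!seen[$2,$3]++' sample-domain-graph.csv > deduplicated.csv

cat deduplicated.csv
```

Only the first row for each pair is kept, so the counts on later duplicate rows are dropped rather than added together.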

Filtering data with Grep

Grep (global regular expression search and print) is a command line utility for finding and filtering data with input patterns or strings. It can be used to extract or remove rows from an ARCH dataset.

Common Grep options include:

-i     Match regardless of character case distinctions.

-v     Match lines of data that do not include a provided input pattern.

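-w     Match the pattern only as a whole word.

-c     Print a count of the lines that match a provided pattern instead of the lines themselves.

-E     Interpret the pattern as an extended regular expression, which allows alternatives separated by |.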
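
The options can be compared side by side on a tiny file. This is a sketch only: sample.txt and its four lines are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample, one record per line.
cat > sample.txt <<'EOF'
Sculpture garden opens
New sculptures arrive
COVID closure notice
Gift shop hours
EOF

# -c counts matching lines; without -i only the lowercase spelling matches.
grep -c 'sculpture' sample.txt > count_case_sensitive.txt
# -i ignores case, so "Sculpture" is counted as well.
grep -ci 'sculpture' sample.txt > count_ignore_case.txt
# -w requires a whole word, so "sculptures" is skipped.
grep -iw 'sculpture' sample.txt > whole_word.txt
# -v keeps only the lines that do not match.
grep -v 'sculpture' sample.txt > not_matching.txt
# -E enables | so either alternative can match.
grep -iE 'covid|gift' sample.txt > either_pattern.txt
```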

Examples

Filter a domain-graph.csv dataset for rows that contain "instagram.com":

grep 'instagram.com' input.csv > output.csv

Filter rows that contain popular social media domains out of a domain-graph.csv dataset:

grep -vE '(,facebook.com,|,instagram.com,|,twitter.com,)' input.csv > output.csv
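
In a regular expression, an unescaped . matches any single character, so a pattern such as instagram.com can also match other text. If you need literal matching, one option is the -F (fixed strings) flag with one -e per string. The sketch below uses a made-up sample file and a hypothetical domain, instagram-community.example, to show the difference.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical rows: crawl_date,source,target,count.
cat > sample-domain-graph.csv <<'EOF'
20200101,example.org,instagram.com,3
20200101,example.org,instagram-community.example,5
20200101,example.org,archive.org,2
EOF

# Regex match: "." matches the "-" too, so both instagram rows are kept.
grep 'instagram.com' sample-domain-graph.csv > regex-match.csv

# Fixed-string match: -F treats the pattern literally, so only the first row is kept.
grep -F 'instagram.com' sample-domain-graph.csv > literal-match.csv

# Several literal strings at once: one -e per string, with -v to remove matching rows.
grep -vF -e ',facebook.com,' -e ',instagram.com,' -e ',twitter.com,' \
  sample-domain-graph.csv > literal-filtered.csv
```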

Filter a Plain text of webpages dataset for rows that contain the word "sculpture":

grep -iw 'sculpture' input.csv > output.csv

Filter a Plain text of webpages dataset for rows that contain "covid" or "coronavirus", in any letter case and including as part of a longer word:

grep -iE '(covid|coronavirus)' input.csv > output.csv
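
Grep treats every line alike, so a header row is dropped unless it happens to match the pattern. If your CSV starts with a header row (an assumption; check your own file), a minimal sketch for keeping it looks like this. The header names and rows below are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample; the header names are assumptions.
cat > input.csv <<'EOF'
crawl_date,url,content
20200315,https://example.org/a,Museum closes due to coronavirus
20200316,https://example.org/b,New sculpture unveiled
20200317,https://example.org/c,COVID testing site opens
EOF

# Copy the header row first, then append only the matching data rows.
head -n 1 input.csv > output.csv
tail -n +2 input.csv | grep -iE '(covid|coronavirus)' >> output.csv
```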

Data cleaning recipes

Extend, adapt, and combine these examples to clean the data relevant to your inquiry. For example:

Remove adware and ecommerce

grep -vE '(adobe.com|amazon.com|aol.com|doubleclick.net|ebay.com|google.com|list-manage|paypal.com|zendesk.com)' input.csv > output.csv

Remove analytics

grep -vE '(0px.gif|blank.gif|clear.gif|cleardot.gif|event.gif|event-tracker|gtm.js|pixel.gif|scoop.it|scorecardresearch|site_alert|statistics.php|trans.gif)' input.csv > output.csv

Remove social media widgets

grep -vE '(addthis.com|addtoany.com|digg.|disqus.|facebook.com|facebook.gif|facebook.png|facebook.svg|fb.me|foursquare.com|google.png|googlebookmark.png|hootsuite.|instagram.com|instagram.svg|linkedin.com|myspace.com|pinterest.com|pinterest.png|reddit.com|share.png|sharethis.com|snapchat.com|storify.me|stumbleupon.com|stumbleupon.png|tiktok.com|tumblr.com|tweet.png|twitter.com|twitter.gif|twitter.png|yahoo.com|yelp.com|youtube.com|youtube.png)' input.csv > output.csv

Remove web hosting documents

grep -vE '(akamaihd|apple.com|bit.ly|gmpg.org|flickr.com|flickr.png|godaddy.com|gravatar.com|gstatic.com|issuu.com|isu.pub|libsyn.|livestream.com|photobucket.com|scribd.com|sndcdn|soundcloud.com|spotify.com|squarespace.com|tinyurl.|twimg.com|twitpic.com|ustream.tv|vimeo.com|w3.org|wix.com|wp.com|wp.me|ytimg.com|zoom.com)' input.csv > output.csv
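
Long pattern lists like the recipes above can be easier to maintain in a separate file. As a sketch, grep's -f option reads one pattern per line from a file, and -F treats each one as a literal string. The file name remove-patterns.txt and the sample rows are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical pattern file: one literal string per line, no blank lines
# (a blank line would match every row).
cat > remove-patterns.txt <<'EOF'
doubleclick.net
pixel.gif
addthis.com
EOF

# Hypothetical sample rows: crawl_date,source,target,count.
cat > input.csv <<'EOF'
20200101,example.org,doubleclick.net,9
20200101,example.org,archive.org,2
20200101,example.org,addthis.com,4
EOF

# -f reads the patterns from the file, -F matches them literally,
# and -v removes every row that contains any of them.
grep -vFf remove-patterns.txt input.csv > output.csv
```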

Combining data files

You can combine ARCH dataset files into single CSV or JSON files with standard utilities on Mac or PC.

For example, to combine the sample JSONL output files into a single named entity dataset:

Mac

cat *.jsonl > named-entities.jsonl

Windows

type *.jsonl > named-entities.jsonl
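
Because the output name named-entities.jsonl itself matches *.jsonl, running the command again in the same folder can cause an error or pull the earlier output back into the input. One way to avoid that on Mac or Linux, sketched below, is to write the result to a separate folder and then compare line counts. The part file names and the combined folder are assumptions for illustration.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical part files standing in for downloaded JSONL output.
printf '%s\n' '{"id": 1}' '{"id": 2}' > part-00000.jsonl
printf '%s\n' '{"id": 3}' > part-00001.jsonl

# Write the result to a folder that the *.jsonl pattern does not reach.
mkdir -p combined
cat *.jsonl > combined/named-entities.jsonl

# The combined line count should equal the total across the part files.
parts_total=$(cat *.jsonl | wc -l | tr -d ' ')
combined_total=$(wc -l < combined/named-entities.jsonl | tr -d ' ')
echo "parts: $parts_total, combined: $combined_total"
```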

More resources

📬 Have a question for our team? Submit a support ticket here.

Recommended tutorials for more advanced data cleaning:
