How to clean ARCH datasets

Overview

You can filter or combine the contents of ARCH datasets to create more specific datasets. Follow the instructions below to clean sample data with command line tools. You may adapt these instructions and recipes to clean your own ARCH datasets.

ℹ️ To combine and/or filter source data before generating datasets, see: How to create a custom ARCH collection.

Prerequisites

Command line

These instructions require a beginner’s understanding of the command line interface. To get started, read: An Introduction To Using The Command Line Interface To Work With Files And Directories for Mac (PDF) or Windows (PDF).

Data

Download the ARCH tutorial sample materials to follow these instructions. You may adapt them to clean your datasets.

Dependencies

Windows OS users must install:

  1. GNU awk
  2. Grep for Windows

 

Deduplicating data with AWK

AWK is a programming language for text processing and extraction. It can be used to extract or remove rows from an ARCH dataset that contain specified values. For common deduplication applications, AWK can identify unique values or patterns of values in CSV columns.

 

Examples

Remove duplicate rows from a Plain text of webpages dataset based on URL column value:

awk -F, '!seen[$4]++' input.csv > output.csv
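To see how the expression works, the sketch below runs the same command on a small made-up file. The file names and rows are illustrative assumptions, not real ARCH output. `seen[$4]++` counts how many times each fourth-column value has appeared so far, and the leading `!` prints a row only the first time its value appears, so the header row and the first row for each URL are kept.

```shell
# Build a tiny illustrative CSV (hypothetical rows; the URL is the fourth column).
printf '%s\n' \
  'date,modified,domain,url' \
  '20200101,,example.org,http://example.org/a' \
  '20200102,,example.org,http://example.org/a' \
  '20200103,,example.org,http://example.org/b' > sample.csv

# Keep only the first row seen for each value in column 4.
awk -F, '!seen[$4]++' sample.csv > sample-deduplicated.csv

# Show the result: the header plus one row per unique URL.
cat sample-deduplicated.csv
```

Because `-F,` splits on every comma, this sketch assumes that no column before the URL contains a comma inside quotes. If your data does, the column numbers will shift and a CSV-aware tool may be a better fit.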

 

Remove duplicate rows from a Domain graph dataset based on source and target value pairing:

awk -F, '!seen[$2,$3]++' input.csv > output.csv
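Before writing a deduplicated file, it can help to check how many rows would be dropped. A minimal sketch, assuming a made-up domain graph file with the source in column 2 and the target in column 3 (the file names and rows are hypothetical): removing the leading `!` inverts the filter so that only the repeated rows are printed.

```shell
# Build a tiny illustrative domain graph (hypothetical rows).
printf '%s\n' \
  'crawl_date,source,target,count' \
  '20200101,a.org,b.org,3' \
  '20200102,a.org,b.org,5' \
  '20200101,b.org,a.org,1' > graph-sample.csv

# Without the leading "!", awk prints only rows whose source,target pair was already seen.
awk -F, 'seen[$2,$3]++' graph-sample.csv > graph-duplicates.csv

# Count the rows that deduplication would remove.
wc -l < graph-duplicates.csv
```

Note that the pairing is ordered: a row from a.org to b.org is not treated as a duplicate of a row from b.org to a.org.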

 

Filtering data with Grep

Grep (short for global regular expression print) is a command line utility for finding and filtering lines of data that match a pattern or string. It can be used to extract or remove rows from an ARCH dataset.

Common Grep options include:

-i Match regardless of character case.

-v Match lines of data that do not contain the provided pattern.

-w Match the pattern only as a whole word.

-c Print the number of matching lines instead of the lines themselves.

-E Interpret the pattern as an extended regular expression, which allows alternatives separated by |.
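Options can be combined after a single dash. As a rough illustration, the sketch below uses `-i` and `-c` together to size a filter before writing any output. The file names and rows are made up for the example.

```shell
# Build a tiny illustrative CSV (hypothetical rows).
printf '%s\n' \
  'id,text' \
  '1,A bronze Sculpture in the park' \
  '2,sculpture and more sculpture' \
  '3,Painting only' > text-sample.csv

# -i ignores case; -c prints the number of matching lines.
# Row 2 contains the word twice but is counted once.
grep -ic 'sculpture' text-sample.csv > match-count.txt
cat match-count.txt
```

With this sample the count is 2, because `-c` counts lines rather than individual occurrences.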

 

Examples

Filter a domain-graph.csv dataset for rows that contain "instagram.com":

grep 'instagram.com' input.csv > output.csv
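In a regular expression, `.` matches any single character, so the pattern above would also match a string such as "instagramxcom". If that matters for your data, the `-F` option treats the pattern as a literal string. A small sketch with made-up rows and hypothetical file names:

```shell
# Two illustrative rows: one real match and one near miss.
printf '%s\n' \
  '20200101,example.org,instagram.com,4' \
  '20200101,example.org,instagramxcom.net,1' > dot-sample.csv

# -F treats the pattern as a fixed string, so "." matches only a literal dot.
grep -F 'instagram.com' dot-sample.csv > literal-matches.csv
cat literal-matches.csv
```

Only the first row is kept. Without `-F`, both rows would match.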

 

Remove rows that contain popular social media domains from a domain-graph.csv dataset:

grep -vE '(,facebook.com,|,instagram.com,|,twitter.com,)' input.csv > output.csv
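The grep pattern above removes a row when one of the domains appears in any comma-delimited position. If only one column should be tested, AWK can compare that column directly. The sketch below assumes the target domain is the third column, as in the AWK deduplication example. The file names and rows are hypothetical.

```shell
# Build a tiny illustrative domain graph (hypothetical rows).
printf '%s\n' \
  'crawl_date,source,target,count' \
  '20200101,facebook.com,example.org,2' \
  '20200101,example.org,facebook.com,7' \
  '20200101,example.org,archive.org,1' > graph-filter-sample.csv

# Drop a row only when column 3 (assumed to be the target) is a listed domain.
awk -F, '$3 != "facebook.com" && $3 != "instagram.com" && $3 != "twitter.com"' \
  graph-filter-sample.csv > graph-filter-output.csv
cat graph-filter-output.csv
```

Here the row with facebook.com as the source is kept, because only the target column is checked.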

 

Filter a Plain text of webpages dataset for rows that contain the word "sculpture":

grep -iw 'sculpture' input.csv > output.csv

 

Filter a Plain text of webpages dataset for rows that contain the word "covid" or "coronavirus":

grep -iE '(covid|coronavirus)' input.csv > output.csv

 

Data cleaning recipes

Extend, adapt, and combine these examples to clean the data relevant to your inquiry. For example:

 

Remove adware and ecommerce

grep -vE '(adobe.com|amazon.com|aol.com|doubleclick.net|ebay.com|google.com|list-manage|paypal.com|zendesk.com)' input.csv > output.csv

 

Remove analytics

grep -vE '(0px.gif|blank.gif|clear.gif|cleardot.gif|event.gif|event-tracker|gtm.js|pixel.gif|scoop.it|scorecardresearch|site_alert|statistics.php|trans.gif)' input.csv > output.csv

 

Remove social media widgets

grep -vE '(addthis.com|addtoany.com|digg.|disqus.|facebook.com|facebook.gif|facebook.png|facebook.svg|fb.me|foursquare.com|google.png|googlebookmark.png|hootsuite.|instagram.com|instagram.svg|linkedin.com|myspace.com|pinterest.com|pinterest.png|reddit.com|share.png|sharethis.com|snapchat.com|storify.me|stumbleupon.com|stumbleupon.png|tiktok.com|tumblr.com|tweet.png|twitter.com|twitter.gif|twitter.png|yahoo.com|yelp.com|youtube.com|youtube.png)' input.csv > output.csv

 

Remove web hosting documents

grep -vE '(akamaihd|apple.com|bit.ly|gmpg.org|flickr.com|flickr.png|godaddy.com|gravatar.com|gstatic.com|issuu.com|isu.pub|libsyn.|livestream.com|photobucket.com|scribd.com|sndcdn|soundcloud.com|spotify.com|squarespace.com|tinyurl.|twimg.com|twitpic.com|ustream.tv|vimeo.com|w3.org|wix.com|wp.com|wp.me|ytimg.com|zoom.com)' input.csv > output.csv
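Long pattern lists can be easier to maintain in a separate file, and recipes can be chained with a pipe. A minimal sketch, assuming a hypothetical patterns.txt with one pattern per line and made-up sample rows: `-f` reads the patterns from the file, `-F` treats them as literal strings, and the surviving rows are passed to AWK for deduplication.

```shell
# One pattern per line (a short illustrative subset of the lists above).
printf '%s\n' 'doubleclick.net' 'pixel.gif' 'addthis.com' > patterns.txt

# Build a tiny illustrative domain graph (hypothetical rows).
printf '%s\n' \
  '20200101,example.org,doubleclick.net,9' \
  '20200101,example.org,archive.org,2' \
  '20200102,example.org,archive.org,5' \
  '20200101,example.org,addthis.com,1' > recipe-sample.csv

# -v drops matching rows, -F matches literally, -f reads patterns from a file.
# The remaining rows are then deduplicated on columns 2 and 3.
grep -vFf patterns.txt recipe-sample.csv | awk -F, '!seen[$2,$3]++' > recipe-output.csv
cat recipe-output.csv
```

With this sample, a single row from example.org to archive.org remains.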

 

Combining data files

You can combine ARCH dataset files into single CSV or JSON files with standard utilities on Mac or Windows.

For example, to combine the sample JSONL output files into a single named entity dataset:

Mac

cat *.jsonl > named-entities.jsonl 

Windows

type *.jsonl > named-entities.jsonl
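On Mac or Linux, one way to sanity-check a combined file is to compare line counts. The sketch below uses made-up part files and hypothetical directory names. It also writes the result to a separate directory, so that rerunning the command cannot pick up its own earlier output through the `*.jsonl` pattern.

```shell
# Create two small illustrative part files (hypothetical content).
mkdir -p parts combined
printf '%s\n' '{"id": 1}' '{"id": 2}' > parts/part-00000.jsonl
printf '%s\n' '{"id": 3}' > parts/part-00001.jsonl

# Write the result outside the input directory so a rerun cannot read its own output.
cat parts/*.jsonl > combined/named-entities.jsonl

# The combined line count should equal the total across the parts.
cat parts/*.jsonl | wc -l
wc -l < combined/named-entities.jsonl
```

If the two counts differ, one of the part files may be missing a final newline or may not have matched the pattern.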

 

More resources

📬 Have a question for our team? Submit a support ticket here.

