How to clean ARCH datasets

Overview

You can filter or combine the contents of ARCH datasets to create more specific datasets. Follow the instructions below to clean sample data with command line tools, then adapt the instructions and recipes to clean your own ARCH datasets.

ℹ️ To combine and/or filter source data before generating datasets, see: How to create a custom ARCH collection.

Prerequisites

Command line

These instructions require a beginner’s understanding of the command line interface. To get started, read: An Introduction To Using The Command Line Interface To Work With Files And Directories for Mac (PDF) or Windows (PDF).

Data

Download the ARCH tutorial sample materials to follow these instructions. You may adapt them to clean your datasets.

Dependencies

Windows OS users must install Grep for Windows.

Filtering data with Grep

Grep (short for "global regular expression print") is a command line utility that finds and filters lines of data matching an input pattern or string.

Grep options include:

-i Match regardless of character case.

-v Match lines of data that do not include the provided input pattern.

-w Match whole words only.

-c Count the number of lines that match the provided pattern, instead of printing them.

-E Interpret the pattern as an extended regular expression, which allows several alternative patterns separated by "|".
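To see how these options behave, the following sketch runs each one against a small, made-up text file. The file name sample.txt and its contents are illustrative assumptions, not part of the ARCH sample materials, and the commands assume a Unix-like shell.

```shell
# Create a small, made-up sample file (not ARCH data) in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  'Sculpture garden opens' \
  'New sculptures arrive' \
  'Painting workshop' \
  'COVID update' > sample.txt

# -i: ignore case, so "Sculpture" and "sculptures" both match.
grep -i 'sculpture' sample.txt

# -w: match whole words only, so "sculptures" no longer matches.
grep -iw 'sculpture' sample.txt

# -v: keep only the lines that do NOT match.
grep -iv 'sculpture' sample.txt

# -c: print the number of matching lines instead of the lines themselves.
grep -ic 'sculpture' sample.txt

# -E: extended regular expressions, with alternatives separated by "|".
grep -iE '(painting|covid)' sample.txt
```

Options can be combined after a single dash, as in -iw or -vE.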

Examples

Filter the domain-graph.csv dataset for rows that contain "instagram.com":

grep 'instagram.com' domain-graph.csv > domain-graph_instagram.csv

Filter rows that contain popular social media domains out of the domain-graph.csv dataset:

grep -vE '(,facebook.com,|,instagram.com,|,twitter.com,)' domain-graph.csv > domain-graph_no-social.csv

Filter the web-pages.csv dataset for rows that contain the word "sculpture":

grep -iw 'sculpture' web-pages.csv > web-pages_sculpture.csv

Filter the web-pages.csv dataset for rows that contain the word "covid" or "coronavirus":

grep -iE '(covid|coronavirus)' web-pages.csv > web-pages_covid.csv
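In the patterns above, an unescaped dot is a regular expression wildcard that matches any single character, so 'instagram.com' can match more than the literal domain. The sketch below illustrates this with made-up rows in a hypothetical file named graph.csv (not the ARCH sample data), and shows two ways to require a literal dot: escaping it, or using the -F option to treat the pattern as a fixed string.

```shell
# Made-up rows for illustration only.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  '20200101,example.org,instagram.com,12' \
  '20200101,example.org,instagram-community.net,3' \
  '20200101,example.org,facebook.com,7' > graph.csv

# Unescaped dot: "." matches any character, so "instagram-com..." also matches.
grep -c 'instagram.com' graph.csv    # prints 2

# Escaped dot: only the literal text "instagram.com" matches.
grep -c 'instagram\.com' graph.csv   # prints 1

# -F: treat the whole pattern as a fixed string rather than a regular expression.
grep -cF 'instagram.com' graph.csv   # prints 1
```

For most datasets the difference is small, but counting matches with -c before writing an output file is a quick way to check that a pattern selects what you expect.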

Data cleaning recipes

Extend, adapt, and combine these examples to clean the data relevant to your inquiry. For example:

Remove adware and ecommerce

grep -vE '(adobe.com|amazon.com|aol.com|doubleclick.net|ebay.com|google.com|list-manage|paypal.com|zendesk.com)' {{input_dataset_name}} > {{output_dataset_name}}

Remove analytics

grep -vE '(0px.gif|blank.gif|clear.gif|cleardot.gif|event.gif|event-tracker|gtm.js|pixel.gif|scoop.it|scorecardresearch|site_alert|statistics.php|trans.gif)' {{input_dataset_name}} > {{output_dataset_name}}

Remove social media widgets

grep -vE '(addthis.com|addtoany.com|digg.|disqus.|facebook.com|facebook.gif|facebook.png|facebook.svg|fb.me|foursquare.com|google.png|googlebookmark.png|hootsuite.|instagram.com|instagram.svg|linkedin.com|myspace.com|pinterest.com|pinterest.png|reddit.com|share.png|sharethis.com|snapchat.com|storify.me|stumbleupon.com|stumbleupon.png|tiktok.com|tumblr.com|tweet.png|twitter.com|twitter.gif|twitter.png|yahoo.com|yelp.com|youtube.com|youtube.png)' {{input_dataset_name}} > {{output_dataset_name}}

Remove web hosting documents

grep -vE '(akamaihd|apple.com|bit.ly|gmpg.org|flickr.com|flickr.png|godaddy.com|gravatar.com|gstatic.com|issuu.com|isu.pub|libsyn.|livestream.com|photobucket.com|scribd.com|sndcdn|soundcloud.com|spotify.com|squarespace.com|tinyurl.|twimg.com|twitpic.com|ustream.tv|vimeo.com|w3.org|wix.com|wp.com|wp.me|ytimg.com|zoom.com)' {{input_dataset_name}} > {{output_dataset_name}}
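Long recipes like these can be easier to maintain when the patterns live in a separate file, one per line, and are passed to grep with the -f option. The following is a minimal sketch under that assumption; the file names input.csv, output.csv, and remove-patterns.txt and the rows themselves are made up for illustration.

```shell
# Made-up rows for illustration only.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' \
  '20200101,example.org,doubleclick.net,4' \
  '20200101,example.org,museum.example,9' \
  '20200101,example.org,twitter.com,2' \
  '20200101,example.org,library.example,5' > input.csv

# One pattern per line; a row is removed if it matches any of them.
printf '%s\n' 'doubleclick\.net' 'twitter\.com' > remove-patterns.txt

# -f reads the patterns from the file; -v keeps only the rows that match none.
grep -vE -f remove-patterns.txt input.csv > output.csv

# Compare row counts before and after filtering.
wc -l input.csv output.csv
```

Avoid blank lines in the pattern file: an empty pattern matches every row, so with -v the output would be empty.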

Combining data files

You can combine ARCH dataset files into single CSV or JSON files with standard utilities on Mac or Windows.

For example, to combine the sample .wane files into a single named entity dataset:

Mac

cat *.wane > named-entities.json 

Windows

type *.wane > named-entities.json
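After combining files, it can be worth confirming that no rows were lost. The sketch below, written for a Unix-like shell, builds two made-up .wane files (the file names and their one-record-per-line contents are assumptions for illustration), combines them, and compares line counts.

```shell
# Made-up input files for illustration only.
workdir=$(mktemp -d)
cd "$workdir"
printf '%s\n' '{"url":"http://a.example/"}' > part-00000.wane
printf '%s\n' '{"url":"http://b.example/"}' '{"url":"http://c.example/"}' > part-00001.wane

# Combine every .wane file in the current directory into one file.
cat *.wane > named-entities.json

# The two counts below should be equal if nothing was lost.
cat *.wane | wc -l
wc -l < named-entities.json
```

Give the combined file a different extension from the source files, as here, so that running the command again does not pick up the combined file as an input.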

More resources

📬 Have a question for our team? Submit a support ticket here.

Recommended tutorials for more advanced data cleaning:

 

 
