Overview
You can filter or combine the contents of ARCH datasets to create more specific ones. Follow the instructions below to clean sample data with command line tools. You may adapt these instructions and recipes to clean your own ARCH datasets.
ℹ️ To combine and/or filter source data before generating datasets, see: How to create a custom ARCH collection.
Prerequisites
Command line
These instructions require a beginner’s understanding of the command line interface. To get started, read: An Introduction To Using The Command Line Interface To Work With Files And Directories for Mac (PDF) or Windows (PDF).
Data
Download the ARCH tutorial sample materials to follow these instructions. You may adapt them to clean your datasets.
Dependencies
Windows OS users must install Grep for Windows.
Filtering data with Grep
Grep (short for "global regular expression print") is a command line utility that finds and filters lines of data matching a pattern or string that you provide.
Grep options include:
-i Match regardless of character case distinctions.
-v Match lines of data that do not include a provided input pattern.
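-w Match the pattern only where it appears as a whole word.
-c Count the number of lines that match a provided pattern.
-E Interpret the pattern as an extended regular expression, which lets you match any one of several alternatives separated by |.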
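To see how these options behave together, the sketch below runs them against a tiny, made-up file. The file name demo-web-pages.csv and its three columns are assumptions for illustration only, not the layout of a real ARCH dataset.

```shell
# Create a small, made-up sample file (hypothetical columns, not real ARCH data).
printf '%s\n' \
  'crawl_date,url,content' \
  '20200101,http://example.org/a,Sculpture garden opens' \
  '20200102,http://example.org/b,New sculptures on display' \
  '20200103,http://example.org/c,Painting exhibit' > demo-web-pages.csv

# -i ignores case and -c prints the number of matching lines.
any_match=$(grep -ic 'sculpture' demo-web-pages.csv)

# Adding -w matches "sculpture" only as a whole word, so "sculptures" is skipped.
word_match=$(grep -icw 'sculpture' demo-web-pages.csv)

# -v inverts the match: count the lines that do NOT contain the pattern.
non_match=$(grep -vic 'sculpture' demo-web-pages.csv)

echo "any: $any_match, whole word: $word_match, non-matching: $non_match"
```

Note that the header row counts as a non-matching line here, which is worth keeping in mind when you compare counts against the number of data rows.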
Examples
Filter the domain-graph.csv dataset for rows that contain "instagram.com":
grep 'instagram.com' domain-graph.csv > domain-graph_instagram.csv
Filter rows that contain popular social media domains out of the domain-graph.csv dataset:
grep -vE '(,facebook.com,|,instagram.com,|,twitter.com,)' domain-graph.csv > domain-graph_no-social.csv
Filter the web-pages.csv dataset for rows that contain the word "sculpture":
grep -iw 'sculpture' web-pages.csv > web-pages_sculpture.csv
Filter the web-pages.csv dataset for rows that contain the word "covid" or "coronavirus":
grep -iE '(covid|coronavirus)' web-pages.csv > web-pages_covid.csv
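In a regular expression an unescaped "." matches any single character, so a pattern such as 'instagram.com' can also match strings you did not intend. The sketch below, which uses a made-up file named demo-domain-graph.csv with hypothetical columns, compares the default behavior with two stricter alternatives: the -F option (treat the pattern as a fixed string) and an escaped dot.

```shell
# Made-up sample rows (hypothetical columns, not real ARCH data).
printf '%s\n' \
  'crawl_date,source,target,count' \
  '20200101,example.org,instagram.com,4' \
  '20200101,example.org,instagramxcom.net,1' > demo-domain-graph.csv

# Unescaped "." matches any character, so both data rows match.
regex_hits=$(grep -c 'instagram.com' demo-domain-graph.csv)

# -F treats the pattern as a fixed string, so "." matches only a literal dot.
fixed_hits=$(grep -cF 'instagram.com' demo-domain-graph.csv)

# Escaping the dot has the same effect while keeping regular expressions available.
escaped_hits=$(grep -c 'instagram\.com' demo-domain-graph.csv)

echo "regex: $regex_hits, fixed: $fixed_hits, escaped: $escaped_hits"
```

For most datasets the looser pattern is unlikely to cause problems, but escaping dots is a cheap safeguard when precise domain matching matters.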
Data cleaning recipes
Extend, adapt, and combine these examples to clean the data relevant to your inquiry. For example:
Remove adware and ecommerce
grep -vE '(adobe.com|amazon.com|aol.com|doubleclick.net|ebay.com|google.com|list-manage|paypal.com|zendesk.com)' {{input_dataset_name}} > {{output_dataset_name}}
Remove analytics
grep -vE '(0px.gif|blank.gif|clear.gif|cleardot.gif|event.gif|event-tracker|gtm.js|pixel.gif|scoop.it|scorecardresearch|site_alert|statistics.php|trans.gif)' {{input_dataset_name}} > {{output_dataset_name}}
Remove social media widgets
grep -vE '(addthis.com|addtoany.com|digg.|disqus.|facebook.com|facebook.gif|facebook.png|facebook.svg|fb.me|foursquare.com|google.png|googlebookmark.png|hootsuite.|instagram.com|instagram.svg|linkedin.com|myspace.com|pinterest.com|pinterest.png|reddit.com|share.png|sharethis.com|snapchat.com|storify.me|stumbleupon.com|stumbleupon.png|tiktok.com|tumblr.com|tweet.png|twitter.com|twitter.gif|twitter.png|yahoo.com|yelp.com|youtube.com|youtube.png)' {{input_dataset_name}} > {{output_dataset_name}}
Remove web hosting documents
grep -vE '(akamaihd|apple.com|bit.ly|gmpg.org|flickr.com|flickr.png|godaddy.com|gravatar.com|gstatic.com|issuu.com|isu.pub|libsyn.|livestream.com|photobucket.com|scribd.com|sndcdn|soundcloud.com|spotify.com|squarespace.com|tinyurl.|twimg.com|twitpic.com|ustream.tv|vimeo.com|w3.org|wix.com|wp.com|wp.me|ytimg.com|zoom.com)' {{input_dataset_name}} > {{output_dataset_name}}
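Long pattern lists like these can be easier to maintain in a separate file, one pattern per line, and passed to grep with its -f option. The sketch below is one possible way to do that; the file names demo-patterns.txt, demo-urls.txt, and demo-urls_clean.txt and their contents are made up for illustration.

```shell
# One pattern per line; a line matching ANY of them is removed.
# Avoid blank lines in this file: an empty pattern matches every line.
printf '%s\n' 'doubleclick\.net' 'pixel\.gif' 'facebook\.com' > demo-patterns.txt

# Made-up input data (not real ARCH data).
printf '%s\n' \
  'http://example.org/article' \
  'http://ad.doubleclick.net/track' \
  'http://example.org/pixel.gif' \
  'http://www.facebook.com/share' \
  'http://example.org/about' > demo-urls.txt

# -f reads the patterns from a file; -v keeps only lines that match none of them.
grep -vE -f demo-patterns.txt demo-urls.txt > demo-urls_clean.txt

# Count the lines that survived the filter.
kept=$(grep -c '' demo-urls_clean.txt)
echo "kept $kept lines"
```

Keeping each recipe in its own pattern file also makes it straightforward to apply several recipes in sequence by running the same command once per file.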
Combining data files
You can combine ARCH dataset files into single CSV or JSON files with standard utilities on Mac or Windows.
For example, to combine the sample .wane files into a single named entity dataset:
Mac
cat *.wane > named-entities.json
Windows
type *.wane > named-entities.json
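When the files you combine are CSVs that each begin with a header row, plain concatenation repeats that header in the middle of the output. The sketch below shows one way to keep a single header on Mac (or any Unix-like shell), assuming every file shares the same header; the file names and columns are made up for illustration.

```shell
# Two made-up CSV parts with identical headers (not real ARCH data).
printf '%s\n' 'domain,count' 'example.org,3' > demo-part-1.csv
printf '%s\n' 'domain,count' 'example.com,5' > demo-part-2.csv

# Write the header once, taken from the first file.
head -n 1 demo-part-1.csv > demo-combined.csv

# Append the rows of every part, skipping line 1 (the header) of each.
for f in demo-part-*.csv; do
  tail -n +2 "$f" >> demo-combined.csv
done

cat demo-combined.csv
```

The output file name deliberately does not match the demo-part-*.csv pattern, so the loop never reads the file it is writing to.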
More resources
📬 Have a question for our team? Submit a support ticket here.
Recommended tutorials for more advanced data cleaning: