Overview
You can filter or combine the contents of ARCH dataset files to create more specific datasets. Follow the instructions below to clean sample data with command line tools. You may adapt these instructions and recipes to clean your own ARCH datasets.
ℹ️ To combine and/or filter source data before generating datasets, see: How to create a custom ARCH collection.
On this page:
- Prerequisites
- Deduplicating data with AWK
- Filtering data with Grep
- Combining data files
- More resources
Prerequisites
Command line
These instructions require a beginner’s understanding of the command line interface. To get started, read: An Introduction To Using The Command Line Interface To Work With Files And Directories for Mac (PDF) or Windows (PDF).
Data
Download the ARCH tutorial sample materials to follow these instructions. You may adapt them to clean your datasets.
Dependencies
Windows OS users must install:
Deduplicating data with AWK
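AWK is a programming language for text processing and extraction. It can be used to extract or remove rows from an ARCH dataset that contain specified values. For common deduplication tasks, AWK can identify unique values or combinations of values in CSV columns. The examples below split each row on every comma (-F,), so they assume that no field in the columns being compared contains a comma inside quotes.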
Examples
Remove duplicate rows from a Plain text of webpages dataset based on URL column value:
awk -F, '!seen[$4]++' input.csv > output.csv
Remove duplicate rows from a Domain graph dataset based on source and target value pairing:
awk -F, '!seen[$2,$3]++' input.csv > output.csv
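The deduplication pattern above can be tried on a small file before running it on a full dataset. The following is a minimal sketch; the file names (sample.csv, deduped.csv) and the four-column layout are hypothetical and chosen only for illustration, so adjust the column number to match your dataset.

```shell
# Hypothetical sample file; the column layout is invented for illustration.
printf '%s\n' \
  'crawl_date,domain,url,title' \
  '20200101,example.org,http://example.org/a,First' \
  '20200102,example.org,http://example.org/a,First again' \
  '20200103,example.org,http://example.org/b,Second' > sample.csv

# Keep only the first row seen for each value in column 3 (the URL here).
# seen[$3]++ evaluates to 0 the first time a value appears, so
# !seen[$3]++ is true exactly once per distinct value.
awk -F, '!seen[$3]++' sample.csv > deduped.csv

cat deduped.csv
```

Because the header row has its own distinct value in the key column, it passes through unchanged as the first line of the output.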
Filtering data with Grep
Grep (short for "globally search for a regular expression and print") is a command line utility for finding and filtering lines of data that match input patterns or strings. It can be used to extract or remove rows from an ARCH dataset.
Common Grep options include:
-i Ignore character case when matching.
-v Select lines that do not match the provided pattern.
-w Match the pattern only as a whole word.
-c Print a count of the lines that match the pattern instead of the lines themselves.
-E Interpret the pattern as an extended regular expression, which allows alternatives separated by "|".
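To see how these options change what is matched, it can help to run them against a few lines of test data. This sketch uses a hypothetical file named lines.txt with invented contents.

```shell
# Hypothetical test data, three lines.
printf '%s\n' \
  'Sculpture garden opens' \
  'New sculptures arrive' \
  'Painting workshop' > lines.txt

# Count lines containing "sculpture" exactly as typed (case-sensitive).
grep -c 'sculpture' lines.txt

# Count lines containing "sculpture" in any combination of upper and lower case.
grep -ci 'sculpture' lines.txt

# Count lines containing "sculpture" as a whole word, ignoring case;
# "sculptures" does not qualify because of the trailing "s".
grep -ciw 'sculpture' lines.txt

# Print the lines that do not contain "sculpture" in any case.
grep -vi 'sculpture' lines.txt
```

Options can be combined after a single hyphen, as in -ciw, or given separately, as in -c -i -w.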
Examples
Filter a domain-graph.csv dataset for rows that contain "instagram.com":
grep 'instagram.com' input.csv > output.csv
Filter rows that contain popular social media domains out of a domain-graph.csv dataset:
grep -vE '(,facebook.com,|,instagram.com,|,twitter.com,)' input.csv > output.csv
Filter a Plain text of webpages dataset for rows that contain the word "sculpture":
grep -iw 'sculpture' input.csv > output.csv
Filter a Plain text of webpages dataset for rows that contain the word "covid" or "coronavirus":
grep -iE '(covid|coronavirus)' input.csv > output.csv
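One detail worth knowing when adapting these examples: in a regular expression, an unescaped "." matches any single character, so a pattern such as 'instagram.com' can match slightly more than the literal domain. In practice this rarely matters, but if you need exact matches you can escape the dot or use the -F option, which treats the pattern as a fixed string. The sketch below uses a hypothetical graph.csv with invented rows to show the difference.

```shell
# Hypothetical two-row file; the second row is not instagram.com.
printf '%s\n' \
  'a.org,instagram.com,5' \
  'a.org,instagramxcom.net,2' > graph.csv

# The unescaped dot matches any character, so both rows are counted.
grep -c 'instagram.com' graph.csv

# -F treats the pattern as a literal string, so only the first row is counted.
grep -cF 'instagram.com' graph.csv

# Escaping the dot gives the same result while keeping regular expressions available.
grep -c 'instagram\.com' graph.csv
```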
Data cleaning recipes
Extend, adapt, and combine these examples to clean the data relevant to your inquiry. For example:
Remove adware and ecommerce
grep -vE '(adobe.com|amazon.com|aol.com|doubleclick.net|ebay.com|google.com|list-manage|paypal.com|zendesk.com)' input.csv > output.csv
Remove analytics
grep -vE '(0px.gif|blank.gif|clear.gif|cleardot.gif|event.gif|event-tracker|gtm.js|pixel.gif|scoop.it|scorecardresearch|site_alert|statistics.php|trans.gif)' input.csv > output.csv
Remove social media widgets
grep -vE '(addthis.com|addtoany.com|digg.|disqus.|facebook.com|facebook.gif|facebook.png|facebook.svg|fb.me|foursquare.com|google.png|googlebookmark.png|hootsuite.|instagram.com|instagram.svg|linkedin.com|myspace.com|pinterest.com|pinterest.png|reddit.com|share.png|sharethis.com|snapchat.com|storify.me|stumbleupon.com|stumbleupon.png|tiktok.com|tumblr.com|tweet.png|twitter.com|twitter.gif|twitter.png|yahoo.com|yelp.com|youtube.com|youtube.png)' input.csv > output.csv
Remove web hosting documents
grep -vE '(akamaihd|apple.com|bit.ly|gmpg.org|flickr.com|flickr.png|godaddy.com|gravatar.com|gstatic.com|issuu.com|isu.pub|libsyn.|livestream.com|photobucket.com|scribd.com|sndcdn|soundcloud.com|spotify.com|squarespace.com|tinyurl.|twimg.com|twitpic.com|ustream.tv|vimeo.com|w3.org|wix.com|wp.com|wp.me|ytimg.com|zoom.com)' input.csv > output.csv
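Long pattern lists like the ones above can be hard to edit on a single line. One possible alternative, sketched here, is to keep one pattern per line in a text file and pass that file to grep with the -f option. The file names (exclude-patterns.txt, links.csv, links-clean.csv) and the sample rows are hypothetical.

```shell
# Hypothetical pattern file: one extended regular expression per line.
printf '%s\n' 'doubleclick\.net' 'facebook\.com' 'pixel\.gif' > exclude-patterns.txt

# Hypothetical sample data.
printf '%s\n' \
  'source,target' \
  'a.org,doubleclick.net' \
  'a.org,b.org' \
  'c.org,facebook.com' > links.csv

# -f reads the patterns from a file; -v keeps only rows that match none of them.
grep -vE -f exclude-patterns.txt links.csv > links-clean.csv

cat links-clean.csv
```

Keeping the patterns in a file also makes it easier to reuse the same exclusion list across several datasets.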
Combining data files
You can combine ARCH dataset files into single CSV or JSON files with standard utilities on Mac or Windows.
For example, to combine the sample JSONL output files into a single named entity dataset:
Mac
cat *.jsonl > named-entities.jsonl
Windows
type *.jsonl > named-entities.jsonl
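When combining CSV files rather than JSONL files, simple concatenation repeats the header row once for each input file. The following sketch, which assumes every input file begins with the same single header row, uses AWK on Mac or Linux to keep only the first header. The file names (part-1.csv, part-2.csv, combined.csv) are hypothetical.

```shell
# Hypothetical input files, each with the same header row.
printf '%s\n' 'id,name' '1,a' > part-1.csv
printf '%s\n' 'id,name' '2,b' > part-2.csv

# NR counts lines across all input files; FNR restarts at 1 for each file.
# A line is printed if it is the very first line overall (the first header)
# or if it is not the first line of its file (a data row).
awk 'FNR > 1 || NR == 1' part-*.csv > combined.csv

cat combined.csv
```

The output name is chosen so that it does not match the part-*.csv pattern, which keeps the combined file from being read as one of its own inputs if the command is run again.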
More resources
📬 Have a question for our team? Submit a support ticket here.
Recommended tutorials for more advanced data cleaning: