
Basic Common Crawl Processing

Pavel Repin copiously documents his initial foray into processing the Common Crawl data set:

At my company, we are building infrastructure that enables us to perform computations involving large bodies of text data.

To get familiar with the tech involved, I started with a simple experiment: using the Common Crawl metadata corpus, count crawled URLs grouped by top-level domain (TLD).

It’s not a very exciting query. To be blunt, it’s a pretty boring one. But that’s the point: this is new ground for me, so I’m starting simple and capturing what I’ve learned.

This gist is a fully fledged Git repo with all the code necessary to run this query, so if you want to play with the code yourself, go ahead and clone it.
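The core of the query is simple enough to sketch locally. The snippet below is not the code from the gist; it's a hypothetical illustration of grouping URLs by TLD, using a naive "last dot-separated label of the hostname" rule (a real pipeline would run over the full metadata corpus and would likely want public-suffix-aware parsing):

```python
from collections import Counter
from urllib.parse import urlparse

def tld_counts(urls):
    """Count URLs grouped by top-level domain.

    Naive rule: the TLD is the last dot-separated label of the
    hostname. URLs with no parseable hostname are skipped.
    """
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname
        if host:
            counts[host.rsplit(".", 1)[-1].lower()] += 1
    return counts

urls = [
    "http://example.com/a",
    "https://www.example.org/b",
    "http://sub.example.com/c",
]
print(tld_counts(urls))  # Counter({'com': 2, 'org': 1})
```

At Common Crawl scale the same grouped count would be expressed as a map/reduce job rather than an in-memory loop, but the per-URL logic is the same.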

Via Pete Warden

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.