
5 Billion Pages

Ever since I stopped my personal Twitter data collection project, I’ve been mentally casting about for a new dataset to build my Mad Data Skillz ™ (Boyeeeee!). Obviously, just restarting the tweet inflow is an option, but something involving more scale with less work would be nice.

Enter Common Crawl, a non-profit making a large crawl of 5 billion Web pages publicly and freely available on Amazon EC2. How juicy! A big dataset conveniently located within the premier openly available utility computing infrastructure in the world. Definitely has potential to put the Skillz to the test. Common Crawl even has a convenient series of blog posts explaining how to process their page repository.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.