
Commodity Focused Crawling?

Speaking of outsourced web crawling, I wonder if it would be possible to build a focused crawler on top of the 80legs infrastructure. A lot has changed since Filippo Menczer first introduced the concept: sophisticated client-side web programming, cloud computing, social media. Given today’s vast sprawl of the Web, there are a lot of tasks where high topical precision, even at the cost of completely forsaking recall, would be really useful.

As to implementation, you probably can’t get into the retrieval loop as tightly as you could with a custom crawler, but being released from the headaches of actual page fetching could free up thought cycles for creative foraging approaches. Outsourcing the hard parts would probably also radically improve reliability and availability.
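To make the split concrete, here is a rough sketch of a best-first focused crawl loop where page fetching is handed in as a plain function. Everything named here is hypothetical: `fetch` merely stands in for whatever an outsourced fetching service would return (it is not the 80legs API), `relevance` is a toy keyword scorer, and `FAKE_WEB` is an in-memory stand-in for the real Web. Links simply inherit the score of the page they were found on, which is one common heuristic and not the only option.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple

# Assumption: a fetcher takes a URL and returns (page_text, outgoing_links).
# In an outsourced setup this would wrap the remote service; here it is a stub.
Fetcher = Callable[[str], Tuple[str, List[str]]]


def relevance(text: str, topic_terms: Set[str]) -> float:
    """Toy scorer: fraction of topic terms that occur in the page text."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(seeds: Iterable[str], fetch: Fetcher, topic_terms: Set[str],
                  max_pages: int = 10, threshold: float = 0.5) -> List[str]:
    """Best-first crawl: always expand the most promising known URL next."""
    seeds = list(seeds)
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    hits: List[str] = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)          # the only step that touches the network
        fetched += 1
        score = relevance(text, topic_terms)
        if score >= threshold:
            hits.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cited it.
                heapq.heappush(frontier, (-score, link))
    return hits


# Hypothetical in-memory "web" used only to exercise the loop.
FAKE_WEB = {
    "a": ("web crawler topic", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("focused crawler frontier", ["e"]),
    "d": ("more recipes", []),
    "e": ("crawler precision", []),
}


def fake_fetch(url: str) -> Tuple[str, List[str]]:
    return FAKE_WEB.get(url, ("", []))
```

Calling `focused_crawl(["a"], fake_fetch, {"crawler"}, max_pages=4)` spends its small page budget on the on-topic chain and never gets around to `d`, the page reachable only through the off-topic one. The foraging logic (scoring and frontier ordering) lives entirely on your side, while `fetch` is the seam where a third-party crawler would plug in.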

Another upside is that with a good front end you could provide crawlers to end users at scale. I wonder what dirt-cheap personalized, adaptive, social spidering could enable.

© 2008-2024 C. Ross Jam.