Ken Krugler and the team at BixoLabs are doing some noteworthy work. They created Bixo, an open-source Java web crawler built on top of the Cascading MapReduce framework. I think it's one of the best open-source options for large-scale web crawling. Beyond that, they run their own cluster for specialized client crawls.
You should read their blog for some of the recent talks Ken gave on web mining.
Particularly interesting is that they are starting work on a web crawl called the Public Terabyte Project. Both the code and the crawl data will be available for free via Amazon's dataset hosting.
I look forward to taking a closer look at it. It's an important resource, because the ClueWeb09 crawl is only available to researchers who pay for it.