Thursday, December 3

BixoLabs Public Terabyte Webcrawl

Ken Krugler and the team at BixoLabs are doing some things worth noting. They created Bixo, an open-source Java web crawler built on top of the Cascading framework for Hadoop MapReduce. I think it's one of the best open-source options for large-scale web crawling. They also run their own cluster, which they use for specialized crawls for clients.

Their blog is also worth reading for some of Ken's recent talks on web mining.

Particularly interesting is that they are starting work on a Public Terabyte Project web crawl. Both the crawler code and the crawl data will be available for free via Amazon's public dataset hosting.

I look forward to digging into it. It's an important resource, since ClueWeb09 is only available to researchers who pay for it.


  1. Amazon's public data sets seem to be usable only from within an Amazon Web Services account. How much would it cost to sign up for EC2/S3 and download the data from Amazon?

    Amazon clearly states that contributors must have the right to make the data freely available. The problem with web crawls is that they contain all sorts of copyrighted material.

    While ClueWeb09 isn't free as in free beer (or free speech), CMU's only charging US$800 to ship the data to you on four 1.5TB hard drives. They make you promise not to redistribute the copyrighted data in the crawl.

    I have no idea what the intellectual property law is around these kinds of uses.

  2. Good question about the cost of transferring the data out of S3. S3 data transfer costs $0.10 per GB, so a terabyte would run about $100. I would expect it to be in that range.

    I've heard conflicting reports on the IP rules.

    ClueWeb09 was crawled with Nutch, a precursor to Bixo. I know there are some encoding and spam issues. I've heard rumors that the full CatA crawl may be up to 70% spam, but haven't looked closely yet. It sounds like the Bixo team is trying to get a cleaner crawl.

  3. Hi Bob & Jeff,

    Re transferring the data out of S3 - I'm not sure about Amazon's position on this. I've gotten one-off OKs to send the data to Internet Archive and the Wayback Machine.

    Re IP rules for web crawl data. Excellent question, and yes it is confusing. The most comprehensive analysis I've seen was done by "The Section 108 Study Group" in 2008. It didn't directly address whether caching of content (e.g. by Google) is a copyright violation, but it did provide guidance on when archiving of content is allowed.

    I also contacted CMU to find out what guidelines they used.

    It is a gray area, but the resulting corpus appears to be allowed under current copyright law as long as (a) content marked as not being appropriate for archiving is excluded, (b) the owner of content has the ability to have their content removed, and (c) the purpose of the archive is not for commercial re-publishing.

    The best equivalents I see are the caching of content by web crawlers, and the framing of pages by sites such as LinkedIn. I'm sure there will be additional clarifications in the future, which would be great for all involved.

    -- Ken

  4. Hi Jeff,

    Re quality of the crawl - yes, that's a serious goal and it's slowing things down quite a bit. But I'm tired of seeing stats like "We crawled 3 billion pages in 3 days" when it's clear that much of what gets crawled is junk.

    The approach we're taking is to have a whitelist of high-traffic, English-centric domains (from Alexa stats), a blacklist of porn/spam sites, and a grey list of everything else. URLs from the whitelist are given priority (see the sketch after the comments for the general idea). We're also experimenting with some spam/adult classifiers, to see how well they do on arbitrary pages.

    BTW, you must have some serious blog juice :) I noticed a sudden spike in traffic after this post appeared.

    -- Ken

  5. Ken, thanks for the follow-up and update.

    I'm not sure that I would trust Alexa's traffic rankings. For a minimal cost, you might consider using Compete's. Compete aggregates data from toolbar, ISP, and other sources in a way that looks to be more reliable.

    I'd love to see a post that goes into more detail on how the crawl is managed, e.g. per-domain limits.

    There are two recent papers on crawling you should read if you haven't yet:

    IRLbot: Scaling to 6 Billion Pages and Beyond (WWW 2008)

    The Impact of Crawl Policy on Web Search Effectiveness (SIGIR 2009)
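
For readers who want something concrete, here is a minimal sketch in Java of the whitelist/grey-list/blacklist bucketing Ken describes in comment 4. This is not Bixo code; the CrawlPrioritizer class, its method names, and the simplistic parent-domain matching (a real crawler would consult a public suffix list) are all illustrative assumptions on my part.

    import java.net.URI;
    import java.util.Set;

    // Illustrative sketch only; these names are not part of Bixo's actual API.
    public class CrawlPrioritizer {

        enum Priority { WHITE, GREY, BLACK }

        private final Set<String> whitelist;  // high-traffic, English-centric domains
        private final Set<String> blacklist;  // known porn/spam domains

        public CrawlPrioritizer(Set<String> whitelist, Set<String> blacklist) {
            this.whitelist = whitelist;
            this.blacklist = blacklist;
        }

        // Bucket a URL by its host: blacklist dropped, whitelist first, everything else grey.
        public Priority classify(String url) {
            String host;
            try {
                host = URI.create(url).getHost();
            } catch (IllegalArgumentException e) {
                return Priority.BLACK;  // unparseable URLs are not worth fetching
            }
            if (host == null) {
                return Priority.BLACK;
            }
            host = host.toLowerCase();
            if (matches(blacklist, host)) {
                return Priority.BLACK;
            }
            if (matches(whitelist, host)) {
                return Priority.WHITE;
            }
            return Priority.GREY;
        }

        // Match the host or any parent domain (e.g. news.example.com -> example.com).
        private static boolean matches(Set<String> domains, String host) {
            for (String h = host; h != null; h = parent(h)) {
                if (domains.contains(h)) {
                    return true;
                }
            }
            return false;
        }

        // Strip the leftmost label; stop once only a two-label domain remains.
        private static String parent(String host) {
            int dot = host.indexOf('.');
            return (dot < 0 || host.indexOf('.', dot + 1) < 0) ? null : host.substring(dot + 1);
        }
    }

In a scheme like this, whitelist URLs would be fetched first, grey-list URLs as capacity allows, and blacklist URLs dropped; the spam/adult classifiers Ken mentions could then re-score fetched grey-list pages.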