Thursday, March 26

TREC Billion Document Web Corpus Available

Exciting news!

ClueWeb09, a corpus of 1 billion web documents, is now available, and the upcoming TREC 2009 will use it. You can also see the crawl stats. It ships on four 1.5 TB hard drives.

Let the fun begin!


  1. Michael B. 5:22 PM EDT

    Looks like it's going to be very exciting data!

  2. That's really cool! 6 terabytes is a definite challenge.

    I have twittered it but don't know your twitter name so couldn't really give you the credit, sorry.