Information Retrieval research and search engine development discussion.
Thursday, March 26
TREC Billion Document Web Corpus Available
Exciting news!
ClueWeb09, a corpus of 1 billion web documents, is now available. The upcoming TREC 2009 will use it. You can also see the crawl stats. It ships on four 1.5 TB hard drives.
Looks like it's going to be very exciting data!
That's really cool! 6 terabytes is a definite challenge.
I have twittered it but don't know your Twitter name, so I couldn't really give you the credit, sorry.