Thursday, April 9

Glasgow Releases New Blog '08 Test Collection

Iadh just announced on the Terrier Team blog that they have released the new collection of blog data that will be used for the TREC 2009 Blog Track (details on the wiki). If you plan on participating, he suggests filling out the paperwork ASAP. You can also see the stats. At a high level, it has 28,488,767 blog posts from 1,303,520 blog feeds, crawled from January 2008 to mid-February 2009.

The time period for this dataset overlaps significantly with the full-web ClueWeb09 crawl. This will make cross-dataset connections interesting.

Thanks to the team at Glasgow for the work making this happen.

Out of curiosity, I wonder if it's somehow possible to put together a parallel Twitter corpus...

2 comments:

  1. Indeed, now I can have a day off!

    Last time someone released a Twitter corpus, Twitter had them take it down. Can't find the URL atm. Would definitely need to negotiate with Twitter on that one!

  2. I hope you manage to get in some time sailing in those frigid Scottish waters.
