Wednesday, December 9

ICWSM 2010 Data Challenge

ICWSM is a conference on blogs and social media. For the 2010 conference, the organizers have issued a data challenge.
The dataset, provided by, is a set of 44 million blog posts made between August 1st and October 1st, 2008. Each post includes the text as syndicated, as well as metadata such as the blog's homepage and timestamps. The data is formatted in XML and is further arranged into tiers that roughly approximate search engine ranking. The total size of the dataset is 142 GB uncompressed (27 GB compressed).
The deadline is March 1st.
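
To get a feel for the scale, here is a quick back-of-envelope sketch using only the figures quoted above. It assumes "GB" means 10^9 bytes, which the announcement does not specify, and the function name is just for illustration. Under that assumption the average post comes to roughly 3.2 KB of XML, and the archive compresses by a factor of about 5.3.

```python
# Back-of-envelope figures from the numbers quoted in the post.
# Assumption: "GB" means 10**9 bytes; the announcement does not say.
POSTS = 44_000_000
UNCOMPRESSED_BYTES = 142 * 10**9
COMPRESSED_BYTES = 27 * 10**9


def dataset_stats(posts=POSTS, raw=UNCOMPRESSED_BYTES, packed=COMPRESSED_BYTES):
    """Return average per-post sizes and the overall compression ratio."""
    return {
        "avg_raw_bytes_per_post": raw / posts,
        "avg_packed_bytes_per_post": packed / posts,
        "compression_ratio": raw / packed,
    }


if __name__ == "__main__":
    for name, value in dataset_stats().items():
        print(f"{name}: {value:.1f}")
```

At a few kilobytes per post, any single post is tiny, but the full set is large enough that it probably has to be processed as a stream rather than loaded into memory.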

Something to look at after the SIGIR deadline....
