Friday, December 5

Should we use Amazon Public Data Sets for test collections?

Amazon has a new service, Public Data Sets, which provides free hosting for collections of public data across different domains. This makes it simple to download the collections or to perform computation on them using Amazon's EC2 service.

Should IR groups be using it or a similar model to distribute and perform processing of test collections?

For example, there will likely be a billion-document web corpus for TREC 2009. However, there is concern over how few groups have the resources to handle a collection that large.


  1. I tried to think of reasons to answer 'No' to the question posed in the title, but I could not find any good ones.

    Although, perhaps a better way to put it would be "Should we use a model similar to the Amazon Public Data Sets for test collections?". Then, the answer is definitely 'Yes'.

  2. That will teach me to ask a yes/no question to programmers. If the answer is yes, then why aren't we doing it?

    For example, I know that in many cases the TREC datasets aren't 'public'; they require usage agreements and fees to acquire them.

  3. Of course this kind of thing would be nice. Often, however, it is more of a legal than technical issue. As Amazon says: "You must have the right to make the data freely available."

  4. 

    In an ideal world, the large research datasets would be hosted at Amazon and I would pay them to use EC2, instead of paying to acquire a dataset and then processing it locally.

    For instance, consider the current cost of $800 for the .GOV2 dataset. That amount could easily pay for up to 2,000 hours on an EC2 Large Linux/UNIX instance. I am sure Amazon could easily lower its prices for academic/research usage.
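
    The commenter's figure checks out as a back-of-the-envelope calculation, assuming the 2008 on-demand rate of $0.40/hour for a Large Linux/UNIX instance (the rate is an assumption; the post only quotes the dataset cost):

    ```python
    # Rough check of the comment's claim. The hourly rate is an
    # assumed 2008 EC2 price, not something stated in the post.
    dataset_cost = 800.00       # quoted cost of the .GOV2 dataset, in USD
    hourly_rate = 0.40          # assumed 2008 rate for an EC2 Large Linux/UNIX instance, USD/hour

    instance_hours = dataset_cost / hourly_rate
    print(f"${dataset_cost:.0f} buys about {instance_hours:.0f} instance-hours")  # → 2000
    ```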