Thursday, April 17

Ellen Voorhees defends Cranfield (TREC) evaluation

Daniel has a good post:
Ellen Voorhees defends Cranfield.

At ECIR Nick Belkin and Amit Singhal both highlighted limitations of the Cranfield evaluation methodology. (For more on Cranfield see Ellen's description from 2005). This is the methodology used at TREC, and by most people in the research community.

Here's a recap of the limitations outlined at ECIR:
  • The pooling approach means evaluation is biased against revolutionary new methods that do not return pooled documents (see the sketch below)
  • Documents and queries evolve rapidly over time, and these changes are not modeled in static test collections and query sets
  • In the real world, Cranfield-style evaluations are incredibly expensive and always out of date
  • It doesn't easily allow for interactive sessions; i.e. there is no 'conversation' between the search engine and the user
  • It is far removed from the real users' environments and search tasks
There needs to be a better way.
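
To see why pooling can penalize a genuinely new method, here's a rough sketch (made-up document IDs and a toy precision metric, nothing like TREC's actual tooling): unjudged documents are conventionally scored as non-relevant, so a system that surfaces relevant but unpooled documents gets no credit for them.

```python
# Rough sketch (not TREC's actual tooling): how pooled judgments can fail to
# credit a system whose relevant results were never judged.

# Judgments exist only for documents that some pooled run retrieved.
qrels = {
    "d1": True,   # judged relevant
    "d2": False,  # judged non-relevant
    "d3": True,   # judged relevant
    # "d9" was never returned by any pooled run, so it was never judged
}

def precision_at_k(ranked_docs, qrels, k=3):
    """Score a ranking; unjudged documents count as non-relevant (the usual convention)."""
    hits = sum(1 for d in ranked_docs[:k] if qrels.get(d, False))
    return hits / k

old_system = ["d1", "d3", "d2"]  # an established system: everything it returns is judged
new_system = ["d9", "d1", "d3"]  # a new method that also finds the unjudged d9

print(precision_at_k(old_system, qrels))  # ~0.67
print(precision_at_k(new_system, qrels))  # also ~0.67, even if d9 is actually relevant
```

The new system may actually have found an extra relevant document, but because that document was never in the pool, the numbers can't tell the two systems apart.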

Real search engines begin by looking at usage data and running tests on a fraction of users, but that's not something that academic researchers can reproduce.

1 comment:

  1. (1) I think pooling is independent of Cranfield/TREC, is it not? Sure, pooling is often done for convenience. But the basic framework doesn't say that pooling has to be done. So I don't see how this is a criticism of the basic framework.

    (2) Sure, documents and queries do evolve. That's why you continue creating new test collections, not static ones. That's why after TREC-1 ad hoc, there was a TREC-2 ad hoc... then a TREC-3... TREC-8... etc. Again, that's not a criticism of Cranfield, though.

    (3) I agree that Cranfield evaluations are expensive. But evaluation of any kind is expensive if you really want to be careful about, and certain of, what you are measuring. I don't agree that Cranfield-style evaluations are always out of date, though. Why would that be the case? Is it because 23% of all queries to a search engine are new, year over year? Or because the collection changes all the time, as new documents are added? That's fair enough. But what's the main concern or fear?

    Suppose, for example, that when trying to evaluate our system, we made an internal, checkpointed copy of the collection every 6 months, and then tested various algorithms against that static, internal copy. Suppose now that we are able to create an algorithm that gives us 30% improvement (by whatever metric) on that frozen collection. Is the concern that once you bring those algorithms live, back into the main, unfrozen collection, they will not generalize to the new, unseen data and queries? If not, why not? I assume that even when developing on your frozen collection you are following proper ML procedures, i.e. doing cross-validation, not testing on training data, etc. So because you're developing general algorithms, to begin with, how does "out of date" really affect you? Are people's searching behaviors really changing that radically every single week?
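
    To make that thought experiment concrete, here is a rough sketch of what I mean by cross-validating over queries on a frozen snapshot (the per-query metric is a random stand-in and the parameter grid is made up, purely for illustration):

    ```python
    import random

    # Rough sketch: tune a system parameter against a frozen snapshot using
    # cross-validation over queries, and report effectiveness only on held-out
    # queries. The metric here is a random stand-in, purely for illustration.

    def evaluate(query_id, param):
        # Stand-in for a real per-query score (e.g. average precision) measured
        # against the checkpointed collection and its relevance judgments.
        rng = random.Random(repr((query_id, param)))
        return rng.random()

    queries = [f"q{i}" for i in range(50)]   # queries in the frozen snapshot
    params = [0.1, 0.3, 0.5, 0.7, 0.9]       # made-up tuning grid

    def cross_validate(queries, params, k=5):
        """Pick the best parameter on k-1 folds, score it on the unseen fold."""
        rng = random.Random(0)
        shuffled = rng.sample(queries, len(queries))
        folds = [shuffled[i::k] for i in range(k)]
        held_out = []
        for i, test_fold in enumerate(folds):
            train = [q for j, fold in enumerate(folds) if j != i for q in fold]
            # Choose the parameter that does best on the training queries...
            best = max(params, key=lambda p: sum(evaluate(q, p) for q in train))
            # ...and record its effectiveness only on the unseen test queries.
            held_out.append(sum(evaluate(q, best) for q in test_fold) / len(test_fold))
        return sum(held_out) / k

    print(f"mean held-out effectiveness: {cross_validate(queries, params):.3f}")
    ```

    The point is that the tuning never sees its own test queries, so whatever improvement survives the held-out folds should be the kind that generalizes when the algorithm goes back to the live, unfrozen collection.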

    (4) I agree, the lack of interactivity evaluation is troublesome. At the same time, however, most-to-all (web) search engines aren't really interactive, anyway. Web search engines do not provide a "search as a dialogue" service by any means. So until they actually do start doing that, we don't have anything to evaluate, anyway :-)

    (5) I also don't understand how Cranfield methodology is far removed from real users' environments. If that is indeed a problem, it is a problem with how your topics are sampled, not with the Cranfield methodology. In short, you can still follow Cranfield, and tie it to real-world information needs.