Wednesday, May 21

Open Source Search Engine Evaluation: Test Collections

[You may also be interested in my post describing open source search engines]

Udi Manber posted on search quality yesterday over at the Google Blog.

The Lucene project has been maturing, and part of this is getting better at performing evaluation. Grant Ingersoll, Doron Cohen (IBM), and others are pushing the state-of-the-art (see my previous post on promising results) using Lucene with TREC test collections. However, there are problems.

Grant has a great post outlining the problems, Open Source Search Engine Relevance. The key barrier is that there is no way for your average developer to get access to GOV2 or other test collections without paying the fees (£600 for GOV2), which is unrealistic for casual system developers. Even if they did pay, the TREC collections are far removed from real-world uses of Lucene. Grant proposes:
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
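To make the TREC-style workflow Grant describes concrete: you need a set of relevance judgments (qrels) mapping each query to the documents judged relevant, and a "run" of ranked results from the engine under test; a metric like mean average precision (MAP) is then computed over the two. Here is a minimal sketch in Python; the query IDs, document IDs, and judgments are invented for illustration, not drawn from any real collection:

```python
# Sketch: scoring a ranked run against TREC-style relevance judgments.
# All IDs and judgments below are hypothetical.

def average_precision(ranked_docs, relevant):
    """Average precision for a single query's ranked result list."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank  # precision at this relevant rank
    return score / len(relevant) if relevant else 0.0

# qrels: query id -> set of docs judged relevant (hypothetical)
qrels = {"q1": {"d2", "d5"}, "q2": {"d1"}}

# run: query id -> ranked docs returned by the engine (hypothetical)
run = {"q1": ["d2", "d9", "d5"], "q2": ["d3", "d1"]}

ap = {q: average_precision(run[q], qrels[q]) for q in qrels}
mean_ap = sum(ap.values()) / len(ap)
print(round(mean_ap, 4))  # prints 0.6667
```

In practice a tool like trec_eval (or Lucene's contrib/benchmark Quality tasks) does this scoring from standard qrels and run files; the point is only that once the community publishes queries and judgments, the evaluation side is cheap.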
Steve Green, from Sun and the Minion project, responds in Open Source TREC: TRECmentum!:
I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.
I have a question: What kind of data and tasks will the search engines be used for? This should drive test collection creation. If the primary use of the search engine is to search e-mail lists, then an e-mail test collection makes sense. However, if the search engine is going to be used for web search, then a set of web pages is needed. You get the idea.

The documents in the collection will determine what techniques are effective and how the engine's ranking evolves. For example, with the TREC Terabyte Track, which uses GOV2, link analysis techniques (PageRank, HITS, etc.) do not improve ad-hoc relevance, but this is contrary to experience with general web search (see Amit Singhal's compelling A Case Study in Web Search using TREC Algorithms). Another common example is word proximity. In small collections (thousands or tens of thousands of documents) word proximity yields little improvement in relevance. However, in large-scale collections word proximity yields significant relevance improvements. In short, pick the documents and queries very carefully.

Before jumping to what sets of data to use, the community should look at how people want to use Lucene, Minion, etc., and find documents and queries(!) at the appropriate level of scale. Multiple collections are needed for different use cases (enterprise search, web search, product search, etc.). Other questions to consider: Do the document collections evolve? If so, how? Is there news content? Is spam an issue? Is duplicate content a problem? These should be modeled so that document recency, authority, etc. can be incorporated into usefulness/relevance.

Interactive Retrieval?
Also, before jumping in with TREC-like collection creation it is worth considering interactive retrieval. Can users submit multiple queries? If so, the results of one query are dependent on previous ones. This is problematic if you want to compare different systems.

Copyright and distribution issues
Let's say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that model content evolution over time. GOV2 is a static crawl of 25 million government documents. Because it consists of government documents, it contains little spam and is unencumbered by copyright constraints. To create a really large collection, I think you probably need to use commercial documents. However, there's a problem: these documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don't know the answer to that question. Could that volume of data even be distributed?

One possible solution might be to host collections on a single cluster and allow researchers/developers access to run tests. Searchers could even potentially run queries using some of the production-quality systems. This might get around the problem of re-distributing copyrighted content and provide test queries and usage data. In short, something like Amazon's Alexa web search platform.

Are there ongoing efforts in this area? I know there has been talk about a "shared testbed for IR research," see Challenges in Information Retrieval and Language Modeling, but I'm not sure how far along this is or whether it would be open to your average open source developer. One example of this approach is the experimentation being done at the Information Retrieval Facility. The IRF has a hardware platform and data collection to facilitate research on patent retrieval, but it's not open. I'll write more about that effort in a future post.

Any thoughts?

Update: Iadh from Glasgow replied on Jon's blog; see the comments at Disqus. If re-distribution of copyrighted content is possible, why don't we see bigger and more realistic test collections? You can fit a lot of compressed content on a terabyte hard drive. If you're working on such a collection, I'd love to hear about it.


  1. Yesterday I saw a request for participation in the creation of another test set: crawls and click-throughs from the Chilean web. This mostly Spanish-language dataset from 2003-2004 is likely to have costs associated with it, unless you participate in the relevance assessments, in which case access will be free. See their website or their email announcement for details.

  2. Another data point on copyright issues:
    The BLOG06 corpus is a crawl of over 100,000 publicly accessible blogs. I doubt the 100k bloggers were asked for permission to be included in the test collection. Citations to the original material are included in the collection. It is also re-distributed (for a more modest fee than GOV2) by U. Glasgow.

    There are plenty of public domain text collections that could be used: patents (as you mentioned), Wikipedia, loads of collections from other sources, and I'm sure there are more. Many of these collections may be able to provide some sort of query logs (even via Google referrers).

    My point is: the data is out there, and it doesn't necessarily need to be ad hoc web search. Even with commercial data, "fair use" provisions may get around some of the copyright issues.

  3. Just a pointer to the discussion on Jon's blog post [on Disqus]; there's a good discussion of the topic over there.

    To summarize, Iadh from the IR group at U Glasgow, which distributes the TREC collections, chimed in on the BLOG06 corpus question:

    That's why there are "organisation agreements" that clearly state how copyright issues are handled (see copyright section of the agreements). It is possible that some people might request the deletion of some files, and there is provision in the agreements to do so. In all cases, if this happens, the users of the collections bear the associated liability. There have never been such issues since the collections were first distributed by CSIRO.

    Thanks for the insight. I hope this opt-out policy is sound. It sounds like it would allow the redistribution of large test collections built from crawled web documents. Is that correct?

    If so, I would love to see some new test collections created. Is there a good "How-To" guide here? What are the pitfalls test collection creators should watch out for?

