The Lucene project has been maturing, and part of that is getting better at evaluation. Grant Ingersoll, Doron Cohen (IBM), and others are pushing the state of the art (see my previous post on promising results) using Lucene with TREC test collections. However, there are problems.
Grant has a great post outlining the problems, Open Source Search Engine Relevance. The key barrier is that there is no way for your average developer to get access to GOV2 or other test collections without paying the fees (£600 for GOV2). This is unrealistic for casual system developers. Even if they did pay, the TREC collections are far removed from real-world uses of Lucene. Grant proposes:
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
Steve Green from Sun and Minion responds (Open Source TREC: TRECmentum!):
I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.
I have a question: What kind of data and tasks will the search engines be used for? This should drive test collection creation. If the primary use of the search engine is to search e-mail lists, then an e-mail test collection makes sense. However, if the search engine is going to be used for web search, then a set of web pages is needed. You get the idea.
The documents in the collection will determine what techniques are effective and how the engine's ranking evolves. For example, in the TREC Terabyte Track, which uses GOV2, link analysis techniques (PageRank, HITS, etc.) do not improve ad-hoc relevance, but this is contrary to experience with general web search (see Amit Singhal's compelling A Case Study in Web Search using TREC Algorithms). Another common example is word proximity. In small collections (thousands or tens of thousands of documents), word proximity leads to little improvement in relevance. However, in large-scale collections it leads to significant relevance improvements. In short, pick the documents and queries very carefully.
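Claims like "technique X does or does not improve relevance on collection Y" come down to scoring ranked results against a set of judged queries. As a rough illustration, here is a minimal Python sketch that computes mean average precision (MAP) from TREC-style judgment lines. This is my own toy example, not Lucene's contrib/benchmark code. The function names, the document ids, and the two made-up runs are all assumptions for illustration; the four-column judgment layout (topic, iteration, document id, relevance) is the usual TREC qrels format.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style judgment lines: 'topic iteration doc_id relevance'.

    Returns a mapping from topic id to the set of documents judged relevant.
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, rel = parts
        if int(rel) > 0:
            relevant[topic].add(doc_id)
    return relevant


def average_precision(ranking, relevant_docs):
    """Average precision of one ranked list against one topic's judgments."""
    if not relevant_docs:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant_docs:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_docs)


def mean_average_precision(run, relevant):
    """Mean of per-topic average precision over all topics with relevant docs."""
    topics = sorted(relevant)
    if not topics:
        return 0.0
    total = sum(average_precision(run.get(t, []), relevant[t]) for t in topics)
    return total / len(topics)


# Hypothetical judgments and runs, invented purely for illustration.
qrels = parse_qrels([
    "1 0 docA 1",
    "1 0 docB 0",
    "1 0 docC 1",
    "2 0 docD 1",
])
run_baseline = {"1": ["docB", "docA", "docC"], "2": ["docD", "docE"]}
run_variant = {"1": ["docA", "docC", "docB"], "2": ["docE", "docD"]}

map_baseline = mean_average_precision(run_baseline, qrels)
map_variant = mean_average_precision(run_variant, qrels)
print(f"baseline MAP={map_baseline:.4f}  variant MAP={map_variant:.4f}")
```

Note that in this toy data the variant ranks topic 1 better and topic 2 worse than the baseline, so which system "wins" depends entirely on which queries and documents you happened to judge. That is the same effect, in miniature, as the link analysis and proximity examples above.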
Before jumping to what sets of data to use, the community should look at how people want to use Lucene, Minion, etc. and find documents and queries(!) at the appropriate level of scale. Multiple collections are needed for different use cases (enterprise search, web search, product search, etc.). Other questions to consider: Do the document collections evolve and, if so, how? Is there news content? Is spam an issue? Is duplicate content a problem? These should be modeled so that document recency, authority, etc. can be incorporated into usefulness/relevance.
Interactive Retrieval?
Also, before jumping in with TREC-like collection creation, it is worth considering interactive retrieval. Can users submit multiple queries? If so, the results of one query are dependent on previous ones. This is problematic if you want to compare different systems, because a fixed set of queries and judgments treats each query as standing alone.
Copyright and distribution issues
Let's say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that model content evolution over time. GOV2 is a static crawl of 25 million government documents. Because it consists of government documents, it contains little spam and is unencumbered by copyright constraints. In order to create a really large collection, I think you probably need to use commercial documents. However, there's a problem: these documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don't know the answer to that question. Could that volume of data even be distributed?
Solution?
One possible solution might be to host collections on a single cluster and allow researchers/developers access to run tests. Searchers could even potentially run queries using some of the production-quality systems. This might get around the problem of re-distributing copyrighted content and provide test queries and usage data. In short, something like Amazon's Alexa web search platform.
Are there ongoing efforts in this area? I know there has been talk about a "shared testbed for IR research" (see Challenges in Information Retrieval and Language Modeling), but I'm not sure how far along this is or if it would be open to your average open source developer. One example of this approach is the set of experiments being done at the Information Retrieval Facility. The IRF has a hardware platform and data collection to facilitate research on patent retrieval, but it's not open. I'll write more about that effort in a future post.
Any thoughts?
Update: Iadh from Glasgow replied on Jon's blog; see the comments at Disqus. If re-distribution of copyrighted content is possible, why don't we see bigger and more realistic test collections? You can fit a lot of compressed content on a terabyte hard drive. If you're working on such a collection, I'd love to hear about it.
