Wednesday, May 21

Open Source Search Engine Evaluation: Test Collections

[You may also be interested in my post describing open source search engines]

Udi Manber posted on search quality yesterday over at the Google Blog.

The Lucene project has been maturing, and part of that maturation is getting better at evaluation. Grant Ingersoll, Doron Cohen (IBM), and others are pushing the state of the art (see my previous post on promising results) using Lucene with TREC test collections. However, there are problems.

Grant has a great post outlining the problems, Open Source Search Engine Relevance. The key barrier is that there is no way for your average developer to get access to GOV2 or other test collections without paying the fees (£600 for GOV2). That is unrealistic for casual system developers. Even if they paid, the TREC collections are far removed from real-world uses of Lucene. Grant proposes:
So, what’s the point? I think it is time the open source search community (and I don’t mean just Lucene) develop and publish a set of TREC-style relevance judgments for freely available data that is easily obtained from the Internet. Simply put, I am wondering if there are volunteers out there who would be willing to develop a practical set of queries and judgments for datasets like Wikipedia, iBiblio, the Internet Archive, etc. We wouldn’t host these datasets, we would just provide the queries and judgments, as well as the info on how to obtain the data. Then, it is easy enough to provide simple scripts that do things like run Lucene’s contrib/benchmark Quality tasks against said data.
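For concreteness, here is a minimal sketch of what those contrib/benchmark Quality tasks look like in code, assuming TREC-format topics and qrels files and an existing index. The file names, field names, and run tag are placeholders, and exact signatures vary between Lucene releases:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;

import org.apache.lucene.benchmark.quality.Judge;
import org.apache.lucene.benchmark.quality.QualityBenchmark;
import org.apache.lucene.benchmark.quality.QualityQuery;
import org.apache.lucene.benchmark.quality.QualityQueryParser;
import org.apache.lucene.benchmark.quality.QualityStats;
import org.apache.lucene.benchmark.quality.trec.TrecJudge;
import org.apache.lucene.benchmark.quality.trec.TrecTopicsReader;
import org.apache.lucene.benchmark.quality.utils.SimpleQQParser;
import org.apache.lucene.benchmark.quality.utils.SubmissionReport;
import org.apache.lucene.search.IndexSearcher;

public class QualityRun {
  public static void main(String[] args) throws Exception {
    PrintWriter logger = new PrintWriter(System.out, true);

    // Read TREC-format topics (queries) and qrels (relevance judgments).
    QualityQuery[] qqs = new TrecTopicsReader()
        .readQueries(new BufferedReader(new FileReader("topics.txt")));
    Judge judge = new TrecJudge(new BufferedReader(new FileReader("qrels.txt")));
    judge.validateData(qqs, logger);

    // Turn each topic's title into a query against the "body" field.
    QualityQueryParser qqParser = new SimpleQQParser("title", "body");

    // Run the topics against an existing index; "docname" is the stored
    // field holding each document's TREC id.
    IndexSearcher searcher = new IndexSearcher("index");
    QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, "docname");
    SubmissionReport submitLog = new SubmissionReport(logger, "testRun");
    QualityStats[] stats = qrun.execute(1000, judge, submitLog, logger);

    // Print averaged precision/recall statistics over all topics.
    QualityStats.average(stats).log("SUMMARY", 2, logger, "  ");
  }
}
```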
Steve Green, of Sun and the Minion project, responds in Open Source TREC: TRECmentum!:
I think we should collect up as many mail archives as we could get our hands on as well (I, for example, could see about getting the OpenSolaris mailing lists and associated queries) since that data tends to have an interesting structure and it could lead (eventually) to tasks like topic detection and tracking and social network analysis. I'd even have a go at seeing if we could host the evaluation collections somewhere around here, if that was helpful.
I have a question: What kind of data and tasks will the search engines be used for? This should drive test collection creation. If the primary use of the search engine is to search e-mail lists, then an e-mail test collection makes sense. However, if the search engine is going to be used for web search, then a set of web pages is needed. You get the idea.

The documents in the collection will determine what techniques are effective and how the engine's ranking evolves. For example, with the TREC Terabyte Track, which uses GOV2, link analysis techniques (PageRank, HITS, etc...) do not improve ad-hoc relevance, but this is contrary to experience with general web search (see Amit Singhal's compelling A Case Study in Web Search using TREC Algorithms). Another common example is word proximity. In small collections (thousands or tens of thousands of documents), word proximity leads to little improvement in relevance; in large-scale collections, it leads to significant relevance improvements. In short, pick the documents and queries very carefully.
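To make the proximity point concrete, here is a minimal sketch using the classic Lucene query API (the "body" field and the terms are made-up examples): a boolean conjunction treats the two terms as an unordered bag of words, while a sloppy PhraseQuery additionally rewards documents where they appear close together.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ProximityExample {
  public static Query bagOfWords() {
    // Bag-of-words conjunction: both terms must appear, positions ignored.
    BooleanQuery bag = new BooleanQuery();
    bag.add(new TermQuery(new Term("body", "search")), BooleanClause.Occur.MUST);
    bag.add(new TermQuery(new Term("body", "quality")), BooleanClause.Occur.MUST);
    return bag;
  }

  public static Query proximity() {
    // Same terms, but scored higher when they occur within `slop`
    // positions of each other.
    PhraseQuery near = new PhraseQuery();
    near.add(new Term("body", "search"));
    near.add(new Term("body", "quality"));
    near.setSlop(5); // allow up to 5 intervening positions
    return near;
  }
}
```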

Before jumping to what sets of data to use, the community should look at how people want to use Lucene/Minion/etc... and find documents and queries(!) at the appropriate level of scale. Multiple collections are needed for different use cases (enterprise search, web search, product search, etc...). Other questions to consider: Do the document collections evolve? If so, how? Is there news content? Is spam an issue? Is duplicate content a problem? These should be modeled so that document recency, authority, etc... can be incorporated into usefulness/relevance.

Interactive Retrieval?
Also, before jumping in with TREC-like collection creation, it is worth considering interactive retrieval. Can users submit multiple queries? If so, the results of one query depend on previous ones, which is problematic if you want to compare different systems.

Copyright and distribution issues
Let's say for a minute that a web search track is interesting. A major barrier to improvements in academic and open source web search is the lack of large-scale (hundreds of millions or even billions of pages) test collections that model content evolution over time. GOV2 is a static crawl of 25 million government documents. Because it consists of government documents, it contains little spam and is unencumbered by copyright constraints. To create a really large collection, I think you probably need to use commercial documents. However, there's a problem: these documents are copyrighted! Is it possible to create a large-scale test collection of web documents that can be shared freely? I don't know the answer to that question. Could that volume of data even be distributed?

Solution?
One possible solution might be to host collections on a single cluster and allow researchers/developers access to run tests. Searchers could even potentially run queries using some of the production-quality systems. This might get around the problem of re-distributing copyrighted content and provide test queries and usage data. In short, something like Amazon's Alexa web search platform.
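As a sketch of the idea, the cluster could expose only a query API and never ship documents; everything below (endpoint, port, response format) is a hypothetical assumption, not any real platform's interface. The collection and index stay on the host, and every incoming query can be logged as usage data.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// Hypothetical sketch only: outsiders see nothing but ranked document
// ids for their queries, so copyrighted content is never redistributed.
public class HostedEvalService {
  public static void main(String[] args) throws Exception {
    HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
    server.createContext("/search", new HttpHandler() {
      public void handle(HttpExchange ex) throws IOException {
        String query = ex.getRequestURI().getQuery(); // e.g. "q=patent+search"
        // A real service would run the query against the local index,
        // log it as usage data, and return TREC-style run file lines.
        byte[] body = ("results for [" + query + "]: d12 d7 d42\n").getBytes("UTF-8");
        ex.sendResponseHeaders(200, body.length);
        OutputStream os = ex.getResponseBody();
        os.write(body);
        os.close();
      }
    });
    server.start();
  }
}
```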

Are there ongoing efforts in this area? I know there has been talk about a "shared testbed for IR research"; see Challenges in Information Retrieval and Language Modeling. But I'm not sure how far along this is or whether it would be open to your average open source developer. One example of this approach is the work being done at the Information Retrieval Facility. The IRF has a hardware platform and data collection to facilitate research on patent retrieval, but it's not open. I'll write more about that effort in a future post.

Any thoughts?

Update: Iadh from Glasgow replied on Jon's blog; see the comments at Disqus. If re-distribution of copyrighted content is possible, why don't we see bigger and more realistic test collections? You can fit a lot of compressed content on a terabyte hard drive (at, say, 10 KB per compressed page, that's on the order of 100 million pages). If you're working on such a collection, I'd love to hear about it.

Monday, May 19

Yahoo! SearchMonkey developer platform

One trend is for search engines to include third-party structured data to enrich their web search results.

This past week Yahoo! launched the SearchMonkey platform. Don't miss Search Engine Land's guide to creating a SearchMonkey application.

Here is an excerpt from Yahoo's recent blog post introducing the platform:

Developers can build two types of applications using SearchMonkey: Enhanced Results and Infobars. Enhanced Results replace the current standard results with a richer display. All the links in the Enhanced Results must point to the site to which the result refers. Infobars are appended below search results and can include metadata about the result, related links or content, or links for user actions (such as adding a movie to a Netflix queue).

1) Application Type -- Decide what type of app you want to build (Enhanced Result or Infobar) and enter basic info such as application name, description and icon.
2) Trigger URLs -- Decide the URL patterns that will trigger your app. For example, for the Enhanced Result above, the pattern would be "acmemovies.com/*"
3) Data Services -- Data Services are the structured data on which SearchMonkey apps are based. They can be created using data available in the Yahoo! Search index (via data feeds or page markup such as microformats or RDF) or by using APIs or page extraction.
4) Appearance -- Use PHP to configure how structured data should appear in the application.
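A note on step 2: the trigger pattern is a simple glob over the result URL. Here is a hypothetical illustration of the matching semantics; the glob-to-regex translation is my assumption about how the wildcard behaves, not SearchMonkey's actual implementation.

```java
import java.util.regex.Pattern;

public class TriggerMatcher {
  // Match a trigger pattern like "acmemovies.com/*" against a result URL.
  static boolean matches(String pattern, String url) {
    // Quote everything literally, then reopen the regex for each '*'.
    String regex = Pattern.quote(pattern).replace("*", "\\E.*\\Q");
    return Pattern.matches(regex, url);
  }

  public static void main(String[] args) {
    System.out.println(matches("acmemovies.com/*", "acmemovies.com/reviews/42")); // true
    System.out.println(matches("acmemovies.com/*", "other.com/acmemovies"));      // false
  }
}
```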

At first, it sounds somewhat similar to Google's Subscribed Links in Google Co-op because it lets content providers add structured data to results. However, Yahoo's service is very different. First, it does not require you to pick queries or query patterns to trigger the data; instead, it's based on the structured data associated with a URL. The biggest (and coolest) difference is that SearchMonkey offers developers the ability to completely control the presentation of their data via custom PHP code. Also, if you have verified ownership of the trigger URLs via Yahoo! Site Explorer, your application will automatically appear for all users. That's a far cry from Google Co-op, which requires you to market your links and get users to subscribe to them.

I'm impressed; I think there will be a lot of valuable applications built on SearchMonkey. A $10,000 contest for the best application won't hurt either ;-).

Hounder: A new open source search engine

I updated my list of open source search libraries and added Hounder. From my description:
Hounder - Technically, this could also be grouped with Lucene. Hounder is a complete out-of-the-box search engine by Flaptor. It's written in Java and includes a distributed focused crawler (with a classifier), an indexer, and a search system. It's most similar to Solr and Nutch; see their comparison. It appears to use Lucene as its underlying search library. Hounder powers Wordpress.com's search capability. Flaptor also claims to have a 300 million document collection running on approximately 30 nodes. They released their cluster management system as Clusterfest.
Don't miss Flaptor's blog.

Future of Search blog

Danny Sullivan (from Search Engine Land) and John Battelle (Searchblog) have teamed up for a blog on the Future of Search. Don't miss it.

Sunday, May 18

Catching up: Powerset Launches, GWAP, Minion Source, Friend Connect

It's been too long since my last post, and a lot has been happening. Here are a few of the highlights:

Powerset launched their first product, WikiSearch, which performs semantic search over Wikipedia. It's too early to pass judgment on Powerset and its technology, but a lot of people, including Fernando, Daniel, Jonathan, and myself, are still skeptical.

Luis von Ahn launched GWAP (Games With A Purpose), an extension of his past work with the ESP Game to collaboratively tag images. GWAP adds four exciting new games. Get playing!

Sun's open source search library, Minion, now has source code available for download. See my previous post. Read Steve's blog for the latest on Minion, including posts on Minion vs. Lucene. I haven't had a chance to download it and do a deep-dive, but I hope to soon.

Google launched a beta preview of Friend Connect, a platform for developers to add 'social' gadgets to their website.