Saturday, April 19

Updated List of open source search engines

I updated my previous post on open source search (information retrieval) libraries. See my notes at the end for details. The only true newcomer is Galago.

Thursday, April 17

Ellen Voorhees defends Cranfield (TREC) evaluation

Daniel has a good post:
Ellen Voorhees defends Cranfield.

At ECIR, Nick Belkin and Amit Singhal both highlighted limitations of the Cranfield evaluation methodology. (For more on Cranfield, see Ellen's description from 2005.) This is the methodology used at TREC and by most of the research community.

Here's a recap of the limitations outlined at ECIR:
  • Pooling means the evaluation is biased against revolutionary new methods that do not return pooled documents (see the sketch after this list)
  • Documents and queries evolve rapidly over time, and these changes are not modeled in static test collections and query sets
  • In the real world, Cranfield-style evaluations are incredibly expensive and always out of date
  • It doesn't easily allow for interactive sessions; i.e., there is no 'conversation' between the search engine and the user
  • It is far removed from real users' environments and search tasks
There needs to be a better way.
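
To make the pooling bias concrete, here is a minimal sketch. Everything in it is made up (the judgments, the two rankings, the precision@5 metric): the point is simply that under pooled judgments an unjudged document counts as non-relevant, so a novel system that retrieves genuinely relevant documents outside the pool is penalized.

    # Relevance judgments exist only for documents that made the pool.
    pooled_qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}

    def precision_at_5(ranking):
        # Cranfield convention: a document missing from the pool is
        # treated as non-relevant, whether or not it actually is.
        return sum(pooled_qrels.get(doc, 0) for doc in ranking[:5]) / 5

    # An established system whose results helped form the pool.
    pooled_system = ["d1", "d3", "d5", "d2", "d4"]

    # A novel method surfacing different documents; even if d7 and d9
    # are genuinely relevant, they were never judged and score zero.
    novel_system = ["d1", "d7", "d9", "d3", "d8"]

    print(precision_at_5(pooled_system))  # 0.6
    print(precision_at_5(novel_system))   # 0.4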

Real search engines begin by looking at usage data and running tests on a fraction of users, but that's not something that academic researchers can reproduce.
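
For illustration, here is a hedged sketch of the kind of online test alluded to above: deterministically assigning a small fraction of users to an experiment by hashing their id. The function name, the 1% fraction, and the experiment name are all assumptions, not any particular engine's API.

    import hashlib

    def in_experiment(user_id, experiment, fraction=0.01):
        # Hash the user id together with the experiment name so each
        # experiment gets an independent, stable assignment.
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) / 0xFFFFFFFF < fraction

    # The same user always lands in the same bucket for this experiment.
    print(in_experiment("user-42", "new-ranker"))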

Building a Web-Scale Recommendation Engine

Paul Lamere wrote about web-scale recommendation engine projects.

He talks about Project Aura and Project Caroline.

Project Aura is a web-scale, open, hybrid recommendation system that uses social data (the wisdom of the crowd) combined with the 'aura' of information extracted directly from content or mined from the web to make recommendations. By combining content-based methods with social methods, Project Aura can avoid much of the 'cold start' problem that plagues traditional collaborative-filtering recommenders, while providing a way to offer explainable recommendations.
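
As a rough illustration of the hybrid idea (this is not Project Aura's actual code; the items, scores, and blending weight are all invented), a content-similarity score can stand in when an item has no usage history yet:

    # Collaborative-filtering scores are missing for the cold-start item.
    cf_scores = {"album-a": 0.9, "album-b": 0.4}

    # Content similarity (e.g. from tags or audio analysis) covers everything.
    content_scores = {"album-a": 0.7, "album-b": 0.6, "new-album": 0.8}

    def hybrid_score(item, alpha=0.5):
        # Fall back entirely on content when there is no usage signal yet.
        if item not in cf_scores:
            return content_scores[item]
        return alpha * cf_scores[item] + (1 - alpha) * content_scores[item]

    for item in content_scores:
        print(item, round(hybrid_score(item), 2))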

Project Caroline is a platform that lets you programmatically control all of the infrastructure resources you might need to build a horizontally scaled system. You can allocate and configure databases, file systems, private networks (VLANs), load balancers, and more, all dynamically, which makes it easy to flex the resources your application uses up and down as required.
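
The following is a purely hypothetical pseudo-API, not Project Caroline's real interface, just to show what "programmatically controlling infrastructure" looks like in code:

    class Platform:
        """Stand-in for a programmable infrastructure service."""

        def allocate(self, kind, **config):
            # A real platform would provision the resource; here we
            # just echo the request to show the programming model.
            print(f"allocating {kind}: {config}")
            return {"kind": kind, **config}

    platform = Platform()
    db = platform.allocate("database", replicas=3)
    vlan = platform.allocate("vlan", cidr="10.0.0.0/24")
    lb = platform.allocate("load_balancer", backends=2)

    # Flexing resources up under load is just another API call.
    crawlers = [platform.allocate("process", image="crawler") for _ in range(60)]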

From Paul's post:
Now that we have Project Aura running on top of Project Caroline - I'm getting used to the idea of having 60 web crawling threads feeding a 16-way datastore that is being continually indexed by our search engine - and all of this is running across some number of processors - I don't really know how many, and I don't really care.

20 Questions with Udi Manber

Popular Mechanics has a rare interview with Udi Manber, VP of search quality at Google.

Just one of many interesting highlights:

There have been a lot of fads in search of late, such as Human Assisted Search and contextual search. Do those get folded into search as a whole? What are real trends in search and what are fluff?

So let me first tell you about Google. At Google we do not manually change results. For example, if we find for a particular query that result No. 4 should be result No. 1, we do not have the capability to manually change it. We made that decision not to put that capability in the algorithm—we have to go and actually change the algorithm. That is, we have to find what weakness in the algorithm caused that result and find a general solution to that, evaluate whether a general solution really works and if it’s better, and then launch a general solution. That makes the process slower, but it puts a lot more discipline on us and makes it more unbiased.
Matt Cutts posted a minor correction, noting exceptions for spam and legal removals.