Friday, April 25

Search and the New Economy class and videos

Panos Ipeirotis, a professor at NYU's Stern School, taught a course over the winter called Search and the New Economy. The course website includes useful slides and videotaped lectures. It's enlightening to see how a professor with a technical CS and IR background presents search to a business audience.

You can read more of his interesting work on his blog, A Computer Scientist in Business School. Thanks to Daniel for the pointer to Panos's blog; it's worth reading!

More on Minion and Steve Green

Steve Green has been posting (Minion Basics, Dictionary and Postings, Minion and Lucene) as Sun's Minion search engine gets closer to being made available. Stay tuned!

Steve was also featured in a Contrarian Minds article on Sun's website: Searching for Perfection.

Thursday, April 24

WWW2008 in Beijing this week

Erik has a post on the talks by Google's Kai-Fu Lee and Microsoft's Harry Shum of Live Search.

Also, snippets from William Chang, Chief Scientist at Baidu.

All of the papers are now online.

Monday, April 21

Mahout: distributed machine learning using Hadoop

I saw on Grant's blog that he is working on a new project, Mahout.

The goal of Apache Mahout is to enable scalable machine learning algorithms on large clusters. It builds on the algorithms outlined in Map-Reduce for Machine Learning on Multicore. The project is just getting off the ground; it has only been active since January.
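The core idea in that paper is that many learning algorithms can be written in a "summation form": mappers compute partial sufficient statistics over their data shards and a reducer adds them up. Here is a toy sketch of that pattern in plain Python (not Mahout or Hadoop code; the mean/variance example is my own illustration):

```python
# Toy illustration of the summation form from "Map-Reduce for Machine
# Learning on Multicore": each mapper emits partial statistics over its
# shard, and a reducer combines them into the global result.

def mapper(shard):
    # Partial statistics for a mean/variance computation over one shard.
    n = len(shard)
    s = sum(shard)
    sq = sum(x * x for x in shard)
    return (n, s, sq)

def reducer(partials):
    # Sum the partial statistics from all shards, then finish the math.
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n
    var = sq / n - mean * mean
    return mean, var

shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, var = reducer([mapper(s) for s in shards])
# mean -> 3.5 (same as computing over all six values at once)
```

Because the mappers are independent, a framework like Hadoop can run them on separate machines; only the small partial tuples travel over the network.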

Mahout just announced that Sean Owen's Taste recommender system will be absorbed into the project.

You can also read the blog of Jeff Eastman, one of the Mahout committers.

Accepted a position in the PhD program at UMass Amherst

It's official, I will be entering the PhD program in Comp Sci. at UMass Amherst this September. I plan to specialize in Information Retrieval at the CIIR.

I was accepted into the PhD program at UMass and the Master's program at Carnegie Mellon's LTI. Choosing between the two programs was a really difficult decision. In the end, it came down to the funding offers and personal reasons: friends and family in New England, and the beautiful Amherst area. The students and faculty I met at UMass were friendly and reflected the strength of the program. I look forward to working with them this fall.

I want to especially thank Ben Ransford from UMass and Aaron Phillips from CMU. They were my student hosts and were both very gracious and generous with their time. It was a pleasure getting to know them.

I also applied to: MIT, Columbia, the University of Washington, the University of Waterloo, and the University of Maryland. There were also a number of great universities that I didn't apply to for personal reasons.

Also, over the next few months I will say more about my plans and elaborate on interesting research problems.

Lucene relevance ranking beats state-of-the-art at TREC

The IBM Haifa team competed in the TREC Million Query Track. They focused on improving the relevance ranking of Lucene and comparing it to their own engine, Juru. Their goal was to improve Lucene's ranking to the level of the state-of-the-art research engines used in TREC. They succeeded, coming in at (or very near, depending on the metric and evaluator) the top of the results!

The changes they made are spelled out in detail on the Lucene Wiki and in their TREC result paper. Their paper is a must-read and a guide to Lucene ranking best practices.

Results
Based on the 149 topics of the Terabyte tracks, the results of modified Lucene significantly outperform the original Lucene and are comparable to Juru’s results.

Run                                             MAP    P@5    P@10   P@20
1. Juru                                         0.313  0.592  0.560  0.529
2. Lucene out-of-the-box                        0.154  0.313  0.303  0.289
3. Lucene + LA + Phrase + Sweet Spot + tf-norm  0.306  0.627  0.589  0.543


Lucene relevance upgrades

From the wiki, they changed:

  1. Add a proximity scoring element, based on our experience with "lexical affinities" in Juru. Juru creates posting lists for lexical affinities; in Lucene, we augmented the query with Span-Near-Queries.

  2. Phrase expansion - the query text was added to the query as a phrase.

  3. Replace the default similarity with Sweet-Spot-Similarity for a better choice of document length normalization. Juru uses pivoted length normalization; we experimented with it, but found that the simpler and faster sweet-spot-similarity performs better.

  4. Normalized term frequency, as in Juru: tf(freq) is normalized by the average term frequency of the document.
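Items 3 and 4 are small formulas, so here is a minimal Python sketch of both. The length-norm formula follows the shape of Lucene's SweetSpotSimilarity (flat inside a "sweet spot" of document lengths, decaying outside it), and the tf normalization is one plausible reading of the wiki's description; neither is the IBM team's exact code:

```python
import math

def avg_tf(doc_length, unique_terms):
    # Average term frequency of a document: total tokens / distinct terms.
    return doc_length / unique_terms

def normalized_tf(freq, doc_length, unique_terms):
    # Item 4 (sketch): raw term frequency divided by the document's
    # average term frequency, so long documents don't win on raw counts.
    return freq / avg_tf(doc_length, unique_terms)

def sweet_spot_length_norm(length, low, high, steepness=0.5):
    # Item 3 (sketch): sweet-spot length normalization. Documents with
    # lengths in [low, high] get norm 1.0; shorter or longer documents
    # are penalized, more sharply as steepness grows.
    x = steepness * (abs(length - low) + abs(length - high) - (high - low))
    return 1.0 / math.sqrt(x + 1.0)
```

For any length inside [low, high] the penalty term x is zero, so the norm is exactly 1.0; this is the "plateau" that distinguishes it from pivoted length normalization, which penalizes all lengths away from a single pivot.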

Performance Penalty
However, the improved relevance came at a steep price: average query time went from 1.4 seconds per query to 8 seconds per query. Significant changes are needed to make these techniques feasible in real-world systems with large corpora. See my previous post on IR engines, in particular my comments on Lucene's scaling problems.

I use Lucene in RecipComun, my recipe search engine. In the near future I hope to write about some of my attempts to make Lucene both relevant and fast.

Sunday, April 20

Minion: A new open source search engine from Sun

Steve Green from Sun Labs just announced that they are open sourcing their Minion search engine. Minion is the search engine that ships with Sun's portal and web server.
Minion provides ranked boolean, proximity, and parametric query operators. In addition to the query operations, Minion provides document similarity operations as well as automatic document classification and document clustering capabilities.
It looks like my recent revisions to the OS search engine list already need updating!

This looks like it could be a really great engine. I can't wait to see how this develops.