Monday, April 21

Lucene relevance ranking beats state-of-the-art at TREC

The IBM Haifa team competed in the TREC Million Query Track. Their goal was to improve Lucene's relevance ranking and compare it both to their own engine, Juru, and to the state-of-the-art research engines used in TREC. They succeeded, coming in at (or very near, depending on metrics and evaluators) the top of the results!

The changes they made are spelled out in detail on the Lucene Wiki and in their TREC result paper. Their paper is a must-read and a guide to Lucene ranking best practices.

Based on the 149 topics of the Terabyte tracks, the results of modified Lucene significantly outperform the original Lucene and are comparable to Juru’s results.






1. Juru

2. Lucene out-of-the-box

3. Lucene + LA + Phrase + Sweet Spot + tf-norm

Lucene relevance upgrades

From the wiki, they changed:

  1. Add a proximity scoring element, based on our experience with "Lexical affinities" in Juru. Juru creates posting lists for lexical affinities. In Lucene we augmented the query with Span-Near-Queries.

  2. Phrase expansion - the query text was added to the query as a phrase.

  3. Replace the default similarity by Sweet-Spot-Similarity for a better choice of document length normalization. Juru is using pivoted length normalization and we experimented with it, but found out that the simpler and faster sweet-spot-similarity performs better.

  4. Normalized term-frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document.
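To make changes 3 and 4 more concrete, here is a minimal Python sketch. It is not the team's code and not Lucene's API. The function names are hypothetical. The exact tf form, log(1 + freq) / log(1 + average freq), is my assumption of how "normalized by the average term frequency" is computed. The length norm follows the plateau shape I understand Lucene's SweetSpotSimilarity to use, and the min, max and steepness values are purely illustrative.

```python
import math
from collections import Counter


def average_term_frequency(tokens):
    """Average number of occurrences per distinct term in one document."""
    if not tokens:
        raise ValueError("document has no tokens")
    counts = Counter(tokens)
    return len(tokens) / len(counts)


def normalized_tf(freq, avg_freq):
    """Term frequency normalized by the document's average term frequency.

    Assumed form: log(1 + freq) / log(1 + avg_freq).
    """
    if freq <= 0:
        return 0.0
    return math.log(1 + freq) / math.log(1 + avg_freq)


def sweet_spot_length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Length norm that is flat (1.0) for lengths in [min_len, max_len].

    Outside that "sweet spot" the norm decays with the distance from the
    plateau. Assumed form:
    1 / sqrt(steepness * (|n - min| + |n - max| - (max - min)) + 1).
    """
    distance = (abs(num_terms - min_len)
                + abs(num_terms - max_len)
                - (max_len - min_len))
    return 1.0 / math.sqrt(steepness * distance + 1.0)


if __name__ == "__main__":
    doc = "lucene ranking lucene trec lucene juru".split()
    avg = average_term_frequency(doc)
    for term, freq in Counter(doc).items():
        print(term, round(normalized_tf(freq, avg), 3))
    # Illustrative sweet spot of 100 to 200 terms.
    for length in (50, 150, 400):
        print(length, round(sweet_spot_length_norm(length, 100, 200), 3))
```

Under these assumptions, a term that occurs exactly as often as the document's average gets a tf of 1.0, and every document whose length falls inside the sweet spot gets the same length norm of 1.0, so documents are only penalized for being unusually short or unusually long.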

Performance Penalty

However, improved relevance came at a steep price. Average query time went from 1.4 seconds per query to 8 seconds per query, a slowdown of roughly 5.7x. Significant changes are needed to make these techniques feasible in real-world systems with large corpora. See my previous post on IR engines, in particular my comments on Lucene's scaling problems.

I use Lucene in RecipComun, my recipe search engine. In the near future I hope to write about some of my attempts to make Lucene relevant and fast.


  1. Suprabhat Das, 2:42 AM EDT

    Hi Jeff,
    I'm very new to Lucene and I want to index the TREC data using Lucene-2.4.1 and also calculate the MAP value using a topics list and qrels.
    I know that there is some code in the contrib/benchmark module (under the quality package).
    But can you please explain in detail how to
    1) Index TREC data using this code?
    2) Calculate the MAP value from a topics list and qrels using this code?

    Thanks in advance,

    Suprabhat Das

    1. Anonymous, 9:25 AM EDT

      Hello Jeff:
      My question is about how much "tweaking" can be done on Lucene's algorithms and indexing system. For example, can it be adjusted to standard information retrieval (IR) practice (keeping the search method very simple), without using factors such as page ranking, links, etc.?