The changes that they made are spelled out in detail on the Lucene Wiki and in their TREC result paper. Their paper is a must-read and a guide to Lucene best practices for relevance.
Results
Based on the 149 topics of the Terabyte tracks, the results of modified Lucene significantly outperform the original Lucene and are comparable to Juru’s results.
| Run | MAP | P@5 | P@10 | P@20 |
| 1. Juru | 0.313 | 0.592 | 0.560 | 0.529 |
| 2. Lucene out-of-the-box | 0.154 | 0.313 | 0.303 | 0.289 |
| 3. Lucene + LA + Phrase + Sweet Spot + tf-norm | 0.306 | 0.627 | 0.589 | 0.543 |
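For readers unfamiliar with the column headings, here is a minimal sketch of how P@k and MAP are commonly defined. It follows the textbook definitions and is not the TREC evaluation tool itself; the function names and the toy document IDs are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant_ids)


def mean_average_precision(topics):
    """MAP: average precision averaged over (ranking, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Toy example: one topic, five results, three relevant documents (one never retrieved).
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
p_at_5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
```

In the toy example two of the top five results are relevant, so P@5 is 0.4. The figures in the table are these quantities averaged over the 149 topics.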
Lucene relevance upgrades
From the wiki, they changed:
Add a proximity scoring element, based on our experience with "Lexical affinities" in Juru. Juru creates posting lists for lexical affinities. In Lucene we augmented the query with Span-Near-Queries.
Phrase expansion - the query text was added to the query as a phrase.
Replace the default similarity by Sweet-Spot-Similarity for a better choice of document length normalization. Juru is using pivoted length normalization and we experimented with it, but found out that the simpler and faster sweet-spot-similarity performs better.
Normalized term-frequency, as in Juru. Here, tf(freq) is normalized by the average term frequency of the document.
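To make the last two changes more concrete, here is a rough sketch in plain Python rather than Lucene's Java API. The length norm follows the shape used by Lucene's SweetSpotSimilarity as I understand it: flat inside a plateau and decaying outside it. The plateau bounds, the steepness value and the logarithmic damping in `normalized_tf` are my assumptions for illustration, not values or formulas taken from the wiki.

```python
import math


def default_length_norm(num_terms):
    """Classic Lucene length norm: every extra term lowers the score."""
    return 1.0 / math.sqrt(num_terms)


def sweet_spot_length_norm(num_terms, min_len, max_len, steepness=0.5):
    """Length norm that is 1.0 for lengths inside [min_len, max_len] and decays outside.

    min_len, max_len and steepness are hypothetical tuning values.
    """
    # Distance is 0 inside the plateau and grows with the gap to the nearest edge.
    distance = abs(num_terms - min_len) + abs(num_terms - max_len) - (max_len - min_len)
    return 1.0 / math.sqrt(steepness * distance + 1.0)


def normalized_tf(freq, doc_length, unique_terms):
    """Term frequency scaled by the document's average term frequency.

    Assumes a non-empty document; the log damping is an assumption.
    """
    avg_freq = doc_length / unique_terms
    return math.log(1 + freq) / math.log(1 + avg_freq)


# A 500-term document sits inside a hypothetical 100..1000 plateau, so it is not penalized.
inside = sweet_spot_length_norm(500, 100, 1000)
outside = sweet_spot_length_norm(2000, 100, 1000)
# A term occurring exactly as often as the document's average term scores 1.0.
typical = normalized_tf(2, doc_length=100, unique_terms=50)
```

The point of the plateau is that documents of "reasonable" length are not penalized relative to each other, whereas the default norm keeps shrinking as documents grow. Dividing by the average term frequency plays a similar role for tf: a term only counts as frequent if it is frequent relative to the other terms in the same document.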
However, improved relevance came at a steep price. Average query time went from 1.4 seconds per query to 8 seconds per query, nearly six times slower. Significant changes are needed to make these techniques feasible in real-world systems with large corpora. See my previous post on IR engines, in particular my comments on Lucene's scaling problems.
I use Lucene in RecipComun, my recipe search engine. In the near future I hope to write about some of my attempts to make Lucene relevant and fast.
