Last Friday, I wrote about the Stemming Dilemma and suggested query expansion as an alternative to indexing stems. Here are some more thoughts on query expansion.
First, my definition:
Query expansion is the process of automatically adding or suggesting new search terms in response to a query. For example, a query for motor might be expanded with its plural form to (motor OR motors), or with synonyms, so that a term such as automobiles is also included.
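To make the definition concrete, here is a minimal sketch of both kinds of expansion in Python. The synonym table, the function names, and the naive "add an s" pluralization rule are all assumptions made for illustration; a real system would more likely draw on a thesaurus, a morphological analyzer, or query logs.

```python
# Hypothetical synonym table, invented for this example.
SYNONYMS = {"car": ["automobile"]}


def plural_variants(term):
    """Return the term plus a naive plural (just appends 's')."""
    if term.endswith("s"):
        return [term]  # assume it is already plural
    return [term, term + "s"]


def expand_query(query, use_synonyms=False):
    """Expand each term of a whitespace-separated query into an OR group."""
    clauses = []
    for term in query.lower().split():
        variants = plural_variants(term)
        if use_synonyms:
            for synonym in SYNONYMS.get(term, []):
                variants.extend(plural_variants(synonym))
        if len(variants) == 1:
            clauses.append(variants[0])
        else:
            clauses.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(clauses)


print(expand_query("motor"))                   # (motor OR motors)
print(expand_query("car", use_synonyms=True))  # (car OR cars OR automobile OR automobiles)
```

Note that the original term always stays in the expanded query; expansion only adds alternatives.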
From the user's perspective, query expansion can be equivalent to stemming/lemmatization. Like stemming, query expansion can improve recall, sometimes at the cost of precision (by introducing non-relevant documents).
Query expansion offers greater flexibility than indexing the stem in the same position as the original term in the document, because expansion can be enabled on a per-query basis and query term weights can be changed on a per-query basis. Enabling an expansion at the per-query level is important because an expansion may be beneficial for one query but detrimental for another. One significant drawback to query expansion is that adding words to a user's query creates a more complex query that is more resource-intensive to answer. Like stemming, it can also have a detrimental effect on query precision unless used with care.
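As a rough sketch of what per-query control could look like, the function below takes a switch and a weight as arguments, so each query can decide whether to expand and how much to trust the added terms. The function name, its parameters, and the default weight of 0.5 are assumptions for illustration; the output borrows the term^boost notation used by Lucene's query syntax, but nothing here calls a real search library.

```python
def build_weighted_query(term, expansions, expand=True, expansion_weight=0.5):
    """Build an OR query in which expanded terms carry a lower weight.

    expand and expansion_weight are chosen per query, which is the
    flexibility that index-time stemming does not offer.
    """
    if not expand or not expansions:
        return term  # expansion switched off for this query
    parts = [term] + [f"{e}^{expansion_weight}" for e in expansions]
    return "(" + " OR ".join(parts) + ")"


# Expansion enabled, with expanded terms down-weighted:
print(build_weighted_query("motor", ["motors"]))
# (motor OR motors^0.5)

# Expansion disabled for a query where it would hurt precision:
print(build_weighted_query("motor", ["motors"], expand=False))
# motor
```

Down-weighting the added terms is one way to recover some of the precision that expansion can cost: documents matching the user's original term still rank ahead of those matching only an expansion.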
Query Expansion in action
Google allows query expansion in their enterprise search product; here's the how-to from their blog.
LucQue - A module for doing query expansion with Lucene. It uses Google's web API to find terms to use for query expansion.
Improving Automatic Query Expansion
Automatic Query Expansion Using SMART : TREC 3
Probabilistic Query Expansion Using Query Logs
Query Expansion Using Local and Global Document Analysis
Introduction to Information Retrieval, Chapter 9: Query Expansion and Relevance Feedback
It's interesting to note that most of these early papers deal with small corpuses (at least in comparison with today's web) and expand queries with hundreds of terms. On small corpuses, aggressive query expansion or stemming can significantly improve recall and therefore greatly improve effectiveness. However, as corpus size grows, precision becomes more important and query expansion can introduce noise, reducing effectiveness (especially important for web search). Also, the extra resources needed to answer expanded queries on smaller corpuses are not as problematic as they can become on large data sets. In short, query expansion can have significant utility on smaller corpuses, but is perhaps not as helpful on large ones.
Query expansion is also often closely related to relevance feedback... but that's fodder for a future post.