Friday, March 23

Lingpipe and the stemming dilemma

Lingpipe has an interesting article on stemming: To Stem or not to Stem that is the question. It opens with:

I’d never thought too hard about stemming per se, preferring to tackle the underlying problem in applications through character n-gram indexing (we solve the same problems facing tokenized classifiers using character language models)...
It then goes on to review some of the interesting papers on stemming, concluding that the evidence for stemming is still inconclusive.

Carp's bottom line:
What’s a poor computational linguist to do? One thing I’d recommend is to index both the word and its stem. If you’re using a TF/IDF-based document ranker, such as the one underlying Apache Lucene, you can index both the raw word and any number of stemmed forms in the same position and hope IDF sorts them out in the long run...
I agree: stemming can lead to interesting situations where stockings become confused with the stock market... and then searchers get women's lingerie when they are looking for financial information. Tread with care.
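
To make the stockings example concrete, here is a minimal sketch of how the conflation happens. The `toy_stem` function and its suffix list are made up for illustration; it is a crude suffix stripper, not the Porter algorithm or anything Lingpipe or Lucene ships.

```python
# Toy suffix-stripping stemmer (hypothetical, for illustration only).
SUFFIXES = ("ings", "ing", "s")  # checked longest first


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


queries = ["stock", "stocks", "stocking", "stockings"]
stems = {q: toy_stem(q) for q in queries}

# Every query collapses to the same index term.
distinct_stems = set(stems.values())
print(stems)
print(distinct_stems)
```

All four words end up as the single term `stock`, so a stem-only index has no way to tell the hosiery from the equities.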

However, Carp's suggestion to index the stem in the same position can be problematic. If you index the raw word and the stemmed form in the same position, there should be a way to distinguish stemmed terms from unstemmed ones at search time. A term produced by stemming may be less relevant than the original word, and different morphological changes shift a word's meaning by different amounts (pluralization versus a noun-to-verb conversion). Also, if stems are not indexed as separate terms, stemming can significantly lower the IDF of common root terms, perhaps leaving root terms quite inconsequential when they are used on their own. An interesting (equivalent) alternative to stemming is query expansion... more on this to follow.
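
The IDF concern can be illustrated with a rough sketch on a five-document toy corpus. Everything here is an assumption for illustration: the corpus, the `STEMS` table, the `stem:` prefix used to keep stems apart, and the textbook `log(N / df)` formula, which is not Lucene's exact scoring.

```python
import math

# Hypothetical stem table and toy corpus (illustration only).
STEMS = {"stocks": "stock", "stockings": "stock", "stocking": "stock"}

docs = [
    ["stock", "market", "rally"],
    ["silk", "stockings", "sale"],
    ["stocks", "fall", "sharply"],
    ["christmas", "stocking", "gifts"],
    ["bond", "market", "news"],
]


def index_terms(doc, stem_prefix):
    """Terms indexed for one document: each raw word plus its (prefixed) stem."""
    terms = set(doc)
    for word in doc:
        terms.add(stem_prefix + STEMS.get(word, word))
    return terms


def idf(term, indexed_docs):
    """Textbook IDF: log(N / df). Infinite if the term never occurs."""
    df = sum(1 for terms in indexed_docs if term in terms)
    return math.log(len(indexed_docs) / df) if df else float("inf")


# Stems share the raw word's term ("stock") versus a separate term ("stem:stock").
merged = [index_terms(d, "") for d in docs]
separate = [index_terms(d, "stem:") for d in docs]

idf_merged = idf("stock", merged)               # df = 4 of 5 documents
idf_separate = idf("stock", separate)           # df = 1 of 5 documents
idf_stem_term = idf("stem:stock", separate)     # df = 4 of 5 documents

print(idf_merged, idf_separate, idf_stem_term)
```

When stems are merged into the raw term, `stock` appears in four of the five documents and its IDF drops sharply. Kept as a separate term, the exact word `stock` stays rare and discriminating, and the stemmed term can be given a lower weight at query time.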
