Thursday, January 21

Google examines synonym effectiveness in query expansion

Google has used synonyms for query expansion for several years now. It is part of their attempt to find what you mean, not just what you type. Steven Baker, an engineer on the quality team, wrote a post covering a recent examination of synonym usage in query expansion. He writes,
...our measurements show that synonyms affect 70 percent of user searches across the more than 100 languages Google supports. We took a set of these queries and analyzed how precise the synonyms were, and were happy with the results: For every 50 queries where synonyms significantly improved the search results, we had only one truly bad synonym.
Another tidbit is that Google is expanding their highlighting of synonyms in search result summaries.

Lastly, a tip if you get stuck with one of the 1 in 50 queries where synonyms go bad:
You can also turn off a synonym for a specific term by adding a "+" before it or by putting the words in quotation marks.
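To make the tip concrete, here is a toy sketch of how a "+" prefix or quotation marks could opt a term out of expansion. It is only an illustration under my own assumptions, not Google's implementation: the `SYNONYMS` table and the `expand_query` function are hypothetical names invented for this example, and a real system would choose synonyms from the query context rather than from a fixed dictionary.

```python
import re

# Hypothetical synonym table, made up for illustration only.
SYNONYMS = {
    "pictures": ["photos", "images"],
    "car": ["auto", "automobile"],
}

def expand_query(query, synonyms=SYNONYMS):
    """Return (term, alternatives) pairs for a toy query.

    Terms prefixed with "+" or wrapped in double quotes are kept
    verbatim and receive no synonym alternatives.
    """
    expanded = []
    # Keep quoted phrases together; otherwise split on whitespace.
    for token in re.findall(r'"[^"]*"|\S+', query):
        if len(token) >= 2 and token.startswith('"') and token.endswith('"'):
            # Quoted: strip the quotes and skip expansion.
            expanded.append((token[1:-1], []))
        elif len(token) > 1 and token.startswith("+"):
            # "+" prefix: strip it and skip expansion.
            expanded.append((token[1:], []))
        else:
            # Plain term: look up alternatives (case-insensitive).
            expanded.append((token, list(synonyms.get(token.lower(), []))))
    return expanded

print(expand_query("car pictures"))
print(expand_query('+car "pictures"'))
```

In the first call both terms pick up alternatives; in the second, neither does, which is the behavior the tip describes.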
Bill Slawski has good coverage of the post, and previous work on synonym usage, including Steven's patent, Determining query term synonyms within query context.

3 comments:

  1. 1 bad synonymization for every 50 good ones? That sounds fairly incredible. Much better than any of the synonym experiments I've ever done with TREC over the past decade.

    Why do you think their numbers are so high? Much more training data available to them (web scale)?

  2. The approach is different from typical TREC expansion methods. It is closer to recent work on query reformulation; minor changes to the query to make it more effective.

    Clearly, the fact that they have large volumes of user query logs helps significantly.

    For traditional TREC corpora and metrics (e.g. MAP), I hypothesize that typical academic methods that use more expansion terms may improve effectiveness. The scale and efficiency are also not significant issues there.

    Google has clearly spent years working on fine-tuning their system. It hasn't received much recent attention in academia. Perhaps this will change as ClueWeb and larger web corpora are created and evaluation evolves.

  3. Interesting post. It would be great if there were a synonym extractor that doesn't need query log data.
