Wednesday, July 25

Google's Semantic Unit Locator

Bill Slawski over at SEO-By-The-SEA posted an article today on Google's recent patent on "Semantic Units," also known in the NLP world as Multiword Expressions (MWEs). The patent lays out a system for identifying useful phrases ('compounds') based on queries and the documents retrieved during the search.

Background: Finding Meaningful Phrases
First, the authors lay out the problems with finding meaningful phrases from web documents and query logs independently:
The disadvantage with this [document] approach is that it is inefficient, because there are many more compounds in the corpus than would typically occur in user queries. Thus, only a small fraction of the detected compounds are useful in practice... Identifying all compounds on the web is computationally difficult and would require considerable amounts of storage.
However, the query log data is also problematic:
A disadvantage associated with finding compounds in query logs using statistical techniques is that word sequences occurring in query logs may not correspond to compounds in the documents. This is because queries, especially on the web, tend to be abbreviated forms of natural language sequences.
It's clear the answer will be some combination of the two. The key is that the identification is contextual: it is based on the relevant documents returned for the query.
For example, the queries "country western mp3" and "leaving the old country western migration" both have the words "country" and "western" next to each other. Only for the first query, however, is "country western" a representative compound. Segmenting such queries correctly requires some understanding of the meaning of the query. In the second query, the compound "western migration" is more appropriate, although it occurs less frequently in general.
Google Semantic Unit Locator
The method includes generating a list of relevant documents based on individual search terms of the query and identifying a subset of documents that are the most relevant documents from the list of relevant documents. Substrings are identified for the query and a value related to the portion of the subset of documents that contains the substring is generated. Semantic units are selected from the generated substrings based on the calculated values. Finally, the list of relevant documents is refined based on the semantic units.
Example
User runs a query for "leaving the old country western migration"
  1. Generate the list of all substrings longer than one word from the user's query:
    "leaving the," "leaving the old," "leaving the old country," "leaving the old country western," etc.
  2. For the top k (say 30) documents, a fraction is calculated based on how many of those documents each phrase occurs in. For example, if "leaving the" appears in 15 of the 30 documents, then FRAC = 15/30 = 1/2. This fraction may be weighted so that higher-ranking documents count more.
  3. Select the semantic units. First, remove phrases whose FRAC is below a threshold, say 0.25. Next, remove phrases that are subsumed by longer phrases and phrases that overlap with higher-scoring phrases. This leaves "the old country" and "western migration," along with the single search term "leaving." In some cases stop words such as "the" may be removed.
  4. Refine ranking of the originally retrieved results using the discovered meaningful phrases.
This semantic unit identification can be cached, or even computed offline from query logs, and reused for related queries in the future.
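
For readers who want to see the selection procedure end to end, here is a minimal Python sketch of steps 1 through 3, assuming the top documents are available as lowercase text strings. The function names, the 0.25 threshold, the stop-word list, and the greedy overlap resolution are illustrative choices, not the exact implementation described in the patent.

STOP_WORDS = {"the", "a", "an", "of"}  # illustrative stop-word list, not from the patent

def extract_substrings(query):
    """Step 1: all substrings of the query longer than one word."""
    words = query.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 2, len(words) + 1)]

def frac(phrase, top_docs):
    """Step 2: fraction of the top-k documents containing the phrase.
    This could be rank-weighted so higher-ranking documents count more."""
    hits = sum(1 for doc in top_docs if phrase in doc)
    return hits / len(top_docs)

def word_positions(query, phrase):
    """Query word positions covered by the phrase (first occurrence)."""
    q, p = query.split(), phrase.split()
    for i in range(len(q) - len(p) + 1):
        if q[i:i + len(p)] == p:
            return set(range(i, i + len(p)))
    return set()

def select_semantic_units(query, top_docs, threshold=0.25):
    """Step 3: drop phrases below the threshold, then drop phrases that are
    subsumed by, or overlap with, a higher-scoring (or longer) phrase."""
    scored = [(p, frac(p, top_docs)) for p in extract_substrings(query)]
    scored = [(p, s) for p, s in scored if s >= threshold]
    # Prefer higher fractions, then longer phrases.
    scored.sort(key=lambda ps: (ps[1], len(ps[0])), reverse=True)

    units, covered = [], set()
    for phrase, _score in scored:
        positions = word_positions(query, phrase)
        if positions and not (positions & covered):
            units.append(phrase)
            covered |= positions
    # Optional cleanup: drop units made up entirely of stop words.
    return [u for u in units if set(u.split()) - STOP_WORDS]

Query words not covered by any selected phrase (like "leaving" above) would remain as individual search terms, and the originally retrieved results would then be re-ranked against the selected units, as in step 4.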

The Economist features Globalspec

A few weeks ago The Economist featured a story on topic-specific search engines, entitled "Vertical search-engines: Know your subject." Globalspec is the leading example from the story:
GlobalSpec.com, for example, a profitable search-engine for engineers, has 3.5m registered users and signs up another 20,000 each week. “They own that market,” says Charlene Li of Forrester, a consultancy.
It's great to see Globalspec getting well-deserved recognition for a decade of hard work helping engineers build products and inventions that change the world.

The Economist goes on to highlight health as an emerging topic area for vertical search, mentioning MedStory, Healia, Healthline, and Mamma Health.

The real challenge for specialized search engines is that most users still use Google for most of their search activity, and it works 'well enough' even for their specialized searches. As the story notes:
... a vertical search-engine that successfully pairs a broad target market with a complicated topic can do well... But that will mean getting consumers to kick their existing search habits. A study by the Pew Internet & American Life Project, a non-profit research group, found that two-thirds of Americans researching health-related topics online started with a general search-engine. Only 27% went on to a medical site of any kind, let alone a health-search site. “The path to general search engines is well-worn and familiar,” says Susannah Fox of Pew.
Yahoo Shortcuts and Google Base integration with general search engines may be enough to spell the demise of weaker vertical engines that do not continue to differentiate themselves with significantly more relevant and comprehensive coverage of their specialty.

The article concludes with three options for vertical search engines: domination in a topic, death by Google (Base), or acquisition by GYM (Google, Yahoo, Microsoft) or other large media companies seeking to expand into new media.