Wednesday, July 25

Google's Semantic Unit Locator

Bill Slawski over at SEO-By-The-SEA posted an article today on Google's recent patent on "Semantic Units," also known in the NLP world as Multiword Expressions (MWEs). The patent lays out a system for identifying useful phrases ("compounds") based on queries and the documents retrieved during the search.

Background: Finding Meaningful Phrases
First, the authors lay out the problems with mining meaningful phrases from web documents alone or from query logs alone. On the document side:
The disadvantage with this [document] approach is that it is inefficient, because there are many more compounds in the corpus than would typically occur in user queries. Thus, only a small fraction of the detected compounds are useful in practice... Identifying all compounds on the web is computationally difficult and would require considerable amounts of storage.
However, the query log data is also problematic:
A disadvantage associated with finding compounds in query logs using statistical techniques is that word sequences occurring in query logs may not correspond to compounds in the documents. This is because queries, especially on the web, tend to be abbreviated forms of natural language sequences.
So the answer has to be some combination of the two. The key is that the segmentation is contextual: it is based on the relevant documents returned for the query.
For example, the queries "country western mp3" and "leaving the old country western migration" both have the words "country" and "western" next to each other. Only for the first query, however, is "country western" a representative compound. Segmenting such queries correctly requires some understanding of the meaning of the query. In the second query, the compound "western migration" is more appropriate, although it occurs less frequently in general.
Google Semantic Unit Locator
The method includes generating a list of relevant documents based on individual search terms of the query and identifying a subset of documents that are the most relevant documents from the list of relevant documents. Substrings are identified for the query and a value related to the portion of the subset of documents that contains the substring is generated. Semantic units are selected from the generated substrings based on the calculated values. Finally, the list of relevant documents is refined based on the semantic units.
As a walkthrough, suppose a user runs the query "leaving the old country western migration":
  1. Generate the list of all phrases longer than one word from the user's query:
    "leaving the," "leaving the old," "leaving the old country," "leaving the old country western," etc.
  2. For the top k (say 30) documents, calculate for each phrase the fraction of documents it occurs in. For example, if "leaving the" is in 15 documents, then FRAC = 15/30 = 1/2. The fraction may be weighted so that higher-ranking documents count for more.
  3. Select the semantic units. First, remove phrases whose FRAC is below a threshold, say 0.25. Next, remove phrases that are subsumed by longer phrases and phrases that overlap with higher-scoring phrases. This leaves "the old country" and "western migration," along with the single search term "leaving." In some cases stop words such as "the" may be removed.
  4. Refine ranking of the originally retrieved results using the discovered meaningful phrases.
This semantic unit identification can be saved, or even computed offline from query logs, and reused for related queries in the future.
