Wednesday, May 16

Behind Universal Search: Advanced Query Routing and Heterogeneous Result Ranking

Unveiling 'Universal Search'
The new Google 'Universal Search' brings all of Google's content types under one roof, or at least blends them more seamlessly into one set of search results. Marissa Mayer describes the change:
The ultimate goal of universal search is to break down the silos of information that exist on the web and provide the very best answer every time a user enters a query. While we still have a long way to go, today’s announcements are a big step in that direction... Google’s vision for universal search is to ultimately search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results that offers users precisely what they are looking for.
Searching across all of the different heterogeneous content types and ranking all of the results in real time is hard and expensive.

You can read more in Google's blog posts: Behind the scenes with universal search and Universal search: The best answer is still the best answer. Other coverage is on Search Engine Land: Google 2.0: Google Universal Search.

Query Routing and Heterogeneous Result Ranking
So, how do they do that? Well, I don't have exact answers, but there are some clues. A good place to start looking is recent papers by University of Washington alumni (and now Googlers) Alon Halevy, Jayant Madhavan, and company on integrating web search with Google Base:

Web-scale Data Integration: You can only afford to Pay As You Go
See specifically sections 3.1 and 3.2 for how Google begins to go about query mapping, heterogeneous result ranking, source ranking, and generating structured queries from unstructured queries (e.g. Britney Spears is a person, a musician, and a performer, so music and video results may be relevant). Much of this computation can be done offline for common queries, and feedback on relevance can be collected from user behavior:
The prime example of implicit feedback is to record which of the answers presented to the user are clicked on. Such measurements, in aggregate, offer significant feedback to our source selection and ranking algorithm. Observing which sources are viewed in conjunction with each other also offers feedback on whether sources are related or not. Finally, when the user selects a refinement of a given query, we obtain feedback on our query structuring mechanisms.
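To make the click feedback idea concrete, here is a minimal sketch in Python of aggregating clicks per query class and content source, then ranking sources by a smoothed click-through rate. This is my own illustration, not the method from the paper: the class name SourceFeedback, the query class "musician", the source names, and the smoothing priors are all assumptions.

```python
class SourceFeedback:
    """Aggregates implicit click feedback per (query class, source) pair.

    Hypothetical illustration: names and smoothing priors are assumptions,
    not anything described in the paper.
    """

    def __init__(self, prior_clicks=1.0, prior_impressions=2.0):
        # Priors keep rarely shown sources from getting extreme rates.
        self.prior_clicks = prior_clicks
        self.prior_impressions = prior_impressions
        self.impressions = {}
        self.clicks = {}

    def record(self, query_class, shown_sources, clicked_source=None):
        """Log one result page: which sources were shown, which one was clicked."""
        for source in shown_sources:
            key = (query_class, source)
            self.impressions[key] = self.impressions.get(key, 0) + 1
        if clicked_source is not None:
            key = (query_class, clicked_source)
            self.clicks[key] = self.clicks.get(key, 0) + 1

    def click_rate(self, query_class, source):
        """Smoothed click-through rate for a source within a query class."""
        key = (query_class, source)
        clicks = self.clicks.get(key, 0) + self.prior_clicks
        shown = self.impressions.get(key, 0) + self.prior_impressions
        return clicks / shown

    def rank_sources(self, query_class, sources):
        """Order candidate sources by how often users clicked them."""
        return sorted(
            sources,
            key=lambda s: self.click_rate(query_class, s),
            reverse=True,
        )


feedback = SourceFeedback()
for _ in range(3):
    feedback.record("musician", ["web", "video", "news"], clicked_source="video")
feedback.record("musician", ["web", "video", "news"], clicked_source="web")

ranked = feedback.rank_sources("musician", ["web", "video", "news"])
```

A real system would presumably also correct for position bias and decay old clicks over time; this sketch only shows the aggregation step.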
For more on Query Routing and integrating results from heterogeneous sources see also the following:

Query Routing for Web Search Engines: Architecture and Experiments, also from UW, by Oren Etzioni.

Alon Halevy and others working on Google Base also touch on these problems in Structured Data Meets the Web: A Few Observations; see Section 3, Integrating Structured and Unstructured Data. As an introduction:
The reality of web search characteristics dictates the following principle to any project that tries to leverage structured data in web search: querying structured data and presenting answers based on structured data must be seamlessly integrated into traditional web search.
Sounds like Universal Search to me. See my previous post Integrating a Database of Everything with Web Search for more details on that paper.

The above papers don't speak to Universal Search directly; they mostly relate to selecting different types of objects from Google Base. However, the same principles can be applied to integrating other kinds of heterogeneous data with web content as well.

Data 'Silos' Continue to Abound
It is worth noting that more and more organizations are finding themselves with a similar problem: lots of data silos to search and heterogeneous data to rank. For example, at Globalspec we have the same problem as Google. In fact, we just unveiled a new silo today: PartFinder, for part number search. At Globalspec we have a multitude of different content types mixing structured and unstructured content. We have our 'content petals': the Engineering Web, parts and services, engineering news, patents, material properties, etc. Figuring out which content is most appropriate for a given query is an important, non-trivial task.
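As a toy illustration of what routing a query to the right silo can look like, here is a rule-based sketch in Python. It is purely hypothetical and does not describe how Globalspec or Google actually route queries: the source names, the keyword hints, and the part number heuristic are all assumptions made for the example.

```python
import re

# Assumed heuristic: a token that mixes letters and digits and is at least
# four characters long "looks like" a part number (e.g. LM317T, 6204-2RS).
PART_NUMBER_LIKE = re.compile(
    r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9][A-Za-z0-9\-/.]{3,}$"
)

# Hypothetical keyword hints mapping query words to content sources.
KEYWORD_HINTS = {
    "patent": "patents",
    "patents": "patents",
    "news": "news",
}


def route_query(query):
    """Return candidate sources for a query, most specific first.

    The general web index is always included last as a fallback.
    """
    tokens = query.split()
    sources = []

    # Part-number-looking tokens suggest the parts catalog.
    if any(PART_NUMBER_LIKE.match(token) for token in tokens):
        sources.append("parts")

    # Explicit keywords suggest specialized collections.
    for token in tokens:
        hinted = KEYWORD_HINTS.get(token.lower())
        if hinted and hinted not in sources:
            sources.append(hinted)

    sources.append("web")
    return sources


part_route = route_query("LM317T datasheet")
patent_route = route_query("bearing patent")
plain_route = route_query("stainless steel")
```

Hand-written rules like these break down quickly, which is why learning from user behavior, as the papers above discuss, is attractive for this problem.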

Pushing towards true 'Universal Search' is a grand vision, and it won't happen overnight. Google is taking steps in the right direction with this latest update, and I'm sure other search engines will follow.