Structured Data Meets the Web: A Few Observations.
The article goes into detail about the "Deep Web" (also referred to as the "hidden web") and the different strategies for organizing this type of content and making it searchable. (See previous work on surfacing content: Downloading Hidden Web Content and Crawling The Hidden Web.) Greg has a nice summary of their deep web discussion, with the conclusion that Google prefers to bring the data in-house rather than create a federated ("meta") search system that submits queries to other engines.
Organizing a database of everything
The paper says that domain-specific search engines that use structure effectively lead to better ranking and refinement of search results (as seen in travel, jobs, local services, engineering, etc.). However, scaling information extraction and data integration beyond a few well-defined domains is hard.
One could argue that the problem is simply that there are many domains (perhaps a few hundred) and we should simply model them one by one. However, the issue is much deeper. Data on the Web encompasses much of human knowledge, and therefore it is not even possible to model all the possible domains. Furthermore, it is not even clear what constitutes a single domain. Facets of human knowledge and information about the world are related in very complex ways, making it nearly impossible to decide where one domain ends and another begins...
The solution the paper proposes for integrating all of these domains is not to create an integrated "super schema" that models all the relationships, but to create a system that supports multiple overlapping schemas and models uncertainty in every phase: 1) mapping keyword queries to structured queries on data feeds, 2) mapping between schemas from different data sources, and 3) the actual values of the data in the system and their sources. Thus, the system has an estimate of how closely a user's query matches a type of structured data, how the data relates to other similar data, and how reliable the data is.
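To make the idea of carrying uncertainty through every phase concrete, here is a minimal sketch. It is not the paper's method: the `Candidate` class, its field names, and the choice to multiply the three confidences (which assumes they are independent) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A candidate answer with one confidence value per phase (all hypothetical)."""
    answer: str
    p_query_mapping: float   # confidence that the keyword query maps to this structured query
    p_schema_mapping: float  # confidence that the mapping between source schemas is right
    p_data: float            # confidence in the stored value and in its source

    def score(self) -> float:
        # Simplifying assumption: the three uncertainties are independent,
        # so the overall confidence is their product.
        return self.p_query_mapping * self.p_schema_mapping * self.p_data


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least confident overall."""
    return sorted(candidates, key=lambda c: c.score(), reverse=True)
```

A candidate that matches the query very well but comes from an unreliable source can end up ranked below a weaker match from a reliable source, which is the practical effect of tracking all three kinds of uncertainty rather than just one.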
In the system described, schemas from different domains are not merged but co-exist, thus sidestepping major data integration problems.
Instead of necessarily creating mappings between data sources and a virtual schema, we will rely much more heavily on schema clustering. Clustering lets us measure how close two schemas are to each other, without actually having to map each of them to a virtual schema in a particular domain. As such, schemas may belong to many clusters, thereby gracefully handling complex relationships between domains. Keyword queries will be mapped to clusters of schemas, and at that point we will try to apply approximate schema mappings in order to leverage data from multiple sources to answer queries.
In some ways this is similar to a "folksonomy" approach in which tags have structure: imagine if Flickr tags let you specify artist, price, or color.
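As a rough illustration of schema clustering with overlapping membership, the sketch below measures closeness as Jaccard overlap of attribute names and gives each schema a neighborhood cluster. The paper does not specify a similarity measure or clustering algorithm, so both choices, along with the example schemas and the 0.3 threshold, are assumptions made purely for illustration.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Closeness of two schemas: shared attribute names over all attribute names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlapping_clusters(schemas: dict[str, set[str]], threshold: float) -> dict[str, set[str]]:
    """Build one cluster per schema: itself plus every schema at least `threshold` close.

    A schema can appear in several clusters, and no global schema is ever built.
    """
    return {
        name: {other for other, other_attrs in schemas.items()
               if jaccard(attrs, other_attrs) >= threshold}
        for name, attrs in schemas.items()
    }


# Hypothetical schemas, each described only by its attribute names.
SCHEMAS = {
    "car_ads": {"make", "model", "price", "year"},
    "car_reviews": {"make", "model", "rating"},
    "product_reviews": {"model", "rating", "reviewer"},
}

clusters = overlapping_clusters(SCHEMAS, threshold=0.3)
```

Here `car_reviews` lands in the clusters around both `car_ads` and `product_reviews`, while those two are not close enough to share a cluster directly. That is the kind of cross-domain relationship a single virtual schema would struggle to express.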
Many of these principles are visible in the design of Google Base. For more details on these principles, see Principles of Dataspace Systems and From Databases to Dataspaces: A New Abstraction for Information Management. Alon also gave a PODS 2006 keynote on the topic, Crossing the Structure Chasm. Once structured data is organized, the next problem is integrating structured data search with unstructured data search.
Integrating search of structured and unstructured data
The article says that structured data must be integrated seamlessly with existing web search. This means:
- Queries will be posed (at least initially) as keywords. Users will not pose complex queries of any form. At best, users will pick refinements (by filling out a form) that might be presented along with the answers to a keyword query.
- Queries will be posed from one main search destination, e.g., the main page of a search engine. Users will find it hard to memorize the existence of specialized search engines, especially ones that they use occasionally. The onus is hence on the general search engine to detect the correct user intention and automatically activate specialized searches or redirect to specialized sources.
- Answers from structured data sources need to (as far as possible) appear alongside regular web search results. Ideally, they should not be distinguished from the other results. While the research community might care about the distinction between structured and unstructured data, the vast majority of search users do not appreciate the distinction and only care if the results meet their requirement.
There are many open research problems in the points above, in UI design (faceted search), query routing, and query mapping (mapping keyword queries to structured queries). Hopefully we will see more research on these problems in the future and see how they play out in Google Base.
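The query routing and result blending requirements can be made concrete with a toy sketch. Everything in it is hypothetical: the trigger-word table, the source names, and the assumption that web and structured results arrive with comparable scores. Real intent detection is an open problem, and a keyword lookup like this only shows where that detection would sit.

```python
from typing import List, Tuple

# Hypothetical specialized sources and the words assumed to signal them.
SPECIALIZED_SOURCES = {
    "jobs": {"job", "jobs", "hiring", "salary"},
    "travel": {"flight", "flights", "hotel", "hotels"},
}


def route(query: str) -> List[str]:
    """Return the specialized sources whose trigger words appear in a keyword query."""
    tokens = set(query.lower().split())
    return sorted(name for name, triggers in SPECIALIZED_SOURCES.items()
                  if tokens & triggers)


def blend(web: List[Tuple[float, str]], structured: List[Tuple[float, str]]) -> List[str]:
    """Merge (score, title) pairs into a single list ordered by score.

    The output carries no marker of origin, so structured answers sit
    alongside regular web results.
    """
    return [title for _, title in sorted(web + structured, reverse=True)]
```

The user types keywords in one place, `route` decides which specialized sources to consult, and `blend` returns a single list in which the two kinds of result are indistinguishable.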
Don't miss Alon's blog, and specifically his recent posts on this topic: Structured Data and the Web, and Uncertainty and Data Integration.