Friday, March 27

Statistical Learning of Semantics from Web Data

Greg wrote a post on an article in the March/April 2009 issue of IEEE Intelligent Systems, The Unreasonable Effectiveness of Data, by Alon Halevy, Peter Norvig, and Fernando Pereira. It covers much the same ground as Peter's CIKM '08 industry day talk, Statistical Learning as the Ultimate Agile Development Tool.

In it the Googlers cover statistical learning of semantic interpretations from very large quantities of data. They highlight the TextRunner project and Michael Cafarella's related work at UW on extracting schemas from tables on the web. They also point to Marius Pasca's work, Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds, which demonstrates extracting entity classes from free web text and large query logs.
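
To make the entity-class idea concrete, here is a minimal sketch of the sort of pattern-based extraction such systems build on. The "such as" pattern, the toy sentences, and the crude class-name heuristic are all illustrative assumptions, not the actual TextRunner or Pasca pipelines, which operate over web-scale text and query logs.

import re
from collections import defaultdict

# Toy sentences standing in for free web text (illustrative only).
sentences = [
    "Many programming languages such as Python, Java and C++ are widely taught.",
    "European capitals such as Paris, Berlin, and Madrid attract tourists.",
    "Search engines such as Google and Bing index billions of pages.",
]

classes = defaultdict(set)
for sentence in sentences:
    if " such as " not in sentence:
        continue
    left, right = sentence.split(" such as ", 1)
    # Crude heuristic: the last two words before "such as" name the class.
    class_name = " ".join(left.split()[-2:]).lower()
    # Collect leading capitalized tokens as instances, stopping at ordinary prose.
    for token in re.split(r",\s*(?:and\s+)?|\s+and\s+|\s+", right):
        token = token.strip(".,")
        if token and token[0].isupper():
            classes[class_name].add(token)
        else:
            break

for class_name, instances in sorted(classes.items()):
    print(class_name, "->", sorted(instances))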

A few excerpts. First, on leveraging schemas extracted from the myriad tables on the web:
What we need are methods to infer relationships between column headers or mentions of entities in the world. These inferences may be incorrect at times, but if they’re done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data. Interestingly, here too Web-scale data might be an important part of the solution. The Web contains hundreds of millions of independently created tables and possibly a similar number of lists that can be transformed into tables. These tables represent structured data in myriad domains.
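
To ground the column-header idea, here is a minimal sketch of one way such cross-table inferences could be scored. The toy schemas, the hand-written synonym map, and the 0.5 threshold are illustrative assumptions rather than the article's method; a real system would learn attribute correspondences from the corpus itself.

from itertools import combinations

# Toy stand-ins for schemas recovered from independently created web tables;
# the real corpus holds hundreds of millions of them.
table_schemas = {
    "used-car-listings": ["Make", "Model", "Year", "Price"],
    "vehicle-reviews": ["Manufacturer", "Model", "Year", "Rating"],
    "city-populations": ["City", "Country", "Population"],
}

# Hand-written stand-in for learned attribute correspondences.
synonyms = {"manufacturer": "make"}

def normalize(header):
    header = header.strip().lower()
    return synonyms.get(header, header)

def schema_similarity(headers_a, headers_b):
    # Jaccard overlap between the two normalized header sets.
    a = {normalize(h) for h in headers_a}
    b = {normalize(h) for h in headers_b}
    return len(a & b) / len(a | b)

# Propose a link between any pair of tables whose schemas overlap enough.
for (name_a, cols_a), (name_b, cols_b) in combinations(table_schemas.items(), 2):
    score = schema_similarity(cols_a, cols_b)
    if score >= 0.5:
        print("link", name_a, "<->", name_b, "overlap", round(score, 2))
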
In the end they advise:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail... See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words.
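
As a toy illustration of that "keep the data" stance (the three-line corpus and the suggest_next helper are invented for this sketch, not anything from the article), the "model" below is nothing more than the bigram counts themselves, so it grows with the data instead of compressing it into a fixed set of parameters:

from collections import Counter, defaultdict

# Unlabeled text standing in for a large web corpus (illustrative only).
corpus = [
    "new york city is large",
    "new york state is large",
    "new jersey is small",
]

# The "model" is just the raw bigram counts; its size grows with the data,
# which is the nonparametric flavor the advice points at.
bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        bigram_counts[prev][nxt] += 1

def suggest_next(word, k=3):
    # Answer queries by looking up stored counts, not fitted parameters.
    return bigram_counts[word].most_common(k)

print(suggest_next("new"))  # [('york', 2), ('jersey', 1)]
print(suggest_next("is"))   # [('large', 2), ('small', 1)]
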
If you're looking for Big Data, two good starting places are the new billion-document web corpus and the Million Book Project.
