Friday, March 27

Statistical Learning of Semantics from Web Data

Greg wrote a post on an article in the April 2009 IEEE Intelligent Systems, The Unreasonable Effectiveness of Data by Alon Halevy, Peter Norvig, and Fernando Pereira. It covers ground similar to Peter's CIKM '08 industry day talk, Statistical Learning as the Ultimate Agile Development Tool.

In it, the Googlers cover statistical learning of semantic interpretations from large quantities of data. They highlight the TextRunner project and Michael Cafarella's related work at UW on extracting schemas from tables on the web. They also highlight Marius Pasca's work, Organizing and Searching the World Wide Web of Facts. Step Two: Harnessing the Wisdom of the Crowds, which demonstrates extracting entity classes from free web text and large query logs.

A few excerpts. First, on leveraging the schemas extracted from the myriad tables on the web:
What we need are methods to infer relationships between column headers or mentions of entities in the world. These inferences may be incorrect at times, but if they’re done well enough we can connect disparate data collections and thereby substantially enhance our interaction with Web data. Interestingly, here too Web-scale data might be an important part of the solution. The Web contains hundreds of millions of independently created tables and possibly a similar number of lists that can be transformed into tables. These tables represent structured data in myriad domains.
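As a toy illustration of what such column-level inference might look like, here is a hedged sketch (my own, not the authors' method; the tables, headers, and Jaccard threshold are all invented): two independently created tables use different headers for the same attribute, and the overlap in their cell values supplies the evidence to link them.

```python
def jaccard(a, b):
    """Set overlap between two collections of cell values."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_columns(table1, table2, threshold=0.5):
    """Pair up columns whose value sets overlap enough to suggest
    they describe the same attribute. Each table is a dict mapping
    column header -> list of cell values."""
    matches = []
    for h1, v1 in table1.items():
        for h2, v2 in table2.items():
            score = jaccard([x.lower() for x in v1], [x.lower() for x in v2])
            if score >= threshold:
                matches.append((h1, h2, round(score, 2)))
    return matches

# Two independently created tables, different headers for the same attribute
movies = {"Film": ["Jaws", "Alien", "Heat"], "Year": ["1975", "1979", "1995"]}
releases = {"Title": ["jaws", "alien", "rocky"],
            "Director": ["Spielberg", "Scott", "Avildsen"]}

# "Film" and "Title" share enough values to be linked despite different headers
print(match_columns(movies, releases))  # [('Film', 'Title', 0.5)]
```

At web scale the same idea would need value normalization and far more robust scoring, but the point of the quote stands: the data itself, not a hand-built ontology, supplies the evidence for the connection.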
In the end they advise:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail... See how far you can go by tying together the words that are already there, rather than by inventing new concepts with clusters of words.
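To make the "nonparametric, tie together the words already there" advice concrete, here is a toy sketch (the corpus and scoring are invented for illustration): instead of clustering words into new concepts, it keeps the raw co-occurrence counts around as the model and reads relatedness straight off them.

```python
from collections import Counter

def related_terms(docs, term, top=3):
    """Nonparametric relatedness: the 'model' is just the raw
    co-occurrence counts, kept in full rather than summarized."""
    cooc = Counter()
    for doc in docs:
        words = set(doc.lower().split())
        if term in words:
            for w in words - {term}:
                cooc[w] += 1
    return [w for w, _ in cooc.most_common(top)]

docs = [
    "special relativity einstein",
    "general relativity einstein gravity",
    "quantum mechanics planck",
]
print(related_terms(docs, "relativity", 1))  # ['einstein']
```

More data simply means more counts; nothing about the model has to be re-fit or re-summarized, which is the appeal of the nonparametric approach at web scale.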
If you're looking for Big Data, two good starting places are the new billion document web corpus and the Million Book Project.

Thursday, March 26

TREC Billion Document Web Corpus Available

Exciting news!

The corpus of 1 billion web documents, ClueWeb09, is now available. The upcoming TREC 2009 will use it. You can also see the crawl stats. It ships on four 1.5 TB hard drives.

Let the fun begin!

Tuesday, March 24

Greg Linden on evaluating recommendation systems like search

Greg Linden, creator of the original Amazon recommendation system, wrote an article on the new Communications of the ACM (CACM) blog on the difficulty of evaluating recommendation systems.

Specifically, he picks on root-mean-square error (RMSE), the metric used to evaluate the Netflix Prize. He draws a parallel with web search, where what you really want is to find the top N best movies for you:
Web search engines primarily care about precision (relevant results in the top 10 or top 3). They only care about recall when someone would notice something they need missing from the results they are likely to see. Search engines do not care about errors scoring arbitrary documents, just their ability to find the top N documents.
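A toy Python sketch of the distinction (the ratings and both models are invented): one model can win on RMSE across all items while losing on precision in the top N, which is the part users actually see.

```python
import math

def rmse(pred, truth):
    """Root-mean-square error over all item predictions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

def precision_at_k(pred, truth, k, relevant=4.0):
    """Fraction of the top-k predicted items that are truly relevant."""
    top = sorted(range(len(pred)), key=lambda i: pred[i], reverse=True)[:k]
    return sum(1 for i in top if truth[i] >= relevant) / k

truth  = [5.0, 4.5, 2.0, 1.0, 4.0, 1.5]   # true ratings for six movies
pred_a = [3.0, 3.4, 3.5, 1.5, 3.3, 2.0]   # small errors, but misranks the best movies
pred_b = [5.0, 4.8, 4.0, 3.5, 4.5, 3.8]   # bigger errors, but the top 3 are right

# A wins on RMSE (~1.19 vs ~1.63), yet B wins on precision@3 (1.0 vs ~0.67)
print(rmse(pred_a, truth), rmse(pred_b, truth))
print(precision_at_k(pred_a, truth, 3), precision_at_k(pred_b, truth, 3))
```

Optimizing RMSE rewards model A's uniformly small errors on movies the user will never see, while model B, which RMSE penalizes, is the one that gets the recommendations right.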
Greg would welcome your thoughts and comments, and the new CACM blog could use more subscribers.

Google concept-based related searches and dynamic KWIC

Google announced on their blog that they are making improvements to related searches:

The first highlight is that Google is using the relatedness of terms:
For example, if you search for [principles of physics], our algorithms understand that "angular momentum," "special relativity," "big bang" and "quantum mechanics" are related terms that could help you find what you need.
SearchEngineLand has a post on the topic, which provides a very high-level overview of how it works. Ori Allon developed the underlying technology, called Orion, which Google acquired in 2006. However, I can't seem to find any publications describing the details, which is a bit unusual for academic work.

In other news, Google is also adjusting its snippets to be sensitive to query length: longer queries may return longer keyword-in-context snippets.
When you enter a longer query, with more than three words, regular-length snippets may not give you enough information and context. In these situations, we now increase the number of lines in the snippet to provide more information and show more of the words you typed in the context of the page.
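A minimal sketch of how query-length-sensitive KWIC snippets might work (the window sizes, the three-word threshold, and the term matching are my own assumptions, not Google's algorithm):

```python
def kwic_snippet(text, query, short_window=4, long_window=8):
    """Show query terms in context; longer queries get a wider window,
    a stand-in for 'more lines in the snippet'."""
    words = text.split()
    qterms = query.lower().split()
    window = long_window if len(qterms) > 3 else short_window
    terms = set(qterms)
    pieces, covered_until = [], -1
    for i, w in enumerate(words):
        if w.lower().strip(".,") in terms and i > covered_until:
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            pieces.append(" ".join(words[lo:hi]))
            covered_until = hi - 1  # avoid overlapping windows for nearby hits
    return " ... ".join(pieces)

text = ("The theory of special relativity was published by Einstein "
        "in 1905 and changed physics forever")
print(kwic_snippet(text, "special relativity"))
print(kwic_snippet(text, "special relativity published by Einstein"))
```

The second, longer query crosses the three-word threshold and gets a wider context window, echoing the behavior Google describes.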
These are small changes, but still interesting.