Wednesday, October 26

Google Clustering and Classification

I must've missed this post on Battelle's SearchBlog:
Peter Norvig's Demo at Web 2.0
Here is SE Watch's coverage.

Other web 2.0 highlights.

Named Entity Extraction, Clustering, and new UI innovations? Talk about a lot to bite off in one presentation...

Here is something he said that I found really interesting that I have been pondering recently:
"We break text into sentences and then match sentences against patterns. We discard noisy data and regularize over names. We also use the relations of concepts and the nesting sets of concepts to understand the concepts."

Talk about a teaser. There is so much there that it is almost meaningless. I really want to be interested, to learn something from that, but Peter is a master of marketing speak: lots of words that are interesting, without much substance. How? How do you break sentences up? What are the patterns? What is the noise? No secret sauce.
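To make my own questions a bit more concrete, here is a toy sketch of what "break text into sentences and match against patterns" could look like. To be clear, this is my guess at the general idea, not Google's method: the regex splitter, the `ROLE_PATTERN` template, and the sample text are all made up for illustration.

```python
import re

# Naive splitter (an assumption, not a real tokenizer): break after ., ! or ?
# when followed by whitespace and a capital letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# Hypothetical pattern: "<Name> is the <role> of <Name>".
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
ROLE_PATTERN = re.compile(rf"({NAME}) is the ([a-z ]+?) of ({NAME})")


def split_sentences(text):
    """Split text into sentences using the naive boundary regex."""
    return [s.strip() for s in SENTENCE_BOUNDARY.split(text) if s.strip()]


def extract_relations(text):
    """Return (subject, role, object) tuples for sentences that fit the pattern."""
    relations = []
    for sentence in split_sentences(text):
        match = ROLE_PATTERN.search(sentence)
        if match:
            relations.append(match.groups())
    return relations


sample = "Alice Smith is the founder of Acme Labs. She gave a demo. It was short!"
print(split_sentences(sample))
print(extract_relations(sample))
```

Sentences that fit no pattern simply fall through, which is one crude reading of "discard noisy data". A real system would need far more patterns and a much smarter splitter (abbreviations alone break this one).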

Is this presentation online somewhere? I am dying to see it. I don't think Google has published papers on much (or any!) of this.

I managed to find an MP3 on the unofficial Google blog. I'll try to listen soon.

There is some interesting work going on to identify and remove irrelevant parts of a page (things like navigational links, targeted ads, etc.) and focus on the content.
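One simple heuristic I have seen for this is link density: if most of the text in a block sits inside links, it is probably navigation rather than content. Here is a minimal sketch using only Python's standard `html.parser`. The choice of block tags, the `max_link_density` threshold of 0.5, and the class and function names are my own assumptions, not taken from any particular paper.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumption for this sketch).
BLOCK_TAGS = {"p", "div", "li", "td"}


def _normalize(text):
    """Collapse runs of whitespace so character counts are comparable."""
    return " ".join(text.split())


class BlockCollector(HTMLParser):
    """Collects (text, link_chars) for each block-level chunk of a page."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._text = []
        self._link_chars = 0
        self._in_link = 0

    def flush(self):
        text = _normalize("".join(self._text))
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(_normalize(data))


def content_blocks(html, max_link_density=0.5):
    """Keep blocks whose share of linked text is at or below the threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    return [text for text, link_chars in parser.blocks
            if link_chars / len(text) <= max_link_density]


page = ('<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>This is the actual article text with one <a href="#">link</a> in it.</p>')
print(content_blocks(page))
```

The nav bar is almost entirely link text, so it gets dropped, while the paragraph with a single inline link survives. Real pages are messier (nested blocks, ads without links), so treat this as a starting point only.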

What I would like to know is: What are Google's categories for Bayesian classification? Were the categories 'automatically discovered' a la Clusty, or were pages classified into a more manual taxonomy a la DMOZ and Globalspec?

... there is a lot more here on text classification and AdSense. Peter Norvig wrote the book on AI (literally) so I guess I shouldn't be surprised. Did I mention it is a good book? We used it as the text for my AI course at Union.

Some people & research on text classification, entity extraction, and clustering:
The Bow Toolkit -- a C library from 1996. Talk about ancient, but it is an early example. The author, Andrew McCallum, is one of the pioneers in entity extraction, a former VP of R&D at Whizbang Labs, and a major guy at Flipdog. His research students are publishing some very interesting papers.

Learning to Cluster Web Search Results by MS Asia. They actually have an online demo of their web search clustering technology. It's not bad, but it's a little slow. It's also not perfect. For example, a search on Jaguar returns lots of clusters on cars, but only one on MacOS and none on cats, helicopters, or fighter jets. Clusty does a much better job.

Word Clustering and disambiguation based on co-occurrence data from MS Research Japan.

If you want to drink from a firehose, here is Stanford's CS276 class on IR. This class covers most of the above-mentioned topics with a plethora of research to keep you busy for a while. The old course website also has interesting material -- including lecture notes!

Haven't looked too closely at this, but I ran across it in my browsing stream and it looks interesting:
Techniques for Improving the performance of Naive Bayes
... now what about SVMs? Naive Bayes and SVMs seem to be the two text classification techniques most in vogue in academia today. More discussion on text classification to follow...
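For anyone who hasn't seen Naive Bayes up close, here is a bare-bones multinomial version with add-one (Laplace) smoothing. This is a textbook-style sketch, not anything from the paper linked above. The class name, the whitespace tokenizer, and the tiny training set are all my own inventions for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing (toy version)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant
        self.class_docs = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, docs):
        for text, label in docs:
            words = text.lower().split()
            self.class_docs[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # log prior + sum of log likelihoods, to avoid underflow
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for word in text.lower().split():
                if word in self.vocab:  # words never seen in training are skipped
                    count = self.word_counts[label][word]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, just enough to separate two senses of "jaguar".
training = [
    ("jaguar car engine speed", "cars"),
    ("car dealer price", "cars"),
    ("jaguar cat jungle prey", "animals"),
    ("big cat habitat jungle", "animals"),
]
clf = NaiveBayes()
clf.train(training)
print(clf.predict("jaguar jungle habitat"))
print(clf.predict("car price"))
```

The "naive" part is the assumption that words are independent given the class, which is plainly false for real text but works surprisingly well in practice. Note that the categories here are fixed by the training labels, which is the manual-taxonomy approach rather than the automatically-discovered one.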

