Wednesday, October 26

Google Clustering and Classification

I must've missed this post on Battelle's SearchBlog:
Peter Norvig's Demo at Web 2.0
Here is SE Watch's coverage.

Other Web 2.0 highlights.

Named Entity Extraction, Clustering, and new UI innovations? Talk about a lot to bite off in one presentation...

Here is something he said that I found really interesting that I have been pondering recently:
"We break text into sentences and then match sentences against patterns. We discard noisy data and regularize over names. We also use the relations of concepts and the nesting sets of concepts to understand the concepts."

Talk about a teaser. There is so much there that it is almost meaningless. I really want to be interested, to learn something from that, but Peter is a master of marketing speak: lots of interesting words without much substance. How? How do you break text into sentences? What are the patterns? What counts as noise? No secret sauce.
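Since the quote gives no details, here is a minimal sketch of what "break text into sentences, match against patterns, discard the noise" could look like in its most naive form. This is my own guess at the general idea, not Google's method: the regex splitter, the "X is a Y" pattern, and the function names are all assumptions made up for illustration.

```python
import re

# Naive sentence splitter: break on whitespace that follows ., ! or ?
# (an assumption; real systems handle abbreviations, quotes, etc.).
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Hypothetical "X is a Y" pattern, purely illustrative.
IS_A = re.compile(
    r"^(?P<name>[A-Z][\w ]*?) is an? (?P<category>[\w ]+?)[.!?]?$"
)


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def extract_is_a(text):
    """Return (name, category) pairs for sentences matching the pattern."""
    facts = []
    for sentence in split_sentences(text):
        match = IS_A.match(sentence)
        if match:  # sentences matching no pattern are dropped as noise
            facts.append((match.group("name"), match.group("category")))
    return facts


if __name__ == "__main__":
    sample = "Jaguar is a car maker. What a day! Clusty is a search engine."
    print(extract_is_a(sample))
```

Even this toy version shows why the details matter: everything interesting is hidden in which patterns you pick and what you treat as noise.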

Is this presentation online somewhere? I am dying to see it. I don't think Google has published papers on much (or any!) of this.

I managed to find an MP3 on the unofficial Google blog. I'll try to listen soon.

There is some interesting work going on to identify and remove irrelevant parts of a page (things like navigational links, targeted ads, etc.) and focus on the content.
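One common heuristic for this kind of cleanup is link density: a block whose text is mostly inside links is probably navigation, not content. The sketch below is only an illustration of that heuristic under my own assumptions (the block tags, the 0.5 threshold, and the names `BlockCollector` and `extract_content` are all made up); it is not how Google or any particular paper does it.

```python
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Group page text into blocks and track how much of it is link text."""

    # Assumed set of block-level tags; a real system would use more.
    BLOCK_TAGS = {"p", "div", "li", "td", "ul", "nav"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._link_depth = 0
        self._start_block()

    def _start_block(self):
        self.blocks.append({"text": [], "total": 0, "linked": 0})

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._start_block()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._start_block()
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        block = self.blocks[-1]
        block["text"].append(text)
        block["total"] += len(text)
        if self._link_depth:
            block["linked"] += len(text)


def extract_content(html, max_link_density=0.5):
    """Keep blocks whose share of link text is at most max_link_density."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = []
    for block in parser.blocks:
        if block["total"] and block["linked"] / block["total"] <= max_link_density:
            kept.append(" ".join(block["text"]))
    return kept


if __name__ == "__main__":
    page = (
        '<div><a href="/">Home</a> <a href="/about">About</a></div>'
        '<p>Norvig showed a clustering demo with '
        '<a href="#">named entities</a> at Web 2.0.</p>'
    )
    print(extract_content(page))
```

The navigation block is 100% link text and gets dropped; the paragraph has one inline link and survives.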

What I would like to know is: what are Google's categories for Bayesian classification? Were the categories 'automatically discovered' a la Clusty, or were pages classified into a more manual taxonomy a la DMOZ and Globalspec?

... there is a lot more here on text classification and AdSense. Peter Norvig wrote the book on AI (literally), so I guess I shouldn't be surprised. Did I mention it is a good book? We used it as the text for my AI course at Union.

Some people & research on text classification, entity extraction, and clustering:
The Bow Toolkit -- a C library from 1996. Talk about ancient, but it is an early example. The author, Andrew McCallum, is one of the pioneers in entity extraction, a former VP of R&D at Whizbang Labs, and a major guy at Flipdog. His research students are publishing some very interesting papers.

Learning to Cluster Web Search Results by MS Asia. They actually have an online demo of their web search clustering technology. It's not bad, but it's a little slow. It's also not perfect. For example, a search on Jaguar returns lots of clusters on cars, but only one on MacOS and none on cats, helicopters, or fighter jets. Clusty does a much better job.

Word Clustering and disambiguation based on co-occurrence data from MS Research Japan.

If you want to drink from a firehose, here is Stanford's CS276 class on IR. This class covers most of the above-mentioned topics, with a plethora of research to keep you busy for a while. The old course website also has interesting material -- including lecture notes!

Haven't looked too closely at this, but I ran across it in my browsing stream and it looks interesting:
Techniques for Improving the performance of Naive Bayes
... now what about SVMs? Naive Bayes and SVMs seem to be the two text classification techniques most in vogue in academia today. More discussion of text classification to follow...
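For anyone who hasn't seen Naive Bayes applied to text, here is a minimal sketch of the textbook multinomial version with add-one (Laplace) smoothing. It is not taken from the linked paper or from anything Google has described; the class name, the whitespace tokenizer, and the toy training data are my own assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log prior: fraction of training docs with this label
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # add-one smoothing so unseen words do not zero the score
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("jaguar car engine speed", "cars")
    nb.train("luxury car dealer price", "cars")
    nb.train("jaguar cat jungle prey", "animals")
    nb.train("big cat habitat jungle", "animals")
    print(nb.classify("fast car engine"))
    print(nb.classify("jungle cat"))
```

Working in log space avoids multiplying many tiny probabilities together, which is the usual way to keep the arithmetic from underflowing on long documents.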
