Friday, September 17

A lesson in defining topic-based communities

There is a post on the Stack Overflow blog about how they are managing communities, Merging Season. At the heart of the discussion: What is the right size of domain for a topic-based community? They are against one giant community, as they say:
Yahoo! Answers. Monumentally popular, enormous traffic, and containing absolutely no useful information, Yahoo! Answers is actually more of a teenage chat room than a place to get real answers.
They also highlight failed attempts to merge the Ubuntu and Unix community sites into a single community. The process of defining a "topical community" reminds me of the problems we have in IR when we define a "topic-based vertical" to apply domain knowledge in retrieval. From their blog:
Communities consist of concentric circles. You share more with people in the inner circle than you do with people in the outer circles, but if you were in a strange place, you’d seek out people even from the larger circles. If you’re building a community (or a Stack Exchange site), it’s not immediately obvious which level is going to work...
They are developing rules that use the size and degree of overlap between communities to guide the process. It will be interesting to see how this plays out and what lessons we can apply to IR.
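The post does not spell out how size and overlap would be measured, so here is only a rough sketch of one way to quantify it. The choice of Jaccard similarity and the overlap coefficient is my assumption, not Stack Exchange's actual rule, and the function names and member sets are made up for illustration.

```python
def jaccard_overlap(members_a, members_b):
    """Shared members divided by all distinct members of both communities."""
    a, b = set(members_a), set(members_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def overlap_coefficient(members_a, members_b):
    """Shared members divided by the size of the smaller community.

    A value near 1.0 means the smaller community is almost entirely
    contained in the larger one, regardless of how big the larger one is.
    """
    a, b = set(members_a), set(members_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller


# Hypothetical member lists, purely for illustration.
site_one = {"ann", "bo", "cy"}
site_two = {"bo", "cy", "di"}

print(jaccard_overlap(site_one, site_two))
print(overlap_coefficient(site_one, site_two))
```

The two measures answer different questions: Jaccard penalizes a large size difference between the two communities, while the overlap coefficient only asks how much of the smaller community is already inside the larger one.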

Monday, September 13

Google Scribe: Autocomplete beyond queries

Overshadowed by the launch of Google Instant last week, a labs project called Google Scribe was also released. See some information on the help page.

Here is an example of what it did with the initial words, accepting all further suggestions:

Jeff Dalton is a researcher at the University of California at Berkeley...

An amusing example. I'm actually quite surprised it autocompleted "researcher" correctly. However, Scribe got the university wrong. It looks like UC Berkeley wins the web popularity contest.

Overall, Scribe appears to be a straightforward application of web n-gram language models wrapped in an AJAX interface. Some of its mistakes demonstrate the drawbacks of not utilizing long-range word dependencies and topical context. Still, it is an interesting step towards broader autocompletion. I think there may be interesting opportunities to improve effectiveness by leveraging custom language models built from my other documents and web history.
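To make the n-gram point concrete, here is a minimal sketch of greedy bigram autocompletion. It is a toy under my own assumptions, not how Scribe is actually implemented: the corpus, names, and functions are all made up, and a real system would use higher-order n-grams with smoothing over web-scale counts.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for every word, which words follow it in the corpus."""
    followers = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers


def autocomplete(followers, prefix, max_words=5):
    """Repeatedly append the most frequent follower of the last word.

    Only the single previous word is consulted, so anything earlier
    in the prefix has no influence on the suggestion.
    """
    words = prefix.lower().split()
    if not words:
        return ""
    for _ in range(max_words):
        candidates = followers.get(words[-1])
        if not candidates:
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)


# Hypothetical toy corpus: "california" follows "of" more often than "texas".
corpus = [
    "alice is a researcher at the university of california",
    "bob is a researcher at the university of california",
    "carol is a researcher at the university of texas",
]

model = train_bigrams(corpus)
print(autocomplete(model, "carol is a", max_words=6))
```

Even though the prefix starts with "carol", the completion ends in "california" because that is the most frequent word after "of" in the corpus. The name at the start of the sentence is outside the model's one-word window, which is the same kind of popularity-over-context mistake as the university example above.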