Thursday, November 19

TREC 2009 This week

The annual TREC meeting is this week in Maryland. The proceedings won't be available until February, but you can get hints about what is happening (though no eval results) by following #trec09 on Twitter. Keep the news coming, Ian and Iadh!

Evaluating LDA Clustering Output

Yesterday, I mentioned that Mahout has an implementation of LDA, a form of clustering.

Today, there is a post on the LingPipe blog covering a recent paper, Reading Tea Leaves: How Humans Interpret Topic Models. Read the post for an overview of what the authors found when they used Mechanical Turk to evaluate the coherence of topic-document and topic-word clusters.
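One automated angle on the same question is topic coherence: scoring how often a topic's top words actually co-occur in documents. The sketch below is a minimal, toy-corpus illustration of a UMass-style co-occurrence score, not the human-subject protocol the paper uses (the paper's point is precisely that human judgments matter); all names and the corpus here are made up for illustration.

```python
import itertools
import math

# Toy corpus: each document reduced to its set of tokens.
docs = [
    {"apple", "banana", "fruit", "smoothie"},
    {"apple", "fruit", "orchard"},
    {"python", "code", "bug"},
    {"code", "compiler", "bug"},
]

def doc_freq(word):
    """Number of documents containing the word."""
    return sum(word in d for d in docs)

def co_doc_freq(w1, w2):
    """Number of documents containing both words."""
    return sum(w1 in d and w2 in d for d in docs)

def umass_coherence(topic_words):
    """UMass-style coherence: sum log((D(wi, wj) + 1) / D(wj)) over word pairs.
    Higher means the topic's words tend to appear in the same documents."""
    score = 0.0
    for wi, wj in itertools.combinations(topic_words, 2):
        score += math.log((co_doc_freq(wi, wj) + 1) / doc_freq(wj))
    return score

coherent = ["apple", "fruit", "banana"]   # words from one theme
mixed = ["apple", "code", "banana"]       # words from two themes
print(umass_coherence(coherent) > umass_coherence(mixed))  # coherent topic scores higher
```

A score like this is cheap to compute over a whole topic model, which is why it is often used as a proxy when running a Mechanical Turk study isn't practical.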

Microsoft Pivot: A visualization tool for faceted exploration

Yesterday Microsoft Live Labs launched Pivot. Pivot is a desktop application that uses faceted navigation and visualization to explore collections of information. Watch the YouTube demo.

I don't have an invitation for the tech preview, so you'll have to watch the demo for more details.

Wednesday, November 18

Apache Mahout 0.2: MapReduce LDA, Random Forests, and Frequent ItemSet Miner

Grant announced on the Lucid Imagination blog that Mahout 0.2 has been released. Mahout is a library of scalable (distributed) machine learning algorithms built on MapReduce.

Mahout 0.2 has several key new features worth taking a look at, including MapReduce LDA, Random Forests, and a frequent itemset miner.
The release also has many other bug fixes and improvements. Keep up the good work, guys!
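To give a feel for what the frequent itemset miner computes, here is a minimal single-machine Apriori sketch in Python. It is only a stand-in: Mahout's implementation is a parallel, MapReduce-based miner, and the transactions and thresholds below are invented for illustration.

```python
from collections import Counter

# Toy transactions, standing in for the sharded input a distributed miner would see.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]

def frequent_itemsets(transactions, min_support=3, max_size=3):
    """Level-wise Apriori: count candidate itemsets of growing size,
    keeping only those that occur in at least min_support transactions."""
    items = {i for t in transactions for i in t}
    frequent = {}
    current = [frozenset([i]) for i in items]
    for size in range(1, max_size + 1):
        counts = Counter()
        for t in transactions:
            for cand in current:
                if cand <= t:  # candidate is a subset of the transaction
                    counts[cand] += 1
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Candidates for the next level: unions of survivors one item larger.
        current = {a | b for a in survivors for b in survivors
                   if len(a | b) == size + 1}
        if not current:
            break
    return frequent

freq = frequent_itemsets(transactions)
print(sorted((sorted(k), n) for k, n in freq.items()))
```

On this toy data, {bread, milk} and {milk, butter} each appear in three transactions and survive, while {bread, butter} does not. The level-wise pruning is what makes this tractable: an itemset can only be frequent if all of its subsets are.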

Monday, November 16

New Yahoo! Research Demo: Quest for NLP-based Q&A exploration

Hugo Zaragoza let me know that Yahoo! research has a new demo out, Quest. Quest is a faceted navigation interface on Q&A data. It lets you browse using key phrases, nouns, and verbs extracted from a dependency parse of the questions.

For a full description, you can read the announcement on the Y! Sandbox. The demo uses a set of 8 million Q&A documents from Yahoo! Answers collected in 2007. Here's their description of some of the challenges they faced:
The first one is to select the right "lexical units" of the collection in order to produce meaningful browsing suggestions. The next challenge is to develop interesting list suggestions, on the fly, for whatever query the user may submit. Lastly, we had to invent an interface that would allow users to interact with the suggestions and the results, and enable a natural browsing experience.
They used the DeSR dependency parser to extract terms and phrases, and then used a forward index built with Archive4J to count and sort the terms in the questions returned by a query.
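The count-and-sort step can be sketched in a few lines. This is my guess at the basic shape, not Quest's actual pipeline: the questions below are hand-annotated with nouns and verbs, standing in for the output of a dependency parser stored in a forward index.

```python
from collections import Counter

# Toy stand-in: each question arrives pre-annotated with nouns/verbs,
# as if extracted by a dependency parser and kept in a forward index.
questions = [
    {"text": "How do I make pasta salad?", "nouns": ["pasta", "salad"], "verbs": ["make"]},
    {"text": "What sauce goes with pasta?", "nouns": ["sauce", "pasta"], "verbs": ["go"]},
    {"text": "Can I freeze pasta salad?", "nouns": ["pasta", "salad"], "verbs": ["freeze"]},
    {"text": "Best dressing for a salad?", "nouns": ["dressing", "salad"], "verbs": []},
]

def suggest_facets(query_term, questions, top_n=3):
    """Count the nouns/verbs of questions matching the query and return the
    most common ones (excluding the query term itself) as browsing suggestions."""
    hits = [q for q in questions if query_term in q["text"].lower()]
    counts = Counter()
    for q in hits:
        counts.update(w for w in q["nouns"] + q["verbs"] if w != query_term)
    return [w for w, _ in counts.most_common(top_n)], len(hits)

facets, num_hits = suggest_facets("pasta", questions)
print(num_hits, facets)  # "salad" ranks first: it co-occurs with "pasta" twice
```

Raw frequency like this tends to surface general, redundant terms, which matches the behavior described below; a real system would need to downweight terms already implied by the selected filters.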

I tried it for pasta and then filtered to "pasta salad". I was hoping that some of the nouns would include common ingredients: bacon, chicken, olives, onion, pepperoni, mozzarella cheese, etc. However, most of the nouns/verbs were more general and somewhat redundant given my selected filters. I think the term-selection algorithm could still be improved.

Faceted search interfaces are important browsing tools, and automatically extracting and selecting facets is a challenging problem. It's good to see first steps applying NLP to the task. I look forward to seeing how Quest evolves.

Be sure to check out the Correlator demo if you haven't seen it.