Thursday, November 19
TREC 2009 This Week
The annual TREC meeting is this week in Maryland. The proceedings won't be available until February, but you can get hints about what is happening (but no eval results) by following #trec09 on Twitter. Some highlights:
- The ClueWeb09 Wiki has been updated with the webgraph, redirects, and language identification information.
- A Lucene BM25F implementation.
- Jimmy Lin presented the Ivory MapReduce search system.
Evaluating LDA Clustering Output
Yesterday, I mentioned that Mahout has an implementation of LDA, a form of clustering.
Today, there is a post on the LingPipe blog covering a recent paper, Reading Tea Leaves: How Humans Interpret Topic Models. Read the post for an overview of what the authors found when they used Mechanical Turk to evaluate the coherence of topic-document and topic-word clusters.
Microsoft Pivot: A visualization tool for faceted exploration
Yesterday Microsoft Live Labs launched Pivot. Pivot is a desktop application for faceted navigation and visualization to explore collections of information. Watch the YouTube demo.
I don't have an invitation for the tech preview, so you'll have to watch the demo for more details.
Wednesday, November 18
Apache Mahout 0.2: MapReduce LDA, Random Forests, and Frequent ItemSet Miner
Grant announced on the Lucid Imagination blog that Mahout 0.2 is released. Mahout is a library of scalable (distributed) machine learning algorithms using MapReduce.
Mahout 0.2 has several key new features that are worth taking a look at:
- Latent Dirichlet Allocation (LDA) (JIRA information) - LDA is a form of Bayesian topic modeling, a type of clustering that discovers hidden "topics" in a collection of documents. See the original LDA paper by Blei et al. and the Mallet LDA implementation from UMass (not MapReduce).
- K Nearest Neighbor (KNN) and Singular Value Decomposition (SVD) based recommender (JIRA) - The implementation is based on a paper by the Netflix Prize winning team: Scalable Collaborative Filtering with Jointly Derived Neighborhood Interpolation Weights.
- Random Forests (JIRA) - An ensemble classifier. See Breiman's description for background on Random Forests.
- Frequent Itemset Pattern Miner - (JIRA) - An algorithm that analyzes co-occurrence of items in a basket to suggest new items. This is an implementation of the Parallel FP Growth algorithm.
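To make the basket idea in the last item concrete, here is a minimal Python sketch. It is not Mahout's code and it is not Parallel FP Growth: it is a naive in-memory pair counter, and the baskets, function names, and support threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Count how often each unordered item pair shares a basket.

    Naive pair counting for illustration only, not Parallel FP Growth.
    """
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    # Keep only the pairs seen in at least `min_support` baskets.
    return {pair: n for pair, n in counts.items() if n >= min_support}


def suggest(item, pairs):
    """Rank items that co-occur with `item`, most frequent first."""
    scored = []
    for (a, b), n in pairs.items():
        if item == a:
            scored.append((b, n))
        elif item == b:
            scored.append((a, n))
    return sorted(scored, key=lambda s: (-s[1], s[0]))


# Hypothetical toy baskets, invented for this example.
baskets = [
    ["pasta", "olives", "onion"],
    ["pasta", "olives", "mozzarella"],
    ["pasta", "chicken"],
    ["olives", "onion"],
]
pairs = frequent_pairs(baskets, min_support=2)
print(suggest("pasta", pairs))
```

Counting every pair takes time quadratic in basket size, which is the sort of cost FP-Growth style algorithms are meant to avoid on large data.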
Monday, November 16
New Yahoo! Research Demo: Quest for NLP based Q&A exploration
Hugo Zaragoza let me know that Yahoo! Research has a new demo out, Quest. Quest is a faceted navigation interface on Q&A data. It lets you browse using key phrases, nouns, and verbs extracted from a dependency parse of the questions.
For their description, you can read the announcement on the Y! Sandbox. The demo uses a set of 8 million Q&A documents from Yahoo! Answers collected in 2007. Here's their description of some of the challenges they faced:
The first one is to select the right "lexical units" of the collection in order to produce meaningful browsing suggestions. The next challenge is to develop interesting list suggestions, on the fly, for whatever query the user may submit. Lastly, we had to invent an interface that would allow users to interact with the suggestions and the results, and enable a natural browsing experience.
They used the DeSR dependency parser to extract terms and phrases, and then used a forward index with Archive4J to count and sort the terms in the questions returned by a query.
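As a rough illustration of the count-and-sort step, here is a minimal Python sketch of the general idea. It does not use Archive4J or DeSR, and I don't know how Quest implements this internally; the forward index contents, the search stand-in, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical forward index: doc id -> terms extracted from that question.
forward_index = {
    1: ["pasta", "salad", "make"],
    2: ["pasta", "sauce", "cook"],
    3: ["salad", "dressing", "make"],
    4: ["bread", "bake"],
}


def search(query):
    """Stand-in for retrieval: ids of docs whose term list contains the query."""
    return [doc_id for doc_id, terms in forward_index.items() if query in terms]


def facet_terms(query, top_k=5):
    """Count the terms of the matching docs and return the most frequent."""
    counts = Counter()
    for doc_id in search(query):
        # Count each term once per doc and skip the query term itself.
        counts.update(set(forward_index[doc_id]) - {query})
    # Sort by descending count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


print(facet_terms("make"))
```

The forward index is what makes this cheap at query time: the terms of each returned question are read directly by document id instead of being re-extracted.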
I tried it for pasta and then filtered to "pasta salad". I was hoping that some of the nouns would include common ingredients: bacon, chicken, olives, onion, pepperoni, mozzarella cheese, etc. However, most of the nouns/verbs are more general and somewhat redundant given my selected filters. I think the algorithm to select the terms could still be improved.
Faceted search interfaces are important browsing tools, and automatically extracting and selecting facets is a challenging problem. It's good to see first steps applying NLP to the task. I look forward to seeing how Quest evolves.
Be sure to check out the Correlator demo if you haven't seen it.
Thursday, November 12
Machine Learning Talk: Lee Spector on Genetic Programming; applications to Learning Ranking Functions
Today at the Yahoo! sponsored machine learning lunch, Lee Spector presented his work on genetic programming. His talk, Expressive Languages For Evolved Programs, highlighted his work using the Push programming language to solve interesting and hard real-world problems.
He pointed to two key principles that these systems need to have to learn solutions, based on observations from biology:
- Meaningful variation - Variations can't just be random; the mutations and selections have to produce meaningful effects in the domain.
- Heritability - Children need the ability to inherit desirable features from the parent without being clones.
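The two principles above can be illustrated with a toy example. Note that this is a plain genetic algorithm over bit strings, not genetic programming and not Push, and it is not from the talk; the target, rates, and population sizes are arbitrary assumptions.

```python
import random

TARGET = [1] * 20  # hypothetical goal: a genome of all ones


def fitness(genome):
    """Number of positions that match the target."""
    return sum(1 for g, t in zip(genome, TARGET) if g == t)


def crossover(mom, dad, rng):
    """Heritability: every gene comes from a parent, but the child is a mix."""
    return [rng.choice(pair) for pair in zip(mom, dad)]


def mutate(genome, rng, rate=0.05):
    """Variation: flip a few bits; each flip moves fitness by exactly one."""
    return [1 - g if rng.random() < rate else g for g in genome]


def evolve(generations=60, pop_size=30, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the fitter half survives unchanged.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # Refill the population with mutated crossovers of the survivors.
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=fitness)


best = evolve()
print(fitness(best), "of", len(TARGET))
```

Here a bit flip is a meaningful variation because it changes fitness in a small, predictable way. In real GP the hard part is designing a representation where mutations and crossovers of programs behave that well.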
I think there's still interesting work combining GP with IR. For example, one problem is that collections and users evolve over time, but most ranking functions are static.
Monday, November 2
New York Times Releases Subject Headings Data
Evan Sandhaus announced in a NYTimes Open blog post that they are opening up the NYT subject headings. Today they are announcing the release of the first batch of 5,000 headings.
Over the last several months we have manually mapped more than 5,000 person name subject headings onto Freebase and DBPedia. And today we are pleased to announce the launch of http://data.nytimes.com and the release of these 5,000 person name subject headings as Linked Open Data.
Over the next few months they plan to expand this to over 30,000 tags.
Also, check out the NYT Article Search API.
