Saturday, March 17

Open source collaborative filtering and recommendation systems

Update 3/3/2010: Added Mahout

Yesterday I posted on open source text mining libraries. Today I am looking at recommendation systems, also known as collaborative filtering (CF): mining user behavior and harnessing the "wisdom of the crowds." In a nutshell, recommendation systems discover new items you might be interested in based on your past preferences (such as explicit ratings or implicit click behavior). Their goal is to bring you new and, more importantly, interesting information without you having to search for it.
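
To make the idea concrete, here is a minimal sketch of user-based collaborative filtering in Python. It is not taken from any of the libraries below; the ratings, the user and item names, and the function names are all made up for illustration. It scores items a user has not rated by averaging their neighbors' ratings, weighted by cosine similarity.

```python
from math import sqrt

# Hypothetical explicit ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"dune": 5, "emma": 1, "solaris": 4},
    "bob": {"dune": 4, "solaris": 5, "neuromancer": 5},
    "cat": {"emma": 5, "persuasion": 4, "dune": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    dot = sum(a[i] * b[i] for i in set(a) & set(b))
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted average rating."""
    mine = ratings[user]
    weighted_sum = {}
    weight_total = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(mine, theirs)
        if sim <= 0:
            continue
        for item, rating in theirs.items():
            if item in mine:
                continue  # only suggest items that are new to this user
            weighted_sum[item] = weighted_sum.get(item, 0.0) + sim * rating
            weight_total[item] = weight_total.get(item, 0.0) + sim
    predictions = [(item, weighted_sum[item] / weight_total[item]) for item in weighted_sum]
    return sorted(predictions, key=lambda p: (-p[1], p[0]))
```

Calling recommend("ann", RATINGS) returns a list of (item, predicted rating) pairs, best first. A real system would also need to deal with rating bias, sparsity, and scale, which this sketch ignores.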

Background from Amazon
Since we are talking about recommendations, the first stop is Amazon and the creator of its original system, Greg Linden, author of Geeking with Greg. (I had the opportunity to meet Greg at SIGIR this past summer and we had a great discussion during the poster session.) Greg's "Early Amazon" posts provide fascinating insight into some of Amazon's early days. The Amazon recommendation system started as a side project that he wasn't supposed to be working on. Read the full story, and don't miss his earlier story on his first attempt at a system, BookMatcher.

Current Systems
Recently, a lot of work on distributed recommendation systems has been happening in Apache Mahout, a distributed machine learning library that uses Hadoop. The Taste recommender was incorporated into it. The first version originally started as work on the Netflix contest (via Greg). The Mahout library has support for KNN, SVD, and Frequent Pattern Mining using Parallel FP-Growth. Some of the recommendation algorithms are more mature than others, so expect to get your hands dirty getting some of them to work. Despite its lack of maturity, this would be my first stop if I were building a system today.

A simple content-based recommender could be built using a search system: take an object, convert it into a query, and treat the top-ranked results as recommendations. See the open-source search engines.
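
As a rough sketch of that idea, the Python below turns an item's description into a bag-of-terms query and ranks the other items by term overlap. The catalog, the stopword list, and the function names are hypothetical, and the toy scorer merely stands in for a real search engine, which would use an inverted index and a proper ranking function.

```python
import re
from collections import Counter

# Hypothetical stopword list and item catalog, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for", "with"}
CATALOG = {
    "mahout": "distributed machine learning on hadoop clusters",
    "hadoop": "hadoop distributed storage and map reduce processing for clusters",
    "wordnet": "lexical database of english words and relationships",
}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def item_to_query(text, max_terms=5):
    """Convert an item's text into a query made of its most frequent terms."""
    counts = Counter(tokenize(text))
    return [term for term, _ in counts.most_common(max_terms)]


def search(query_terms, catalog, exclude=None):
    """Toy stand-in for a search engine: score items by query-term frequency."""
    results = []
    for item_id, text in catalog.items():
        if item_id == exclude:
            continue
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            results.append((item_id, score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


def recommend_similar(item_id, catalog):
    """Recommend items whose text matches a query built from item_id."""
    query = item_to_query(catalog[item_id])
    return search(query, catalog, exclude=item_id)
```

Here recommend_similar("mahout", CATALOG) ranks the "hadoop" entry first because the two descriptions share several terms, while "wordnet" shares none and is dropped.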

Other Related Work
Another specialist in this area is Daniel Lemire, a researcher at the University of Quebec in Montreal. He wrote this paper on a simple and effective recommendation engine using SQL and PHP; the code is available on the site. There is a related PHP project, Vogoo, which appears to be actively maintained. Daniel also wrote a Java version of the item-based recommender engine, Cofi.
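
For a feel of how simple an item-based predictor can be, here is a sketch of weighted Slope One in Python. This is my assumption about the style of algorithm involved: the post does not name the paper's method, and this is not the SQL/PHP code from the site. The ratings and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical ratings: user -> {item: rating}.
RATINGS = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}


def build_deviations(ratings):
    """Average rating difference dev[i][j] = mean(r_i - r_j) over co-raters."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    sums[i][j] += r_i - r_j
                    counts[i][j] += 1
    devs = {i: {j: sums[i][j] / counts[i][j] for j in sums[i]} for i in sums}
    freq = {i: dict(counts[i]) for i in counts}
    return devs, freq


def predict(user_ratings, item, devs, freq):
    """Weighted Slope One: average of (r_j + dev[item][j]), weighted by co-rater count."""
    numerator = 0.0
    denominator = 0
    for j, r_j in user_ratings.items():
        n = freq.get(item, {}).get(j, 0)
        if j == item or n == 0:
            continue
        numerator += (r_j + devs[item][j]) * n
        denominator += n
    if denominator == 0:
        return None  # no overlap with other raters, so no prediction
    return numerator / denominator
```

With these ratings, predicting lucy's rating for item A combines her ratings of B and C with the average differences A-B and A-C, giving more weight to the pair that more users have rated together.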

CoFE (Collaborative Filtering Engine) is another open source Java-based engine, created by Jon Herlocker from Oregon State University, but I don't believe it is being maintained; it looks like it hasn't been updated since 2004.

Ray Mooney at the University of Texas has also been working on recommendation research, although his main specialties are information extraction and machine learning. Here are some of his department's publications and, more specifically, some introductory-level slides from a recent course he taught on Information Retrieval.

That pretty much covers recommender systems for today. You can always check the Wikipedia article on Collaborative Filtering (CF) for updates. Again, many of these systems use machine learning and classification, which fits nicely with my previous post on text mining.

Friday, March 16

Java Open Source NLP and Text Mining tools

See my related post on Open-Source Search Engine Libraries.

Here are some of the open source NLP and machine learning tools for text mining, information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more. I've tried to roughly group the tools. However, the categories are quite loose and many of the tools fit into multiple categories.

Machine learning and data mining
Weka - a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO). [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch.] A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.
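
If you have not seen Naive Bayes used for text classification, the sketch below shows the core of a multinomial version with add-one smoothing. It is written in plain Python for illustration and does not use or mirror Weka's API; the class name and the tiny training set are made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # label -> number of training docs
        self.word_counts = defaultdict(Counter)  # label -> term frequencies
        self.vocab = set()

    def train(self, tokens, label):
        self.class_docs[label] += 1
        self.word_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # Work in log space to avoid underflow on long documents.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore terms never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data.
clf = NaiveBayesText()
clf.train("cheap pills buy now".split(), "spam")
clf.train("buy cheap watches".split(), "spam")
clf.train("meeting agenda for monday".split(), "ham")
clf.train("lunch on monday".split(), "ham")
```

After training, clf.classify("buy cheap pills".split()) picks the label whose word distribution best explains the tokens. A toolkit such as Weka adds feature selection, evaluation, and many more algorithms on top of this basic idea.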

Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.

NLP Tools
LingPipe - (not technically 'open-source', see below) Alias-I's LingPipe is a suite of Java tools for linguistic processing of text, including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is one of the most mature and widely used NLP toolkits in industry. It is known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. They have a list of links to competing tools, both academic and industrial. Be sure to check out their blog. LingPipe is released under a royalty-free commercial license that includes the source code, but it's not technically 'open-source'.

OpenNLP - hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.

Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL.

OpenFST - A package for manipulating weighted finite state automata. These are often used to represent a probabilistic model. They are used to model text for speech recognition, OCR error correction, machine translation, and a variety of other tasks. The library was developed by contributors from Google Research and NYU. It is a C++ library that is meant to be fast and scalable.
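
To illustrate how a weighted automaton can represent a probabilistic model, here is a small Python sketch of a weighted acceptor that assigns a probability to a sequence of symbols. It is not OpenFST's API (that library is C++); the class, its methods, and the tiny example model are hypothetical, and weights are plain probabilities that are multiplied along a path and summed across alternative paths.

```python
from collections import defaultdict


class WeightedAcceptor:
    """Toy weighted finite state acceptor over the probability semiring."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = finals            # state -> final weight
        self.arcs = defaultdict(list)   # (state, symbol) -> [(next_state, weight)]

    def add_arc(self, src, symbol, dst, weight):
        self.arcs[(src, symbol)].append((dst, weight))

    def weight(self, symbols):
        """Total weight of all accepting paths that read the given symbols."""
        current = {self.start: 1.0}
        for symbol in symbols:
            following = defaultdict(float)
            for state, path_weight in current.items():
                for dst, arc_weight in self.arcs.get((state, symbol), []):
                    following[dst] += path_weight * arc_weight
            current = following
        return sum(w * self.finals.get(state, 0.0) for state, w in current.items())


# Hypothetical two-word model: a determiner followed by a noun.
model = WeightedAcceptor(start=0, finals={2: 1.0})
model.add_arc(0, "the", 1, 0.6)
model.add_arc(0, "a", 1, 0.4)
model.add_arc(1, "cat", 2, 0.5)
model.add_arc(1, "dog", 2, 0.5)
```

With this model, model.weight(["the", "cat"]) multiplies the two arc weights along the only matching path, while a sequence the automaton cannot read, such as ["cat"], gets weight zero. Real toolkits add operations like composition, determinization, and shortest path on top of this representation.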

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.

Dan Bikel's Multilingual Statistical Parser - A parallel statistical parsing engine for English, Arabic, Chinese, and soon Korean.

Question Answering
OpenEphyra - a state-of-the-art open framework for Question Answering. It is a full-featured, end-to-end system for QA written in Java and developed at CMU's LTI. It is released under the GNU GPL license.

Information Extraction
Mallet - Mallet is a collection of tools in Java for statistical NLP: text classification, clustering, and IE. It was created by Andrew McCallum's information extraction lab at UMass. (Bow and Rainbow are precursors written in C while he was at CMU.) Mallet is one of the leading academic tools for text classification, topic modeling, and sequential tagging using Conditional Random Fields (CRFs).

MinorThird - Another toolkit for text classification and entity extraction, by William Cohen at CMU. It has some notable differences from the other toolkits mentioned; see the page for details. (I'm not as familiar with this one, so I'm taking his word for it.)

GATE - one of the leading toolkits for text mining and information extraction. It has a nice GUI. One of the components it is distributed with is ANNIE, which stands for "A Nearly-New IE system." It is maintained by the NLP group at the University of Sheffield.

Wordnet Interfaces
Wordnet is a lexical database of English terms and their relationships to one another, developed at Princeton. It is often used as an external knowledge resource in retrieval experiments, although Wikipedia is becoming a more popular external resource because it is more comprehensive and up-to-date.

Java Wordnet Library (JWNL) - A Java library for accessing Wordnet. It is one of the more popular libraries, used by OpenEphyra and other systems.

MIT Java Wordnet Interface (JWI) - A Java interface for accessing Wordnet versions 1.6 to 3.0. The latest release is 2.5.1, released in Dec. 2008.

String Similarity
SecondString - A collection of approximate string matching tools (for those record linkage problems). It also has an implementation of the Jaro-Winkler string distance metric. It was written by William Cohen from CMU.
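
For reference, here is a compact Python sketch of the Jaro-Winkler measure as it is commonly defined. It is not SecondString's code, and the parameter names and defaults (a prefix scale of 0.1 and a maximum prefix of 4) are the conventional choices rather than anything taken from that library.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]: matching characters within a sliding window."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)
```

The prefix boost is what makes the measure popular for names in record linkage: two strings that differ only by a late transposition, such as "MARTHA" and "MARHTA", score higher than the plain Jaro similarity alone would give them.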

Simmetrics - Another string similarity package. This is maintained by Sheffield University (the makers of the aforementioned GATE IE package).

Lingpipe (mentioned above) also contains string similarity tools.

(updated 1/21/2008, 11/6/2008, 12/18/2008. This post is expanding beyond 'mining' to include other NLP tools)

Hopefully you found this list helpful; putting it together was a useful way to organize my bookmarks.