See my related post on Open-Source Search Engine Libraries.
Here are some of the open source NLP and machine learning tools for text mining, information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more. I've tried to roughly group the tools. However, the categories are quite loose and many of the tools fit into multiple categories.
Machine learning and data mining
Weka - a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO) [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.
Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.
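To make the Naive Bayes mention above a bit more concrete, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is plain Python written for illustration only; it is not Weka's or Mahout's API, and the class name, method names, and toy training data are all made up for this example.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesSketch:
    """Toy multinomial Naive Bayes (hypothetical class, not a library API)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # additive (Laplace) smoothing
        self.doc_counts = Counter()              # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()

    def train(self, docs):
        """docs is an iterable of (text, label) pairs."""
        for text, label in docs:
            self.doc_counts[label] += 1
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        """Return the label with the highest log posterior score."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data, just to exercise the sketch.
toy_docs = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = NaiveBayesSketch()
model.train(toy_docs)
print(model.predict("cheap watches now"))  # spam
print(model.predict("monday meeting"))     # ham
```

Whitespace tokenization and add-one smoothing are the simplest possible choices here; the toolkits listed in this post offer far better tokenizers, feature selection, and evaluation tools.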
NLP Tools
LingPipe - (not technically 'open-source'; see below) Alias-i's LingPipe is a suite of Java tools for linguistic processing of text, including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is one of the most mature and widely used NLP toolkits in industry, known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. They have a list of links to the competition, both academic and industrial tools. Be sure to check out their blog. LingPipe is released under a royalty-free commercial license that includes the source code, but it's not technically 'open-source'.
OpenNLP - hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.
Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL license.
NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.
Dan Bikel's Multilingual Statistical Parser - A parallel statistical parsing engine for English, Arabic, Chinese, and soon Korean.
Question Answering
OpenEphyra - a state-of-the-art open framework for Question Answering. It is a full-featured, end-to-end system for QA written in Java and developed at CMU's LTI. It is released under the GNU GPL license.
Information Extraction
Mallet - a collection of tools in Java for statistical NLP: text classification, clustering, and IE. It was created by Andrew McCallum's information extraction lab at UMass. (Bow and Rainbow are precursors written in C while he was at CMU.) Mallet is one of the leading academic tools for text classification, topic modeling, and sequential tagging using Conditional Random Fields (CRFs).
MinorThird - Another toolkit for text classification and entity extraction, by William Cohen at CMU. It has some notable differences from the other toolkits mentioned; see the page for details. (I'm not as familiar with this one, so I'm taking his word for it.)
GATE - one of the leading toolkits for text mining and information extraction. It has a nice GUI. One of the components it is distributed with is ANNIE, which stands for "A Nearly-New IE system." It is maintained by the NLP group at the University of Sheffield.
WordNet Interfaces
WordNet is a lexical database of English terms and their relationships to one another, developed at Princeton. It is often used as an external knowledge resource in retrieval experiments, although Wikipedia is becoming a more popular external resource because it is more comprehensive and up-to-date.
Java WordNet Library (JWNL) - A Java library for accessing WordNet. It is one of the more popular libraries, used by OpenEphyra and other systems.
MIT Java WordNet Interface (JWI) - A Java interface for accessing WordNet versions 1.6 to 3.0. The latest release is 2.5.1, released in Dec. 2008.
String Similarity
SecondString - A collection of approximate string matching tools (for those record linkage problems). It also has an implementation of the Jaro-Winkler string distance metric. It is written by William Cohen from CMU.
Simmetrics - Another string similarity package. This is maintained by Sheffield University (the makers of the aforementioned GATE IE package).
LingPipe (mentioned above) also contains string similarity tools.
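For readers who have not met the Jaro-Winkler metric mentioned under SecondString, here is a minimal sketch in plain Python. It follows the textbook definition and is not taken from SecondString, Simmetrics, or LingPipe; the function names and the default parameters (a prefix scale of 0.1 and a maximum prefix length of 4, the values commonly quoted for the metric) are assumptions for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two equal characters only count as a match if their positions
    # are at most `window` apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are
    # counted as half-transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))   # 0.84
```

The prefix boost is what makes the metric popular for record linkage on names, where typos tend to occur later in the string. Note that this returns a similarity (higher is closer), so a distance would be one minus the score.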
(updated 1/21/2008, 11/6/2008, 12/18/2008. This post is expanding beyond 'mining' to include other NLP tools)
Hopefully you found this list helpful; putting it together was a useful way to organize my bookmarks.
Friday, March 16

16 comments:
Hi,
It is a useful summary, thanks. I studied GATE for a month or so to get familiar with its functionality. But in the end I decided to write my own code and use command-line tools or other people's snippets, because even if it is not working, at least it is your own code and you know how to modify it. I found understanding the GATE classes more difficult than starting from scratch. I actually need a transducer, so I tried JAPE and a few other transducers that come with GATE, but they all had some weaknesses; for example, JAPE does not support empty tags. Anyway, I finally decided to make my own framework and add external parts to it rather than relying on GATE.
ajoorabchi@hotmail.com
Great work Jeff. I found it really useful. Thanks!
It is a really useful summary of text mining tools. Many thanks!
Great work! Thank you very much!
Great list of resources Jeff!
UIUC's Cognitive Computation Group also has a suite of Java-based NLP tools:
http://l2r.cs.uiuc.edu/~cogcomp/software.php
Hi Jeff,
I found this really useful. Great work!
Many thanks...
R
very nice man ...
pretty useful.
thanks.
Great listing, Jeff!
For sequential learning (useful for NLP tasks such as chunking, POS tagging or information extraction) two tools in C++ developed by Taku Kudo are highly recommended: YamCha (http://chasen.org/~taku/software/yamcha/) and CRF++ (http://crfpp.sourceforge.net/). Both have similar interfaces, are open source, have a small footprint, and have pretty good documentation.
Thanks for including LingPipe.
While we do distribute LingPipe with source on the web without registration, we don't have a standard "open source" license. Here's a link to LingPipe's Royalty-free license.
Like MySQL, we own our IP and can negotiate licenses on a per-customer basis.
Hello, I am working on "Text mining and its potential applications" for my M.Phil degree. Please tell me how to develop an information extraction system. I want to extract unknown information from textual matter. (For example, if C is a language, the system must be able to decide that Java is also a programming language.)
Please help me.
Mr. U.S. Patki
Dept of Computer Science
Science College Nanded (M.S.)
India.
E-Mail uspatki@yahoo.com
Excellent overview, Jeff. Thanks for collecting this insight. I realize the point here is to cover open-source packages, but web services that offer NLP functionality might also be included in the list, especially if they're free.
We've had success using OpenCalais (http://opencalais.com), which is a commercial service but one with a pretty generous free offering, for entity extraction and document classification. It accepts text in a variety of formats (e.g. HTML) and transfer methods (e.g. http/REST), returning the results of several NLP processes in XML, JSON, RDF, etc. Their free service makes it easy for early-stage startups to get good results quickly.
Thanks again,
Mark Soper, Likematter
Hi Jeff,
Excellent overview, I found this really useful.
Thanks
Ajit
Thanks Jeff!
I just wanna say that Dan's last name is BikeL ;-)
Thanks Jeff, this is absolutely great :)
Thanx man..
Am a student doin graduation from Cochin University,India. Can u help me to create a project based on web data mining using support vector machines..
Nimish