Friday, March 16

Java Open Source Text Mining and Information Extraction Tools

Here are some of the open source tools for text mining: information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more.

Weka - a collection of machine learning algorithms for data mining. It is probably the most widely used text classification framework. It implements a wide variety of algorithms, including Naive Bayes and SVM (listed under SMO) [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch]. A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.

Mallet - a collection of tools in Java for statistical NLP, text classification, clustering and IE, created by Andrew McCallum's group at UMass. (Note that Bow and Rainbow are precursors written in C while he was at CMU. Bow is fast and contains implementations of Naive Bayes, k-nearest neighbor, TFIDF, and probabilistic indexing.)
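To make the TFIDF item above concrete, here is a minimal sketch of one common TF-IDF weighting (term frequency divided by document length, times the natural log of N over document frequency). It is not the Bow or Mallet implementation; the class name TfIdfSketch and the method tfidf are made up for illustration, and documents are assumed to be tokenized already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Bow, Rainbow, or Mallet.
public class TfIdfSketch {

    // Returns one term -> weight map per document.
    // Assumed weighting: tf = count / document length, idf = ln(N / df).
    public static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : docs) {
            // Raw term counts within this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> docWeights = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = e.getValue() / (double) doc.size();
                double idf = Math.log(docs.size() / (double) df.get(e.getKey()));
                docWeights.put(e.getKey(), tf * idf);
            }
            weights.add(docWeights);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("text", "mining", "tools"),
                Arrays.asList("text", "classification"));
        // "text" occurs in every document, so its weight is 0 under this scheme.
        System.out.println(tfidf(docs));
    }
}
```

Terms that appear in every document get zero weight under this variant; real toolkits often smooth the idf term or normalize the resulting vectors, so treat the numbers as illustrative only.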

LingPipe - Alias-i's LingPipe is a Java toolkit for information extraction and data mining (entity extraction, part-of-speech tagging, clustering, classification, etc.), not to mention string similarity. It is one of the most mature and widely used open source IE toolkits in industry. I recently noticed an informative post on their blog about Jaro-Winkler string comparison (developed at the U.S. Census Bureau, it is also useful for related "database linkage" problems). They also have a good list of links to competing tools, both academic and industrial.

GATE - one of the leading toolkits for text mining and information extraction. It has a nice GUI. One of the components it is distributed with is ANNIE, which stands for "A Nearly-New IE system." It is maintained by the NLP group at the University of Sheffield.

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. Unlike the other tools listed here, it is written in Python rather than Java. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.

UPDATED 1/21/2008 (consolidated tools I missed from another post)

OpenNLP - hosts a variety of Java-based NLP tools which perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.

Carrot2 - Open source search result clustering software in Java. It is designed for Lucene and works as an add-on for Nutch. There is a commercial version called Lingo 3G.

Text-Mining.org - Not a tool, but a portal for news and information in the text mining community.

String Similarity
SecondString - A collection of approximate string matching tools (for those record linkage problems). It also has an implementation of the Jaro-Winkler string distance metric. It was written by William Cohen from CMU.
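For readers who have not met Jaro-Winkler before, here is a minimal sketch of the usual textbook formulation. It is not SecondString's or LingPipe's code; the class name JaroWinklerSketch is made up for illustration, and the prefix scale of 0.1 and the prefix cap of 4 characters are the conventional defaults, assumed here rather than taken from either library.

```java
// Hypothetical illustration class; not taken from SecondString or LingPipe.
public class JaroWinklerSketch {

    // Jaro similarity in [0, 1]: based on characters that match within a
    // sliding window, discounted for matched characters that are out of order.
    public static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        if (s.isEmpty() || t.isEmpty()) return 0.0;

        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;

        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Walk both strings' matched characters in order and count mismatches;
        // half of that count is the number of transpositions.
        int k = 0;
        int outOfOrder = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }
        int transpositions = outOfOrder / 2;

        double m = matches;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    // Winkler's adjustment: boost the Jaro score for a shared prefix
    // (at most 4 characters, scale 0.1; both are assumed defaults).
    public static double jaroWinkler(String s, String t) {
        double j = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        System.out.println(jaroWinkler("MARTHA", "MARHTA")); // roughly 0.961
        System.out.println(jaroWinkler("DIXON", "DICKSONX")); // roughly 0.813
    }
}
```

The prefix boost is what makes the metric a good fit for names in record linkage, where typos tend to occur later in the string. For real work, the tested implementations in the packages listed here are a better choice than this sketch.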

MinorThird - Another toolkit for text classification and entity extraction, by William Cohen at CMU. It has some notable differences from the other toolkits mentioned; see the page for details (I'm not as familiar with this one, so I'm taking his word for it).

Simmetrics - Another string similarity package. This is maintained by Sheffield University (the makers of the aforementioned GATE IE package).

The University of Sheffield, UMass Amherst, and CMU have active programs contributing Java toolkits in this area. Hopefully you found this list helpful; putting it together was a useful way to organize my bookmarks. I hope to write about some of these tools in more detail in future posts.

3 comments:

Arash said...

Hi,

It is a useful summary, thanks. I studied GATE for a month or so to get familiar with its functionality. But in the end I decided to write my own code and use command-line tools or other people's snippets, because if it is not working, at least it is your own code and you know how to modify it. I found understanding the GATE classes more difficult than starting from scratch. I actually need a transducer, so I tried JAPE and a few other transducers that come with GATE, but they all had some weaknesses; for example, JAPE does not support empty tags. Anyway, I finally decided to make my own framework and add external parts to it rather than relying on GATE.

ajoorabchi@hotmail.com

Sethu said...

Great work, Jeff. I found it really useful. Thanks!

Biao said...

It is a really useful summary of text mining tools. Many thanks!