Friday, March 16

Java Open Source NLP and Text Mining tools

See my related post on Open-Source Search Engine Libraries.

Here are some of the open source NLP and machine learning tools for text mining, information extraction, text classification, clustering, approximate string matching, language parsing and tagging, and more. I've tried to roughly group the tools. However, the categories are quite loose and many of the tools fit into multiple categories.

Machine learning and data mining
Weka - a collection of machine learning algorithms for data mining. It is one of the most popular text classification frameworks. It contains implementations of a wide variety of algorithms, including Naive Bayes and Support Vector Machines (SVM, listed under SMO). [Note: Other commonly used non-Java SVM implementations are SVM-Light, LibSVM, and SVMTorch.] A related project is Kea (Keyphrase Extraction Algorithm), an algorithm for extracting keyphrases from text documents.

Apache Lucene Mahout - An incubator project to create highly scalable distributed implementations of common machine learning algorithms on top of the Hadoop map-reduce framework.

NLP Tools
LingPipe - (not technically 'open-source'; see below) Alias-I's LingPipe is a suite of Java tools for linguistic processing of text, including entity extraction, part-of-speech (POS) tagging, clustering, classification, etc. It is one of the most mature and widely used NLP toolkits in industry. It is known for its speed, stability, and scalability. One of its best features is the extensive collection of well-written tutorials to help you get started. They also maintain a list of links to competing tools, both academic and industrial. Be sure to check out their blog. LingPipe is released under a royalty-free commercial license that includes the source code, but it's not technically 'open-source'.

OpenNLP - hosts a variety of Java-based NLP tools that perform sentence detection, tokenization, part-of-speech tagging, chunking and parsing, named-entity detection, and co-reference analysis using the Maxent machine learning package.

Stanford Parser and Part-of-Speech (POS) Tagger - Java packages for sentence parsing and part-of-speech tagging from the Stanford NLP group. It has implementations of probabilistic natural language parsers, both highly optimized PCFG and lexicalized dependency parsers, and a lexicalized PCFG parser. It is released under the full GNU GPL license.

OpenFST - A package for manipulating weighted finite state automata. These are often used to represent a probabilistic model. They are used to model text for speech recognition, OCR error correction, machine translation, and a variety of other tasks. The library was developed by contributors from Google Research and NYU. It is a C++ library that is meant to be fast and scalable.
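
To make the idea of a weighted automaton representing a probabilistic model a bit more concrete, here is a minimal Java sketch. It is not OpenFST's API (OpenFST is C++); the class WeightedAcceptor and its methods are hypothetical names made up for illustration. It assumes a deterministic acceptor and stores each probability as a negative log, so the cost of a path is simply the sum of its arc weights.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a tiny deterministic weighted acceptor.
// Weights are negative log probabilities, so costs add along a path.
final class WeightedAcceptor {
    private static final class Arc {
        final int next;
        final double weight;

        Arc(int next, double weight) {
            this.next = next;
            this.weight = weight;
        }
    }

    private final Map<Integer, Map<Character, Arc>> arcs = new HashMap<>();
    private final Map<Integer, Double> finalWeights = new HashMap<>();
    private final int start;

    public WeightedAcceptor(int start) {
        this.start = start;
    }

    // Adds an arc labeled with one symbol and the probability of taking it.
    public void addArc(int from, char symbol, int to, double probability) {
        arcs.computeIfAbsent(from, k -> new HashMap<>())
            .put(symbol, new Arc(to, -Math.log(probability)));
    }

    // Marks a state as final, with the probability of stopping there.
    public void setFinal(int state, double probability) {
        finalWeights.put(state, -Math.log(probability));
    }

    // Returns the total cost (negative log probability) of the input,
    // or positive infinity if the automaton does not accept it.
    public double cost(String input) {
        int state = start;
        double total = 0.0;
        for (char c : input.toCharArray()) {
            Map<Character, Arc> out = arcs.get(state);
            Arc arc = (out == null) ? null : out.get(c);
            if (arc == null) {
                return Double.POSITIVE_INFINITY;
            }
            total += arc.weight;
            state = arc.next;
        }
        Double stop = finalWeights.get(state);
        return (stop == null) ? Double.POSITIVE_INFINITY : total + stop;
    }
}
```

Calling Math.exp(-acceptor.cost(text)) turns the cost back into a probability. A real toolkit adds much more on top of this, such as non-determinism, transducers with output labels, and operations like composition and minimization.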

NLTK - The Natural Language Toolkit is a tool for teaching and researching classification, clustering, part-of-speech tagging and parsing, and more. It contains a set of tutorials and data sets for experimentation. It is written by Steven Bird, from the University of Melbourne.

Dan Bikel's Multilingual Statistical Parser - A parallel statistical parsing engine for English, Arabic, Chinese, and soon Korean.

Question Answering
OpenEphyra - a state-of-the-art open framework for Question Answering. It is a full-featured, end-to-end system for QA written in Java and developed at CMU's LTI. It is released under the GNU GPL license.

Information Extraction
Mallet - a collection of tools in Java for statistical NLP: text classification, clustering, and IE. It was created by Andrew McCallum's information extraction lab at UMass. (Bow and Rainbow are precursors written in C while he was at CMU.) Mallet is one of the leading academic tools for text classification, topic modeling, and sequential tagging using Conditional Random Fields (CRFs).

MinorThird - Another toolkit for text classification and entity extraction, by William Cohen at CMU. It has some notable differences from the other toolkits mentioned; see the page for details. (I'm not as familiar with this one, so I'm taking his word for it.)

GATE - one of the leading toolkits for text mining and information extraction. It has a nice GUI. One of the components it is distributed with is ANNIE, which stands for "A Nearly-New IE system." It is maintained by the NLP group at the University of Sheffield.

Wordnet Interfaces
Wordnet is a lexical database of English terms and their relationships to one another, developed at Princeton. It is often used as an external knowledge resource in retrieval experiments, although Wikipedia is becoming a more popular external resource because it is more comprehensive and up-to-date.

Java Wordnet Library (JWNL) - A Java library for accessing Wordnet. It is one of the more popular libraries, used by OpenEphyra and other systems.

MIT Java Wordnet Interface (JWI) - A Java interface for accessing Wordnet versions 1.6 to 3.0. The latest release, 2.5.1, came out in Dec. 2008.

String Similarity
SecondString - A collection of approximate string matching tools (for those record linkage problems). It also has an implementation of the Jaro-Winkler string distance metric. It was written by William Cohen at CMU.

Simmetrics - Another string similarity package. This is maintained by Sheffield University (the makers of the aforementioned GATE IE package).

LingPipe (mentioned above) also contains string similarity tools.
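
For a sense of what a metric like Jaro-Winkler actually computes, here is a minimal, self-contained Java sketch. It is not the code or API of SecondString, Simmetrics, or LingPipe; the class name JaroWinkler and its method names are made up for illustration. It assumes the commonly used prefix scale of 0.1 and a prefix cap of 4 characters.

```java
// Hypothetical illustration only, not taken from any of the libraries above.
final class JaroWinkler {
    private JaroWinkler() {}

    // Jaro similarity in [0, 1]; 1.0 means the strings are identical.
    public static double jaro(String s, String t) {
        if (s.isEmpty() && t.isEmpty()) return 1.0;
        if (s.isEmpty() || t.isEmpty()) return 0.0;

        // Characters count as matching only within this distance of each other.
        int window = Math.max(0, Math.max(s.length(), t.length()) / 2 - 1);
        boolean[] sMatched = new boolean[s.length()];
        boolean[] tMatched = new boolean[t.length()];
        int matches = 0;

        for (int i = 0; i < s.length(); i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(t.length() - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!tMatched[j] && s.charAt(i) == t.charAt(j)) {
                    sMatched[i] = true;
                    tMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Walk the matched characters in order and count those that differ;
        // each transposition accounts for two such positions.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s.charAt(i) != t.charAt(k)) outOfOrder++;
            k++;
        }

        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / s.length() + m / t.length() + (m - transpositions) / m) / 3.0;
    }

    // Jaro-Winkler: boosts the Jaro score for strings sharing a common prefix.
    public static double jaroWinkler(String s, String t) {
        double jaro = jaro(s, t);
        int limit = Math.min(4, Math.min(s.length(), t.length()));
        int prefix = 0;
        while (prefix < limit && s.charAt(prefix) == t.charAt(prefix)) prefix++;
        return jaro + prefix * 0.1 * (1.0 - jaro);
    }
}
```

Under these assumptions, JaroWinkler.jaroWinkler("MARTHA", "MARHTA") comes out at roughly 0.961: all six characters match, there is one transposition, and the strings share a three-character prefix. For real record linkage work, the libraries listed above are a better starting point than a hand-rolled version.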

(updated 1/21/2008, 11/6/2008, 12/18/2008. This post is expanding beyond 'mining' to include other NLP tools)

I hope you found this list helpful; putting it together was a useful way to organize my bookmarks.

40 comments:

  1. Hi,

    This is a useful summary, thanks. I studied GATE for a month or so to get familiar with its functionality, but in the end I decided to write my own code and use command-line tools or other people's snippets: if something is not working, at least it is your own code and you know how to modify it. I found understanding the GATE classes more difficult than starting from scratch. I needed a transducer, so I tried JAPE and a few other transducers that come with GATE, but they all had weaknesses; for example, JAPE does not support empty tags. So I finally decided to build my own framework and add external parts to it rather than rely on GATE.

    ajoorabchi@hotmail.com

  2. Great work, Jeff. I found it really useful. Thanks!

  3. This is a really useful summary of text mining tools. Many thanks!

  4. Anonymous, 12:22 PM EDT

    Great work! Thank you very much!

  5. Great list of resources Jeff!

    UIUC's Cognitive Computation Group also has a suite of Java-based NLP tools:

    http://l2r.cs.uiuc.edu/~cogcomp/software.php

  6. Hi Jeff,

    I found this really useful. Great work!

  7. very nice man ...
    pretty useful.
    thanks.

  8. Great listing, Jeff!

    For sequential learning (useful for NLP tasks such as chunking, POS tagging, or information extraction), two tools in C++ developed by Taku Kudo are highly recommended: YamCha (http://chasen.org/~taku/software/yamcha/) and CRF++ (http://crfpp.sourceforge.net/). Both have similar interfaces, are open source, have a small footprint, and have pretty good documentation.

  9. Thanks for including LingPipe.

    While we do distribute LingPipe with source on the web without registration, we don't have a standard "open source" license. Here's a link to LingPipe's Royalty-free license.

    Like MySQL, we own our IP and can negotiate licenses on a per-customer basis.

  10. Hello, I am working on "Text mining and its potential applications" for my M.Phil degree. Please tell me how to develop an information extraction system. I want to extract unknown information from textual matter. (For example, if C is a language, the system must be able to decide that Java is also a programming language.)
    Please help me.
    Mr. U.S. Patki
    Dept of Computer Science
    Science College Nanded (M.S.)
    India.
    E-Mail uspatki@yahoo.com

  11. Excellent overview, Jeff. Thanks for collecting this insight. I realize the point here is to cover open-source packages, but web services that offer NLP functionality might also be included in the list, especially if they're free.

    We've had success using OpenCalais (http://opencalais.com), which is a commercial service but one with a pretty generous free offering, for entity extraction and document classification. It accepts text in a variety of formats (e.g. HTML) and transfer methods (e.g. http/REST), returning the results of several NLP processes in XML, JSON, RDF, etc. Their free service makes it easy for early-stage startups to get good results quickly.

    Thanks again,

    Mark Soper, Likematter

  12. Hi Jeff,

    Excellent overview, I found this really useful.

    Thanks
    Ajit

  13. Thanks Jeff!
    I just wanna say that Dan's last name is BikeL ;-)

  14. Thanks Jeff, this is absolutely great :)

  15. Thanks, man.
    I'm a student working toward my degree at Cochin University, India. Can you help me create a project based on web data mining using support vector machines?

    Nimish

  16. Thanks for the great post. I really find it useful.
    I'm wondering if there is a good tool for word stemming in Java. If you can recommend something, that would be great.

  17. The Lucid Imagination Solr Distribution contains an optimized Java version of the Krovetz stemmer.

  18. Thanks very much, Jeff. That's really helpful.

  19. Hi Jeff,

    Glad I (re)found this post...added it to my recent write-up on open source data mining tools from last Sunday's ACM data mining unconference in the Bay Area (see http://bixolabs.com/oss/open-source-data-mining-tools/)

    -- Ken

  20. The Open Source Text Mining Software RapidMiner features a lot of text preprocessing options and learning techniques like Naive Bayes and Support Vector Machines (SVM, i.e. LibSVM, JMySVM (similar to SVM^light), EvoSM, SMO SVM). Besides its native methods, it also includes all Weka methods. For more details:
    http://www.rapid-i.com/

  21. Anonymous, 5:07 AM EST

    Thanks, Jeff Dalton, for this interesting web page. I want to use the Al-Stem (Darwish) and Light10 (Larkey) stemmers in Java (for Arabic), but I can't find them. Can you help me, please?

  22. Hi,
    The information provided by you is good.

    As part of my master's thesis, I am going to do sentiment analysis. One of the documents I read on the subject said that sentiment analysis can be implemented with the OpenNLP Java API, so I tried that API. For a given sentence, we are able to do POS tagging with OpenNLP. After that, I don't know what to do next.


    If you have any idea on sentiment analysis can you please share with me.

    Thanks,
    Siva

  23. Great work, really useful!
    Thank you!

  24. Anonymous, 11:36 PM EDT

    Very helpful list, thanks!

  25. Anonymous, 5:50 AM EDT

    Hi!

    Thanks for the summary, very helpful for a newbie like me.
    I see you last updated it back in 2008. Is there a chance you'll update it again? :)

    Thanks,
    Amir

  26. It's probably time for an update, although most of the major ones (LingPipe, NLTK, OpenNLP, and Mallet) are all still quite relevant. There are a handful of newer projects that I'll try to add...

  27. Anonymous, 12:55 AM EDT

    Is there any tool out there that does a seemingly counter-intuitive thing: extract questions from a given text, on the assumption that the text contains the answers to those generated questions?

  28. Anonymous, 11:57 AM EDT

    Hello sir, I want to create a grammar that can tell us which word in a sentence is a verb, adverb, noun, and so on. Can you please help me out?

  29. Wow, a great composition. Jeff, it's really helpful.

  30. Anonymous, 7:06 PM EST

    Thanks a lot for the list. Very, very helpful!

  31. Dharshni, 9:33 AM EST

    Excellent help for my project...I got the tool i was searching for!!! Thank you so much Jeff...

  32. There is also a nice ML and DM tool called Orange.

    You can get it here: http://orange.biolab.si/

    Br,
    Slavko

  33. Could you give some comments about UIMA and compare it with some of the popular NLP tools you mentioned?
    Thanks a lot

  34. Abdul Jamil, 8:36 AM EDT

    Could you please show me how to calculate the individual features of Kea in Java?

  35. How should I categorize search results to provide personalization? Can you please tell me about that? Thank you.

  36. Indrani Gorti, 4:32 PM EDT

    Very helpful!

  37. Nice article with a list of NLP tools. I might use some of them for SEO.

    Thanks a lot

  38. This is great information, thanks for sharing!

  39. I am trying to use the Apache Lucene Mahout link, but it seems to be broken.
