Friday, March 23

Open Source Text Mining Tools and Resources - Part II

I realized soon after I posted my previous entry on text mining tools that I missed a few. Here are a few more to keep you busy:

OpenNLP - hosts a variety of java-based NLP tools which perform sentence detection, tokenization, pos-tagging, chunking and parsing, named-entity detection, and coreference using the OpenNLP Maxent machine learning package.

Text-Mining.org - a portal for news and information in the text mining community.

Kernel Machines - a portal on Support Vector Machines and kernel methods. It includes articles, tutorials, news, book references, and much more. If you want to learn about SVM classification, start here.

Carrot2 - Open source search result clustering software in Java. It interoperates with Lucene and works as an add-on for Nutch. The developers have gone on to create a commercial text clustering product called Lingo 3G.

Lingpipe and the stemming dilemma

Lingpipe has an interesting article on stemming: To Stem or not to Stem, that is the question. It opens with:

I’d never thought too hard about stemming per se, preferring to tackle the underlying problem in applications through character n-gram indexing (we solve the same problems facing tokenized classifiers using character language models)...
It then goes on to review some of the interesting papers on stemming, concluding that the verdict on stemming is still out.

Carp's bottom line:
What’s a poor computational linguist to do? One thing I’d recommend is to index both the word and its stem. If you’re using a TF/IDF-based document ranker, such as the one underlying Apache Lucene, you can index both the raw word and any number of stemmed forms in the same position and hope IDF sorts them out in the long run...
I agree: stemming can lead to interesting situations where stockings become confused with the stock market... and then searchers get women's lingerie when they are looking for financial information. Tread with care.

However, Carp's solution of indexing the stem in the same position can be problematic. If you index the raw word and the stemmed form at the same position, there should be a way to differentiate stemmed terms from non-stemmed terms at search time. Words that have been morphed through stemming may be less relevant than the original term, and different morphological changes alter a word's meaning to different degrees (pluralization versus a noun-verb conversion, for example). Also, if you do not keep stems as separate terms, stemming can have a significant impact on the IDF of common root terms -- perhaps leaving those root terms with so little weight that they become inconsequential when used on their own. An interesting (roughly equivalent) alternative to stemming is query expansion... more on this to follow.
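To make the stocking/stock confusion concrete, here is a toy sketch. The suffix-stripping "stemmer" below is deliberately naive and entirely my own invention -- it is not Porter's algorithm or anything a real indexer ships with -- but it shows how two unrelated words collapse to one index term, and how indexing both the raw word and the stem (as Carp suggests) at least preserves the original form.

```java
import java.util.HashSet;
import java.util.Set;

public class StemCollision {
    // Hypothetical, deliberately crude stemmer: strip a few common suffixes.
    static String stem(String word) {
        for (String suffix : new String[] {"ings", "ing", "s"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Index both the raw word and its stem, per Carp's recommendation.
    static Set<String> indexTerms(String word) {
        Set<String> terms = new HashSet<>();
        terms.add(word);
        terms.add(stem(word));
        return terms;
    }

    public static void main(String[] args) {
        // Lingerie and finance collapse to the same stem...
        System.out.println(stem("stockings")); // prints "stock"
        System.out.println(stem("stocks"));    // prints "stock"
        // ...but indexing both forms keeps "stockings" searchable as-is.
        System.out.println(indexTerms("stockings"));
    }
}
```

A real system would also need to mark which term is the stem (e.g., via a payload or a term prefix) so ranking can prefer exact matches, which is exactly the differentiation problem noted above.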

Thursday, March 22

NLP search and resume-ing my file transfer

Powerset, Google, are you listening?

Today I ran across a frustrating example of polysemy - resume. I was searching for some help getting my linux server to resume file downloads so I searched for

vsftpd file resume

I got back Seth Vidal's Resume. Hmm, not exactly what I was looking for, but perhaps I'll give him a call and ask for his help. Maybe it's a sign from Google. Ahh well. A few more iterations and I found what I was looking for, but even in several of my refinements it seemed impossible to entirely escape Seth.

I thought this was a good example of where I wanted to disambiguate my query -- I did not want someone's resume, I wanted to resume my file transfer. A good example of polysemy, and a case where a better understanding of my query would have been nice.

Please, give me a way to disambiguate or do a better job of reading my mind.

Tuesday, March 20

How to reset the value of a Java Array - the fast way

Re-using large arrays in Java sux, even with Sun's new JVMs.

The issue is that the naive method -- filling the array using Arrays.fill -- is slow.

Large arrays are expensive to create; there is significant overhead in creating and allocating the objects in a large array. Thus, it makes sense to re-use these large arrays once they have been created, a simple concept referred to as pooling. But apparently Java does not provide a way to reset them efficiently (at least natively... hack/trick to follow) -- the JVM's JIT is supposed to do the fancy optimization for you.
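Pooling itself is simple; here is a minimal sketch (the class and method names are my own for illustration, not from any library): borrow a pre-allocated buffer instead of newing one up, and return it when done. The catch, as discussed below, is that the caller still has to reset the borrowed buffer, and that reset is where the cost hides.

```java
import java.util.ArrayDeque;

public class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int size;

    public BufferPool(int bufferSize) {
        this.size = bufferSize;
    }

    // Hand out a pooled buffer, allocating only on a pool miss.
    public byte[] borrow() {
        byte[] buf = free.poll();
        return (buf != null) ? buf : new byte[size];
    }

    // Return a buffer to the pool for later re-use.
    // Note: the buffer still holds stale data; the next borrower
    // must reset it (e.g., with Arrays.fill -- the bottleneck here).
    public void release(byte[] buf) {
        if (buf.length == size) {
            free.push(buf);
        }
    }
}
```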

However, I seem to have found a case where the JIT is not optimizing what should be simple. I realize this is probably platform specific, so my rough platform is a recent Sun JDK, Windows 32-bit, Intel, with the JVM set to run in -server mode (and yes I provided an ample warm-up period for the JIT).

I recently profiled an application and discovered that one of the major bottlenecks was Arrays.fill! Looking at the code for Sun's Arrays.fill, I saw that all it does is a simple, naive loop over the contents of the array, assigning the same value again and again. This was not what I expected; I expected to see a native call, like System.arraycopy. Another programmer on my team commented that in C it is easy to reset the value of an array: just use memset to set the entire contents to a single value. Hmm, Sun's implementation appears to be sub-optimal, but the JIT should've fixed it.

My co-worker and I Googled around to see if others had found similar problems, and we came across an old IBM Research article from 1999 on the topic, Java server performance: A case study of building efficient, scalable Jvms:
"There is no corresponding memset operation in the Java language specification, but it is still necessary and common to set an entire array to a single value. This shortcoming has been a sore spot for Java application writers, and a number of clever techniques have been introduced to improve the efficiency of initializing a primitive array to a common value. The most common technique used is to initialize a smaller piece of the array and use the System.arraycopy call to fill in the rest of the array in an expanding binary fashion."
The authors go on to say that the IBM JVM's JIT is smart enough to convert the array copy to a memset operation. So, here is the solution described by the researchers above. This code is from the Java Programmer's FAQ - Part B:

public static void bytefill(byte[] array, byte value) {
    int len = array.length;
    if (len > 0) {
        array[0] = value;
    }
    // Double the filled region on each pass: copy the already-filled
    // prefix onto the next chunk until the whole array is set.
    for (int i = 1; i < len; i += i) {
        System.arraycopy(array, 0, array, i, ((len - i) < i) ? (len - i) : i);
    }
}
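For completeness, here is a self-contained sanity check of the trick (the demo class is my own; the fill logic matches the FAQ snippet above):

```java
public class ByteFillDemo {
    // Same expanding-binary fill as the FAQ snippet.
    public static void bytefill(byte[] array, byte value) {
        int len = array.length;
        if (len > 0) {
            array[0] = value;
        }
        for (int i = 1; i < len; i += i) {
            System.arraycopy(array, 0, array, i, Math.min(len - i, i));
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[1_000_000];
        bytefill(buf, (byte) 42);
        // First and last elements confirm the whole array was filled.
        System.out.println(buf[0] + " " + buf[buf.length - 1]); // prints "42 42"
    }
}
```

Whether this actually beats a plain loop depends on the JIT and platform, so benchmark it in your own environment before committing to it.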

There you go, a Java err... trick, hack, workaround of the day. Any thoughts as to why Sun's new JVMs wouldn't be optimizing this out? Hmm... perhaps I will try other JDKs and other platforms and see what happens.