Manning and Schutze define a collocation as follows: "A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things" (Manning and Schutze, 1999). Another definition, used by Columbia NLP researcher Frank Smadja, is: "A collocation is an arbitrary and recurrent word combination" (Benson 1990). Term co-occurrence is less strict: it simply refers to words that happen to occur in the same document.
Collocations are very interesting. They are unique combinations of words that have a special meaning, like "a stiff wind." In a technical context they tend to be technical terms. For example, the title of this article includes a CS/IR collocation, "information retrieval"; another might be "information extraction." For a full discussion of collocations you can read Chapter 5 of Manning & Schutze's book, which happens to be available online.

In my spare time, evenings and weekends, I have been writing a Java program that uses the Google API to extract interesting collocations from web search result summaries. It uses n-grams (sequences of words or characters) to identify interesting phrases. Hopefully, I'll have it working well enough to post online soon.
My current approach is not too sophisticated: I am using a list of stop words (a, the, is, of, etc.) to remove "uninteresting" phrases. If I have time I would like to pursue a more sophisticated approach that uses part-of-speech (POS) tagging.
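To make the stop-word idea concrete, here is a minimal sketch of how word bigrams could be counted while discarding any pair that contains a stop word. It is only an illustration under my own assumptions, not the actual program: the class name `BigramCandidates`, the tiny stop list, the letters-only tokenizer, and the hard-coded sample snippets (standing in for search result summaries) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class BigramCandidates {
    // Tiny illustrative stop list (an assumption); a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "is", "of", "and", "for", "to", "in"));

    // Lowercase the text and split on anything that is not an ASCII letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Count adjacent word pairs, skipping any pair that contains a stop word.
    static Map<String, Integer> countBigrams(List<String> snippets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String snippet : snippets) {
            List<String> tokens = tokenize(snippet);
            for (int i = 0; i + 1 < tokens.size(); i++) {
                String first = tokens.get(i);
                String second = tokens.get(i + 1);
                if (STOP_WORDS.contains(first) || STOP_WORDS.contains(second)) {
                    continue;
                }
                counts.merge(first + " " + second, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hard-coded sample snippets standing in for search result summaries.
        List<String> snippets = Arrays.asList(
                "Information retrieval is the science of search.",
                "An introduction to information retrieval and information extraction.",
                "Information extraction for the web.");
        // Print only the pairs that recur, since recurrence is part of the definition.
        countBigrams(snippets).forEach((phrase, n) -> {
            if (n >= 2) {
                System.out.println(phrase + "\t" + n);
            }
        });
    }
}
```

On these three sample snippets the only surviving pairs are "information retrieval" and "information extraction", each seen twice. Raw frequency plus a stop list is a crude filter, which is why a POS-based approach looks attractive as a next step.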
One example of this POS tagging approach is the Xtract program by Frank Smadja. I managed to find a copy of his paper online (not through the ACM portal): Retrieving Collocations from Text: Xtract. When I get around to it, I think it would be really interesting to combine Stanford's POS tagger with my program to see whether it produces better results.
Finally, I would also like to track down and read Justeson and Katz's paper: Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text, which I had some difficulty locating through common online sources.
I'd like to deploy my program as a web app so that people can try it out and see the results in real time without having to download and run the program.
Stay tuned.
Hi Jeff,
Look forward to the results of your experiments! An API which you might find useful (you may already be aware of it) is the Alias-i LingPipe API.
The Stanford parser is really nicely laid out and easy to work with. Dan Bikel also has a Java parser which you might find useful.
I agree, collocations are very interesting. Somewhat related are Amazon's SIPs (Statistically Improbable Phrases)... their poetic quirkiness makes one giggle... if only they were available through their API.
Thanks for listing the Smadja paper, I haven't seen that one before...