Thursday, December 15

Information Retrieval and other interesting collocations

In my spare time, on evenings and weekends, I have been doing some research (and programming) on collocations and term co-occurrence, especially in relation to their applications in web search. Let me begin with some definitions.

According to Manning and Schutze, a collocation is "an expression consisting of two or more words that correspond to some conventional way of saying things" (Manning and Schutze, 1999). Another definition, used by Columbia NLP researcher Frank Smadja, is that a collocation is "an arbitrary and recurrent word combination" (Benson 1990). Term co-occurrence is a less strict notion: it simply refers to words that happen to occur in the same document.

Collocations are very interesting. They are combinations of words that carry a special meaning, like "a stiff wind." In technical contexts, they often show up as technical terms. For example, the title of this article includes a CS/IR collocation, "information retrieval"; another might be "information extraction." For a full discussion of collocations, you can read Chapter 5 of Manning & Schutze's book, which happens to be available online.

Collocations can be used in web search in some interesting ways. One use is to guide the discovery of information: they give you ideas for keywords you might want to add to your query to make it more specific. One example of this is Ask Jeeves' "narrow your search" suggestions. Collocations are also used in clustering, but that is a story for another day.

In that same spare time, I have been writing a Java program that uses the Google API to extract interesting collocations from web search result summaries. It uses n-grams (sequences of words or characters) to identify interesting phrases. Hopefully, I'll have it working well enough to post online soon.
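The n-gram step is simple enough to sketch here. This is a minimal illustration, not the actual code from my program; the `NGrams` class name and `ngrams` method are just for this example. It slides a window of size n over the words of a snippet:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Extract all word n-grams of a given size from a snippet of text.
    static List<String> ngrams(String text, int n) {
        String[] words = text.toLowerCase().split("\\W+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder sb = new StringBuilder(words[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(words[i + j]);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Bigrams: [information retrieval, retrieval is, is fun]
        System.out.println(ngrams("Information retrieval is fun", 2));
    }
}
```

Counting how often each n-gram recurs across many result summaries is then just a matter of accumulating them in a map.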

My current approach is not too sophisticated: I am using a list of stop words (a, the, is, of, etc.) to remove "uninteresting" phrases. If I have time, I would like to pursue a more sophisticated approach that uses part-of-speech (POS) tagging.
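The stop-word filter amounts to something like the following sketch (the `StopFilter` name and the tiny stop list are illustrative; a real stop list would be much longer). A phrase survives only if none of its words appear in the stop list:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopFilter {
    // A tiny illustrative stop list; a real one has hundreds of entries.
    static final Set<String> STOP =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to", "in"));

    // Keep only phrases in which no word is a stop word.
    static List<String> filter(List<String> phrases) {
        return phrases.stream()
            .filter(p -> Arrays.stream(p.split(" ")).noneMatch(STOP::contains))
            .collect(Collectors.toList());
    }
}
```

So "information retrieval" survives, while "the web" or "of text" would be thrown out.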

One example of this POS tagging approach is the Xtract program by Frank Smadja. I managed to find a copy of his paper online (not through the ACM portal): "Retrieving Collocations from Text: Xtract." When I get around to it, I think it would be really interesting to combine Stanford's POS tagger with my program to see if I can produce better results.
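To give a sense of what the POS-based filter might look like: once a tagger has labeled each word, you can keep only bigrams whose tag sequence matches patterns that tend to mark good collocations, such as adjective-noun or noun-noun (this is the kind of tag filter Manning & Schutze discuss in Chapter 5). A minimal sketch, assuming Penn Treebank-style tags (JJ for adjective, NN for noun) and with the `PosFilter` class name being my own:

```java
import java.util.regex.Pattern;

public class PosFilter {
    // Tag patterns that tend to mark good two-word collocations:
    // adjective-noun ("JJ NN") or noun-noun ("NN NN").
    static final Pattern GOOD = Pattern.compile("(JJ|NN) NN");

    // tags: the POS tag sequence for a bigram, e.g. "JJ NN"
    static boolean keep(String tags) {
        return GOOD.matcher(tags).matches();
    }
}
```

So a tagged bigram like "stiff/JJ wind/NN" would pass, while "the/DT wind/NN" would not, which should subsume most of what the stop list catches today.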

Finally, I would also like to track down and read Justeson and Katz's paper, "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text," which I had some difficulty locating through common online sources.

I'd like to deploy my program as a web app so that people can try it out and see the results in real time without having to download and run the program.

Stay tuned.

1 comment:

  1. Hi Jeff,

Look forward to the results of your experiments! An API which you might find useful (you may already be aware of it) is the
    Alias-i LingPipe API.

The Stanford parser is really nicely laid out and easy to work with. Dan Bikel also has a Java parser which you might find useful.

    I agree, collocations are very interesting. Somewhat related are Amazon's SIPs (Statistically Improbable Phrases)... their poetic quirkiness makes one giggle... if only they were available through their API.

Thanks for listing the Smadja paper, I haven't seen that one before...