In my spare time at home, I am writing a parser to extract interesting phrases from search result snippets. In other words, I am looking for phrases with a high degree of co-occurrence across search results. For now, I'm using the Google API for the snippets. I think co-occurrences have lots of interesting applications: query narrowing, clustering, information exploration, etc. Not to mention it's a fun way to play with the Google API and an excuse to learn J2EE/JSP/servlets.
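To make "co-occurrence across search results" concrete, here is a rough sketch of the counting step: for every word n-gram, count how many snippets contain it. The class name `PhraseCounter`, the method names, and the simple lowercase/alphanumeric tokenizer are all assumptions for illustration, not the code from my parser, and the snippets are passed in as plain strings rather than fetched from the Google API.

```java
import java.util.*;

class PhraseCounter {
    // Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For each n-gram (n = 2..maxN), count the number of snippets that contain it.
    // A phrase is counted at most once per snippet.
    static Map<String, Integer> snippetFrequency(List<String> snippets, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String snippet : snippets) {
            List<String> words = tokenize(snippet);
            Set<String> seen = new HashSet<>();
            for (int n = 2; n <= maxN; n++) {
                for (int i = 0; i + n <= words.size(); i++) {
                    seen.add(String.join(" ", words.subList(i, i + n)));
                }
            }
            for (String phrase : seen) {
                counts.merge(phrase, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Phrases that show up in many snippets are the candidates worth keeping; the stop word filtering discussed next would be applied before or after this count.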
In my program, I am using a stop word list to help narrow the list of candidate phrases. I was inspired by Manning and Schutze's approach to finding collocations (Statistical NLP, page 157) where they use a stop list to exclude words that are not verbs, nouns, or adjectives. This can be contrasted with Justeson and Katz, who use a part-of-speech tagging approach and limit the n-grams to some common grammatical patterns.
POS tagging is nice, but it's a bit of overkill for a first pass -- so I am sticking to the stop list approach. I've started compiling a stop word list from various sources. The question is this: what is the best way to apply it?
Do stop words mark the end of a phrase/collocation or can they occur within them? Words in the stop list are not one of the three grammatical types listed by Manning and Schutze, so by that reasoning they should not be included in the phrases. However, Oren Zamir, Etzioni, and Selberg took a slightly different approach when they added clustering to Metacrawler in the Grouper project. Grouper removed stop words from the beginning and end of the cluster names, but left them in the middle. This required that they post-process the phrase/cluster list in order to remove leading and trailing stop words.
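The Grouper-style post-processing step can be sketched roughly as follows. This is a minimal illustration under my own assumptions, not Grouper's actual code: the class name `StopWordTrimmer`, the tiny stop list, and the representation of a phrase as a list of tokens are all hypothetical.

```java
import java.util.*;

class StopWordTrimmer {
    // Tiny illustrative stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "and", "for", "in", "of", "the", "to"));

    static boolean isStopWord(String token) {
        return STOP_WORDS.contains(token.toLowerCase());
    }

    // Strip stop words from both ends of a phrase, leaving any in the middle alone.
    static List<String> trim(List<String> phrase) {
        int start = 0;
        int end = phrase.size();
        while (start < end && isStopWord(phrase.get(start))) {
            start++;
        }
        while (end > start && isStopWord(phrase.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(phrase.subList(start, end));
    }
}
```

With this sketch, the tokens "the Food and Drug Administration of" come back as "Food and Drug Administration": the interior "and" survives, while a phrase made up entirely of stop words trims down to nothing and can be discarded.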
I started to take the approach that stop words marked the end of my co-occurrence (consistent with Manning and Schutze's POS restrictions), but now I'm not so sure. This might limit the phrases that I discover, but perhaps not by much. Look at the clusters on Clusty (try "food") -- there aren't many with stop words, except when they are part of an entity name, like the "Food and Drug Administration." What do you think? How many useful clusters do you find with stop words in the middle of them? I'm beginning to think the Grouper approach is better for this application since I am not strictly looking for simple collocations. POS patterns, like those used by Justeson and Katz, would be an elegant solution, but I'm going to blow that one off for V1.0.
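For comparison, here is a rough sketch of the approach I started with, where every stop word ends the current candidate phrase. As before, the class name `StopWordSplitter` and the small stop list are assumptions made up for illustration.

```java
import java.util.*;

class StopWordSplitter {
    // Tiny illustrative stop list; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "and", "for", "in", "of", "the", "to"));

    // Treat each stop word as a boundary: tokens between stop words form one candidate phrase.
    static List<List<String>> split(List<String> tokens) {
        List<List<String>> phrases = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : tokens) {
            if (STOP_WORDS.contains(token.toLowerCase())) {
                // A stop word closes the phrase collected so far, if there is one.
                if (!current.isEmpty()) {
                    phrases.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(token);
            }
        }
        if (!current.isEmpty()) {
            phrases.add(current);
        }
        return phrases;
    }
}
```

Running this on the tokens "the Food and Drug Administration" yields two candidates, "Food" and "Drug Administration", which is exactly the entity-name case where the boundary approach loses a phrase that trimming only the ends would keep.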
I am also trying to decide whether or not to take what I've built so far and integrate it with LingPipe. A reader recommended it to me in the past for this project and it looks really cool. It has built-in filters, word tagging, sentence boundary detection, and more. My concern is that it might be a little overkill. In addition, it can be more instructive to do things yourself. However, in the process you often end up reinventing the wheel.
I plan on putting a JSP/Servlet front-end on the system. If anyone has a stable Tomcat server they'd be willing to provide access to, that might help things along. In the meantime, I'll run it off my machine and see if I can't find something better. Weddings have a wonderful way of sucking up spare cash, so a second machine isn't an option for me right now.
That's all for now. Now, back to my coding... and the world of Eclipse!