Friday, January 13

To stop or not to stop, that is the question

In my spare time at home, I am writing a parser to extract interesting phrases from search result snippets. In other words, I am looking for phrases with a high degree of co-occurrence across search results. For now, I'm using the Google API for the snippets. I think co-occurrences have lots of interesting applications: query narrowing, clustering, information exploration, etc. Not to mention it's a fun way to play with the Google API and an excuse to learn J2EE/JSP/servlets.

In my program, I am using a stop word list to help narrow the list of candidate phrases. I was inspired by Manning and Schutze's approach to finding collocations (Statistical NLP, page 157), where they use a stop list to exclude words that are not verbs, nouns, or adjectives. This can be contrasted with Justeson and Katz, who use a part-of-speech tagging approach and limit the n-grams to some common grammatical patterns.

POS tagging is nice, but it's a bit overkill for a first pass -- so I am sticking to the stop list approach. I've started compiling a stop word list from various sources. The question is this: what is the best way to apply it?
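To make the stop-list idea concrete, here is a minimal sketch of the filter; the class name, the tiny stop list, and the n-gram representation are all made up for illustration (a real list compiled from various sources would be far larger):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopListFilter {
    // Tiny illustrative stop list; the real compiled list is much larger.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "to", "in", "for"));

    // Manning-and-Schutze-style filter: keep an n-gram as a candidate
    // phrase only if none of its words appears on the stop list.
    public static boolean isCandidate(List<String> ngram) {
        for (String word : ngram) {
            if (STOP_WORDS.contains(word.toLowerCase())) {
                return false;
            }
        }
        return true;
    }
}
```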

Do stop words mark the end of a phrase/collocation, or can they occur within one? Words on the stop list are not one of the three grammatical types allowed by Manning and Schutze, so by that logic they should not be included in the phrases at all. However, Oren Zamir, Etzioni, and Selberg took a slightly different approach when they added clustering to Metacrawler in the Grouper project. Grouper removed stop words from the beginning and end of the cluster names, but left them in the middle. This required that they post-process the phrase/cluster list in order to remove leading and trailing stop words.

I started to take the approach that stop words marked the end of my co-occurrence (consistent with Manning and Schutze's POS restrictions), but now I'm not so sure. This might limit the phrases that I discover, but perhaps not by much. Look at the clusters on Clusty (try "food") -- there aren't many with stop words, except when they are part of an entity name, like the "Food and Drug Administration." What do you think? How many useful clusters do you find with stop words in the middle of them? I'm beginning to think the Grouper approach is better for this application, since I am not strictly looking for simple collocations. POS patterns, like those used by Justeson and Katz, would be an elegant solution, but I'm going to blow that one off for V1.0.
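For comparison, the Grouper-style post-processing step is easy to sketch. Again, the stop list and names here are my own illustration, not the actual Grouper code:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PhraseTrimmer {
    // Tiny illustrative stop list, shared with the candidate filter.
    private static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "to", "in", "for"));

    // Grouper-style post-processing: strip stop words from the ends of a
    // candidate phrase, but leave interior ones alone, so a name like
    // "Food and Drug Administration" survives intact.
    public static List<String> trim(List<String> phrase) {
        int start = 0;
        int end = phrase.size();
        while (start < end && STOP_WORDS.contains(phrase.get(start).toLowerCase())) {
            start++;
        }
        while (end > start && STOP_WORDS.contains(phrase.get(end - 1).toLowerCase())) {
            end--;
        }
        return phrase.subList(start, end);
    }
}
```

Note that a phrase made entirely of stop words trims down to nothing, which is exactly the behavior you'd want before ranking cluster names.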

I am also trying to decide whether or not to take what I've built so far and integrate it with LingPipe. A reader recommended it to me in the past for this project, and it looks really cool. It has built-in filters, word tagging, sentence boundary detection, and more. My concern is that it might be a little overkill. In addition, it can be more instructive to do things yourself. However, in the process you often end up re-inventing the wheel.

I plan on putting a JSP/servlet front-end on the system. If anyone has a stable Tomcat server they'd be willing to provide access to, that might help things along. In the meantime, I'll run it off my machine and see if I can't find something better. Weddings have a wonderful way of sucking up spare cash, so a second machine isn't an option for me right now.

That's all for now. Back to my coding... and the world of Eclipse!

Wednesday, January 11

GlobalSpec's Most Memorable Spam of 2005: Dogpile Cloakers

I am going to steal a page from Matt Cutts and talk about spam. GlobalSpec's Engineering Web is a gated community covering only the engineering domain. Considering the limited scope, one might think that fighting spam would be easier than out there in the wilder "horizontal web" -- the world of GYM, right? Wrong. While GlobalSpec may not have to deal with Britney Spears and company (as much), it may surprise you to find that there are lots of people out there targeting the engineering domain with spam.

Over the past year, I've seen pretty much every type of spam the major SEs deal with. I've also seen major improvements in fighting spam in 2005. There was spam of every kind: domain hijacking, re-purposed content (ODP/Wikipedia), dynamic content generators, link spam, splogs, etc. (FYI, if you want a good overview of web spam, check out Marc Najork's WebSpam presentation to the UC SIMS 141 class.) However, one of the most difficult and insidious spam techniques from a search engine's perspective is cloaking. Cloaking is sending search engine spiders different content than users see when they visit the page. Search Engine World has a good overview of the different types of cloaking spam.

My two most memorable spammers of 2005 are two cloaking sites that we caught. Both of these sites use referrer-based cloaking. It's really sneaky. Let me illustrate:

The URL:

Now look at this cloaked version, served when you come in from MSN search results:

Now, I won't pass judgment on Dogpile, but the webmaster doesn't seem to get any tangible benefit, unlike Google AdSense spammers. Dogpile, on the other hand, gets to leech off of other search engines' results and gain traffic, and therefore money. I'll let you investigate and draw your own conclusions.

Here's how the spammers in this network operate. They use referrer-based cloaking. When you come in from a search engine, they detect the external referrer (and also direct navigation with no referrer) and inject a few seemingly innocuous lines of HTML into the page. Here is the code they inject (with the opening < removed):

frame src="'" qkw="site%3acold%20forming%20company%20cold%20forged&qcat=">

script language="JavaScript">location.replace(

Whoa. The content is still there, but it is effectively hidden by the injected frames and JavaScript, which were not previously in the page. If you look at the bottom of the results page, there is a one-row-high frame, which is where all the old content is displayed.
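To make the mechanism concrete, here is a rough sketch of the referrer check such a cloaker performs. The class name, the host list, and the exact logic are my guesses based on the observed behavior, not the spammers' actual code:

```java
public class ReferrerCloakCheck {
    // Referrer substrings that mark a visitor as arriving from a search
    // engine (hypothetical list for illustration).
    private static final String[] ENGINE_HOSTS = {
            "google.", "search.msn.", "search.yahoo."
    };

    // Returns true when the request should receive the cloaked page with
    // the injected frame/script: an external search-engine referrer, or
    // direct navigation with no referrer at all.
    public static boolean shouldCloak(String referer) {
        if (referer == null || referer.isEmpty()) {
            return true;
        }
        for (String host : ENGINE_HOSTS) {
            if (referer.contains(host)) {
                return true;
            }
        }
        return false;
    }
}
```

In a servlet this check would hang off the request's Referer header; the point is just how little logic referrer cloaking actually takes.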

What's interesting is to examine which SEs have caught the above sites and which ones haven't. How good are the major SEs at detecting referrer cloaking spam? Google has not indexed MSN has pages and Yahoo has only their homepage. On the other hand, Google has indexed, Yahoo again only has the homepage, but MSN does not have it indexed. Clearly, even the big three search engines have mixed success dealing with this type of spam.
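The detection side can be sketched just as simply: fetch the page once the way the spider saw it and once as a user arriving from a search engine, then compare. The method below is a naive illustration under that assumption, not how any of the big three actually do it:

```java
public class CloakDetector {
    // Naive referrer-cloaking check on two already-fetched copies of a
    // page: one as the spider saw it, one fetched with a search-engine
    // referrer. A real detector would first normalize legitimately
    // dynamic content (ads, timestamps) before comparing.
    public static boolean looksCloaked(String spiderBody, String refererBody) {
        if (spiderBody.equals(refererBody)) {
            return false;
        }
        // Injected frames or redirect scripts are a strong signal.
        String lower = refererBody.toLowerCase();
        return lower.contains("<frame") || lower.contains("location.replace");
    }
}
```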

FYI, the two sites mentioned above are part of a much larger spam network. Here is a small sampling of the sites:

And the list is much longer than that. My, what a nice little spam network they've got there.

My prediction for 2006 is that this type of spam will become an even bigger problem for SEs. Search engines of every kind will need to devote more resources to shoring up their defenses and weeding out crap like the above to stay relevant. I hope to see blacklists of known referrer spammers, which would make it easier to filter these sites out of results.