Thursday, March 23

Diffing search engine stop word lists

Here is an interesting question: What words does Google ignore? I believe it must be query dependent. Try a search for "where how" and then "where how computers work". The words "where" and "how" are not ignored in the first query, but they are in the second.

Nonetheless, there might be a "standard" set of words that Google always ignores, along with query-dependent words in some cases.

I have yet to find a complete or up-to-date list of stop words (aka words Google "ignores"). It doesn't really ignore them for the purposes of ranking, but that's a whole 'nuther story.

What about Yahoo or MSN? Stop words are application/domain specific, so what words are the search engines using? Do they differ, and if so, how?

Perhaps I'll experiment with this; it shouldn't be hard to figure out. In the meantime, has anyone already done this?
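
Once the lists are collected, the diff itself is just set arithmetic. Here is a rough Python sketch of how it might look. The engine names and word lists are made-up placeholders, not real observations from Google, Yahoo, or MSN, and diff_stop_words is a hypothetical helper name. Gathering the actual lists (by running queries and noting which words get dropped) is the hard part and isn't shown.

```python
def diff_stop_words(lists):
    """Compare named stop word lists.

    Returns (common, unique): the words present in every list, and a
    dict mapping each list name to the words found only in that list.
    Expects at least one list.
    """
    # Normalize case so "The" and "the" count as the same word.
    sets = {name: {w.lower() for w in words} for name, words in lists.items()}

    # Words that every engine ignores.
    common = set.intersection(*sets.values())

    # Words that only one engine ignores.
    unique = {}
    for name, words in sets.items():
        others = set().union(*(s for n, s in sets.items() if n != name))
        unique[name] = words - others
    return common, unique


# Placeholder data: NOT real stop word lists from any search engine.
observed = {
    "engine_a": ["the", "of", "where", "how", "a"],
    "engine_b": ["the", "of", "a", "and"],
    "engine_c": ["the", "of", "a", "where", "is"],
}

common, unique = diff_stop_words(observed)
print("ignored by all:", sorted(common))
for name in sorted(unique):
    print(f"only {name}:", sorted(unique[name]))
```

Since stopping appears to be query dependent, a single flat list per engine is probably too simple. A more faithful experiment would likely record which query each word was dropped in.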

One thing that continually impresses me about Google is their attention to the "little things" that make search incrementally better. Dynamic stopping is one of those things. Another is abbreviation identification (try a search for "ACM" and see that Google highlights "Association for Computing Machinery" in the search results).