Tuesday, December 7

Barriers to Entry in Search Getting Lower

The Mim's Bits column in the MIT Tech Review has an article, You, Too Can Be the Next Google. In the article, Tom Annau, the CTO of blekko (see my previous post) argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.
"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...
To be more efficient, they are more careful about what they crawl by:
  1. Avoiding crawling spam and splog content
  2. Using a "split-crawl" strategy that refreshes different genres of content at different rates to ensure that blogs and news are refreshed often.
I'm not sure blekko's "efficiency" techniques are particularly interesting or novel. However, I do think that overall the ability to crawl and index the entire web is getting easier, especially with distributed crawlers (like Bixo).
"Whether we succeed or fail as as startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.
The recent Mei and Church paper in 2008, Entropy of search logs: how hard is search? with personalization? with backoff?, analyzed a large search engine log to determine the size of this 'interesting' part of the web. They find that they can encode the URLs from search logs using approximately 22 bits, millions of pages. As they say,
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.
In principle, if you knew these pages and had a way of accurately predicting which ones change, then the price of search can be significantly reduced. In the paper, they go on to highlight that a personalized page cache or one based on profiles of similar users offers an even greater opportunity. In short, there is great opportunity for small very personalized verticals.

I think the main reason that blekko needs a modest number of servers is that its query volume is small. One of the key reasons that Google and other web search engines need thousands and thousands of computers is to support very fast query latency for billions of queries per day from hundreds of millions users around the world. To pull this off Google keeps its search index in memory (see Jeff Dean WSDM 2009 keynote).

Wednesday, December 1

Google Fixes DecorMyEyes.com Problem

This afternoon Amit Singhal, from Google Search Quality, wrote a blog post about how Google fixed the recent DecorMyEyes.com fiasco. The store was broken by the NY Times article exposing how a disreputable merchant gained high ranking by being mean to his customers. He gained links and reputation by being written up negatively on many popular and important sites.

What's interesting is Amit's post is the insight into how Google's team approached solving the problem.

1) Only blocking the site would not solve the underlying problem.
2) Sentiment analysis wouldn't solve the problem because reputation was coming from neutral news sites with solid reputations. Google has not yet found a useful way to incorporate sentiment into ranking.
3) Expose the reviews and ratings next to the results, but this would not actually alter the ranking.

Their new secret undisclosed fixed detected the problematic merchant and several hundred other bad apples.