Tuesday, December 7

Barriers to Entry in Search Getting Lower

The Mims's Bits column in the MIT Tech Review has an article, You, Too, Can Be the Next Google. In it, Tom Annau, the CTO of blekko (see my previous post), argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.
"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...
To be more efficient, blekko is more careful about what it crawls and how often, by:
  1. Avoiding crawling spam and splog content
  2. Using a "split-crawl" strategy that refreshes different genres of content at different rates, so that blogs and news are refreshed often (a rough sketch of such a scheduler follows this list)
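As a rough illustration of the split-crawl idea, here is a minimal sketch in Python of a scheduler that re-enqueues URLs at genre-specific refresh intervals. The genres, intervals, and class name are my own invented placeholders, not anything blekko has described:

```python
import heapq
import time

# Hypothetical refresh intervals per content genre, in seconds.
# The genres and numbers are illustrative, not blekko's actual policy.
REFRESH_INTERVAL = {
    "news": 15 * 60,             # recrawl news every 15 minutes
    "blog": 6 * 60 * 60,         # blogs a few times a day
    "reference": 7 * 24 * 3600,  # mostly-static pages weekly
}

class SplitCrawlScheduler:
    """Priority queue of (next_fetch_time, url, genre) entries."""

    def __init__(self):
        self._heap = []

    def add(self, url, genre, now=None):
        now = now if now is not None else time.time()
        heapq.heappush(self._heap, (now, url, genre))

    def pop_due(self, now=None):
        """Yield URLs whose next fetch time has arrived, re-enqueueing
        each one at its genre-specific refresh interval."""
        now = now if now is not None else time.time()
        while self._heap and self._heap[0][0] <= now:
            _, url, genre = heapq.heappop(self._heap)
            heapq.heappush(
                self._heap, (now + REFRESH_INTERVAL[genre], url, genre))
            yield url

scheduler = SplitCrawlScheduler()
scheduler.add("http://example.com/breaking-story", "news")
scheduler.add("http://example.com/2010/12/some-post", "blog")
for url in scheduler.pop_due():
    print("fetch", url)
```

A real crawler would also enforce per-host politeness and adapt intervals from observed change rates, but the core of a split crawl is just this per-genre scheduling policy.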
I'm not sure blekko's "efficiency" techniques are particularly interesting or novel. However, I do think that crawling and indexing the entire web is, overall, getting easier, especially with distributed crawlers (like Bixo).
"Whether we succeed or fail as as startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.
The 2008 Mei and Church paper, Entropy of search logs: how hard is search? with personalization? with backoff?, analyzed a large search engine log to estimate the size of this 'interesting' part of the web. They find that the URLs in the search logs can be encoded using approximately 22 bits each, which corresponds to only millions of pages (2^22 is roughly 4 million). As they say,
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.
In principle, if you knew these pages and had a way of accurately predicting which ones change, then the price of search could be significantly reduced. The paper goes on to highlight that a personalized page cache, or one based on the profiles of similar users, offers an even greater opportunity. In short, there is a great opportunity for small, very personalized verticals.
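To make the 22-bit figure concrete: the entropy of the clicked-URL distribution in a log is roughly the number of bits needed per clicked URL, and 2^22 is only about 4 million. Here is a toy Shannon-entropy computation over an invented click log (the data and the simple estimator are mine, not the paper's):

```python
from collections import Counter
from math import log2

# Invented toy click log: in real search logs a few URLs dominate the clicks.
clicks = ["wikipedia.org/A"] * 50 + ["nytimes.com/B"] * 30 + \
         ["example.com/C"] * 15 + ["rare-site.com/D"] * 5

counts = Counter(clicks)
total = sum(counts.values())
# Shannon entropy of the clicked-URL distribution, in bits per click.
entropy = -sum((c / total) * log2(c / total) for c in counts.values())
print(f"{entropy:.2f} bits per clicked URL")  # ~1.65 bits for this toy log
# Mei and Church report roughly 22 bits on a real log, i.e. the clicked
# portion of the web behaves like only 2**22 ~= 4 million pages.
print(2 ** 22)  # 4194304
```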

I think the main reason blekko can get by with a modest number of servers is that its query volume is small. One of the key reasons Google and other web search engines need thousands and thousands of computers is to support very fast query latency on billions of queries per day from hundreds of millions of users around the world. To pull this off, Google keeps its search index in memory (see Jeff Dean's WSDM 2009 keynote).
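As a cartoon of why the index is kept in RAM, here is a toy in-memory inverted index in Python (my own illustration, not Google's or blekko's actual architecture): a conjunctive query is answered by intersecting postings sets that already live in memory, so nothing on the query path touches disk.

```python
from collections import defaultdict

class InMemoryIndex:
    """Toy memory-resident inverted index: term -> set of doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """AND query: intersect postings sets, smallest first."""
        lists = sorted((self.postings.get(t, set())
                        for t in query.lower().split()), key=len)
        if not lists:
            return set()
        result = set(lists[0])
        for postings in lists[1:]:
            result &= postings
        return result

index = InMemoryIndex()
index.add(1, "barriers to entry in web search")
index.add(2, "distributed web crawlers and indexing")
print(index.search("web search"))  # {1}
```

The thousands of machines then come from sharding and replicating an index like this so that billions of queries per day can all be served out of RAM at low latency.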

2 comments:

  1. I haven't read the Mei and Church paper, so I could be off-track here. Relevance rankers are predominantly popularity-biased (e.g., PageRank), and a reinforcing mechanism comes into play, resulting in a much smaller number of pages being visited. Analyzing search engine logs will reflect this user behavior, since in the majority of cases users only consider the top few results.

  2. @Jeff, I'm sure that the web search engines do track what will probably change, and check that more frequently. Infoseek/Ultraseek had a cute algorithm with a minimum and maximum time between URL checks; it calculated each URL's rate of change and adjusted the schedule based on that.
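    I don't know the actual Infoseek/Ultraseek scheduler, but an adaptive scheme in that spirit might look like this sketch, where the halving/doubling rule and the specific bounds are my assumptions:

```python
MIN_INTERVAL = 3600          # assumed floor: recheck at most hourly
MAX_INTERVAL = 30 * 86400    # assumed ceiling: recheck at least monthly

def next_interval(current_interval, changed):
    """Adaptive recrawl interval: halve it when the page changed since
    the last fetch, double it when it did not, clamped to [MIN, MAX]."""
    if changed:
        interval = current_interval / 2
    else:
        interval = current_interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))

# Example: a page that keeps changing converges toward the hourly floor,
# while a static page drifts toward the monthly ceiling.
interval = 86400.0
for changed in [True, True, False, False, False]:
    interval = next_interval(interval, changed)
    print(interval)
```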

    I wonder if anyone's using fast-moving data like twitter & facebook feeds to prioritize URL crawling. That would overcome some of the conservatism of inlink ranking.

    @Dinesh - pretty much everyone who does more specific search-log clickthrough analysis is *very* much aware of the popularity of the top few results and adjusts for that.
