The Mim's Bits column in the MIT Tech Review has an article,
You, Too Can Be the Next Google. In the article, Tom Annau, the CTO of
blekko (see my
previous post) argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.
"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...
To be more efficient, they are more careful about what they crawl by:
- Avoiding crawling spam and splog content
- Using a "split-crawl" strategy that refreshes different genres of content at different rates to ensure that blogs and news are refreshed often.
I'm not sure blekko's "efficiency" techniques are particularly interesting or novel. However, I do think that overall the ability to crawl and index the entire web is getting easier, especially with distributed crawlers (like
Bixo).
"Whether we succeed or fail as as startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.
Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.
In principle, if you knew these pages and had a way of accurately predicting which ones change, then the price of search can be significantly reduced. In the paper, they go on to highlight that a personalized page cache or one based on profiles of similar users offers an even greater opportunity. In short, there is great opportunity for small very personalized verticals.
I think the main reason that blekko needs a modest number of servers is that its query volume is small. One of the key reasons that Google and other web search engines need thousands and thousands of computers is to support very fast query latency for billions of queries per day from hundreds of millions users around the world. To pull this off Google keeps its search index in memory (see
Jeff Dean WSDM 2009 keynote).