Jeff Dean gave a keynote, "Building Large Scale Information Retrieval Systems," at WSDM 2009. My labmate, Michael Bendersky, attended the talk and was nice enough to share his notes. Here are his notes, slightly edited.
Update on Mar 10: Jaap Kamps noted that the slides have now been posted online.
The video is now available on VideoLectures.
Change in scale 1999 - 2009
- 100x documents
- 10,000x queries
- update latency 10,000x faster
- query latency: <1s -> 0.2s, a 5x decrease
In the late 90s
- Batch crawling system that stopped when 'enough' pages had been crawled
- Batch indexing process stitched together with Unix tools; prone to machine failures and inconsistencies.
- The original '97 index format was a simple byte-aligned system that encoded field and presentation information for word 'hits' (word occurrences). It required a lot of disk access.
- Moved to a new block-based, variable-length index format that used skip tables for words with large numbers of occurrences. This reduced index size by 30% and was faster to decode (a rough sketch of this kind of encoding follows after this list).
- Added cache servers for both results and document snippets
- In early 2001 they moved to an in-memory index where index servers (along with doc servers, cache servers, etc.) talked directly to the front-end web servers (see the system diagram in the slides).
- The index was partitioned by document rather than by term (a toy partitioning sketch follows after this list).
- In-house design from the ground up: rack design, PC-class motherboards, Linux, and in-house software (GFS, BigTable, etc.)
- Indexing was built using the MapReduce framework (a toy MapReduce-style example follows after this list)
- In 2004 they moved to a hierarchical system for serving indexes, built on top of indexes hosted in GFS (now only the "root server" handles requests from the web server).
- 'Fast' index updates
- In 2007 they added a 'super root' server that communicates with all of the index servers for the vertical collections, enabling 'Universal Search' (a toy sketch of the serving hierarchy follows after this list)
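The block-based, variable-length format with skip tables mentioned above can be sketched roughly as follows: doc IDs are delta-encoded and varint-compressed, and every so often the absolute doc ID and byte offset are recorded in a skip table so long posting lists can be traversed without decoding every entry. This is a minimal illustration of the general technique, not Google's actual format; the block size and helper names are assumptions.

```python
# Minimal sketch of a varint-compressed posting list with a skip table.
# Not Google's actual index format; block size and names are illustrative.

SKIP_INTERVAL = 128  # assumed number of postings per block


def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer as a variable-length byte string."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_postings(doc_ids):
    """Delta + varint encode sorted doc IDs; return the bytes plus a skip
    table of (first doc ID in block, byte offset of block) pairs."""
    data, skips, prev = bytearray(), [], 0
    for i, doc_id in enumerate(sorted(doc_ids)):
        if i % SKIP_INTERVAL == 0:
            skips.append((doc_id, len(data)))
            prev = 0  # deltas restart at block boundaries so decoding can begin here
        data += varint_encode(doc_id - prev)
        prev = doc_id
    return bytes(data), skips


def first_doc_at_or_after(data, skips, target):
    """Use the skip table to jump near `target`, then decode forward."""
    offset = prev = 0
    for doc_id, byte_offset in skips:  # last block starting at or before target
        if doc_id > target:
            break
        offset, prev = byte_offset, 0
    while offset < len(data):
        gap = shift = 0
        while True:  # decode one varint
            byte = data[offset]
            offset += 1
            gap |= (byte & 0x7F) << shift
            if not byte & 0x80:
                break
            shift += 7
        prev += gap
        if prev >= target:
            return prev
    return None
```

Skipping matters most for very common words, where decoding millions of postings just to intersect them with a rare term would otherwise dominate query time.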
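The "partitioned by document rather than by term" point can be illustrated with a toy fan-out: each shard holds a full (small) inverted index over its own slice of the corpus, every query goes to every shard, and the per-shard results are merged. The class names, scoring, and sharding function below are invented for the example.

```python
# Toy sketch of document-partitioned serving (illustrative names/scoring only).
from collections import defaultdict


class DocShard:
    """One index server: a complete inverted index over a slice of the corpus."""

    def __init__(self):
        self.index = defaultdict(set)   # term -> local doc IDs
        self.docs = {}                  # doc ID -> text

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def search(self, query):
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.index[t] for t in terms))
        return [(doc_id, 1.0 / len(self.docs[doc_id])) for doc_id in hits]  # toy score


def doc_partitioned_search(shards, query, k=10):
    """Fan the query out to every shard and merge the partial result lists."""
    merged = [hit for shard in shards for hit in shard.search(query)]
    return sorted(merged, key=lambda hit: hit[1], reverse=True)[:k]


# Documents are spread across shards by doc ID, so each machine only has to
# keep a fraction of the index (eventually held entirely in memory).
shards = [DocShard() for _ in range(4)]
corpus = {1: "large scale retrieval", 2: "scale out with shards", 3: "retrieval systems"}
for doc_id, text in corpus.items():
    shards[doc_id % len(shards)].add(doc_id, text)

print(doc_partitioned_search(shards, "scale"))
```

The usual trade-off: every shard touches every query, but each shard's work is small and independent, and losing a machine only drops a slice of the documents from results rather than an entire term range.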
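The indexing-on-MapReduce bullet maps onto the textbook inverted-index example: the map phase emits (term, doc ID) pairs and the reduce phase turns each term's group into a posting list. Below is a single-process simulation of the programming model, not the actual pipeline.

```python
# Single-process simulation of building an inverted index MapReduce-style.
from itertools import groupby
from operator import itemgetter


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per word occurrence."""
    for term in text.lower().split():
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Reduce: collapse a term's doc IDs into a sorted, de-duplicated posting list."""
    return term, sorted(set(doc_ids))


def build_index(corpus):
    # The framework would shuffle and sort between the two phases; here a
    # plain sort + groupby stands in for that step.
    pairs = [p for doc_id, text in corpus.items() for p in map_phase(doc_id, text)]
    pairs.sort(key=itemgetter(0))
    return dict(reduce_phase(term, [doc_id for _, doc_id in group])
                for term, group in groupby(pairs, key=itemgetter(0)))


corpus = {1: "batch indexing with mapreduce", 2: "mapreduce builds the index"}
print(build_index(corpus))   # e.g. {'batch': [1], 'builds': [2], 'index': [2], ...}
```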
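Finally, the hierarchical serving system and the 2007 'super root' can be pictured as a tree: the web server talks only to a root, each interior node fans the query out to its children and merges their partial results, and a 'super root' sitting above the per-vertical trees (web, images, news, ...) produces the blended 'Universal Search' result. The structure and names below are illustrative only.

```python
# Toy sketch of hierarchical serving with a 'super root' over verticals.


class Leaf:
    """A leaf index shard over one slice of one collection."""

    def __init__(self, results):
        self.results = results          # precomputed {query: [(doc, score), ...]}

    def search(self, query):
        return self.results.get(query, [])


class Node:
    """An interior server: forwards the query to its children and merges."""

    def __init__(self, children):
        self.children = children

    def search(self, query, k=10):
        merged = [hit for child in self.children for hit in child.search(query)]
        return sorted(merged, key=lambda hit: hit[1], reverse=True)[:k]


# One serving tree per vertical; the super root blends them for a single query.
web = Node([Leaf({"jeff dean": [("web/doc1", 0.9)]}),
            Leaf({"jeff dean": [("web/doc7", 0.4)]})])
news = Node([Leaf({"jeff dean": [("news/doc3", 0.7)]})])
super_root = Node([web, news])
print(super_root.search("jeff dean"))   # blended results across verticals
```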
Goal: It should be easy to do experiments.
- Start with a new ranking idea
- Generate data to run experiments rapidly using MapReduce, BigTable, etc.
- Run off-line and measure results on (1) human-rated query sets of various kinds and (2) random queries, to look at changes relative to the existing ranking (latency doesn't matter here; a small evaluation sketch follows after this list)
- Experiment on a small random sample of traffic (latency matters!)
- Re-implement/tune the implementation to pre-compute data, making it feasible at full load, and incorporate the necessary data into the index.
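As a rough illustration of the off-line measurement step above, the sketch below scores a ranking function against a human-rated query set with DCG and averages across queries; the metric, rating scale, and data are assumptions for the example, not details from the talk.

```python
# Toy off-line evaluation of a ranker against human-rated queries (DCG assumed).
import math


def dcg(ranked_doc_ids, ratings, k=10):
    """Discounted cumulative gain of the top k results.
    `ratings` maps doc ID -> human relevance grade (missing = 0)."""
    return sum(ratings.get(doc_id, 0) / math.log2(rank + 2)
               for rank, doc_id in enumerate(ranked_doc_ids[:k]))


def offline_eval(ranker, rated_queries):
    """Average DCG of `ranker` over a dict of query -> {doc ID: grade}."""
    scores = [dcg(ranker(query), ratings) for query, ratings in rated_queries.items()]
    return sum(scores) / len(scores)


# Hypothetical comparison of a baseline and an experimental ranking function.
rated_queries = {"wsdm 2009": {101: 3, 205: 1}, "mapreduce": {42: 2}}
baseline = lambda q: [205, 101, 42]
experiment = lambda q: [101, 42, 205]
print(offline_eval(baseline, rated_queries), offline_eval(experiment, rated_queries))
```

Latency is deliberately ignored here; it only starts to matter at the live-traffic step above.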
Future challenges
- Cross-lingual IR: quality and scalability
- Retrieval with a mix of private, semi-private, shared, and public documents
- Automatic construction of efficient IR systems for different needs
Thanks again to Michael! I really enjoyed the notes and I hope there will be more great information to share soon.