Wednesday, February 11

Jeffrey Dean WSDM Keynote: Building a Large Scale Search Engine

If you missed them, be sure to check out the Workshop Day and Matt Lease's notes, including the best papers.

Jeff Dean gave a keynote, Building Large Scale Information Retrieval Systems, at WSDM 2009. My labmate, Michael Bendersky, attended the talk and was kind enough to share his notes. Here are his slightly edited notes.

Update on Mar 10: Jaap Kamps noted that the slides have now been posted online.

The video is now available on VideoLectures.

Change in scale, 1999 - 2009
  • 100x documents
  • 10,000x queries
  • update latency: 10,000x faster
  • query latency: <1s -> <0.2s (a 5x decrease)
Design the search engine for 10x growth; rewrite it at 100x growth. Next, he outlined how Google's crawling and indexing have evolved since the late 90s. Here are a few highlights.

In the late 90s
  • Batch crawling system, stopped when 'enough' pages
  • Batch indexing process stitched together with Unix tools. Prone to machine failure and inconsistencies.
  • The original '97 index format was a simple byte-aligned system that encoded field and presentation information for word 'hits' (word occurrences). It required a lot of disk access.
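To make the byte-aligned 'hit' idea concrete, here is a toy sketch in Python. The exact bit layout below (a title flag, an anchor flag, and a 14-bit word position packed into two bytes) is hypothetical and chosen only for illustration; it is not Google's actual '97 format.

```python
import struct

def encode_hit(position, is_title, is_anchor):
    """Pack one word occurrence ('hit') into two byte-aligned bytes.
    Hypothetical layout: bit 15 = title flag, bit 14 = anchor flag,
    bits 0-13 = word position within the document."""
    assert 0 <= position < (1 << 14)
    value = (int(is_title) << 15) | (int(is_anchor) << 14) | position
    return struct.pack(">H", value)

def decode_hit(data):
    """Unpack a two-byte hit back into (position, is_title, is_anchor)."""
    (value,) = struct.unpack(">H", data)
    return value & 0x3FFF, bool((value >> 15) & 1), bool((value >> 14) & 1)
```

Byte alignment makes encoding and decoding trivially fast, but as the talk notes, fixed-width records like this waste space and force a lot of disk access for common words.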
Soon After
  • Moved to a new block-based variable length index format that used skip tables for words with large numbers of occurrences. This reduced index size by 30% and was faster to decode.
  • Added cache servers for both results and document snippets
  • In early 2001 they moved to an in-memory index where index servers (along with doc servers, cache servers, etc...) talked directly to front-end web servers. (See system diagram on slides)
  • The index was partitioned by document rather than by term.
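A common way to get the kind of size reduction described above is to store each posting list as deltas between sorted doc IDs, encoded as variable-length integers. The sketch below shows that idea only; it omits the skip tables (which let the decoder jump over blocks of postings for frequent words), and it is a generic illustration, not the specific block format Google used.

```python
def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte,
    high bit set on all but the last byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_postings(doc_ids):
    """Delta-encode a sorted list of doc IDs, then varint-encode each delta."""
    out, prev = bytearray(), 0
    for d in doc_ids:
        out += encode_varint(d - prev)
        prev = d
    return bytes(out)

def decode_postings(data):
    """Decode the varint deltas back into absolute doc IDs."""
    ids, prev, i = [], 0, 0
    while i < len(data):
        shift = val = 0
        while True:
            b = data[i]
            i += 1
            val |= (b & 0x7F) << shift
            shift += 7
            if not (b & 0x80):
                break
        prev += val
        ids.append(prev)
    return ids
```

Because deltas between nearby doc IDs are small, most postings fit in one or two bytes instead of a fixed four, which is where the index-size savings come from.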
Recent and current
  • In-house design from the ground up: rack design, pc class motherboards, linux, and in-house software (GFS, BigTable, etc...)
  • Indexing built using MapReduce framework
  • In 2004 they moved to a hierarchical system for serving indexes, built on top of indexes hosted in GFS (now only the "root server" handles requests from the web server).
  • 'Fast' index updates
  • In 2007 they added a 'super root' server that communicates with all of the index servers for the vertical collections, enabling 'Universal Search'
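With a document-partitioned index, every query fans out to all shards and a root server merges the per-shard results. The sketch below shows that fan-out-and-merge pattern in miniature; the shard data structure and term-frequency scoring are made up for illustration and are much simpler than a real serving system.

```python
import heapq

def search_shard(shard_postings, query_term, k):
    """Score the documents in one shard (toy scoring: raw term frequency).
    Returns up to k (negated_score, doc_id) pairs so heapq sorts best-first."""
    hits = shard_postings.get(query_term, {})
    scored = [(-tf, doc_id) for doc_id, tf in hits.items()]
    return heapq.nsmallest(k, scored)

def root_search(shards, query_term, k):
    """Fan a query out to every shard and merge the per-shard top-k lists,
    loosely mimicking a root server over a document-partitioned index."""
    merged = []
    for shard in shards:
        merged.extend(search_shard(shard, query_term, k))
    return [doc_id for _, doc_id in heapq.nsmallest(k, merged)]
```

Partitioning by document (rather than by term) means each shard can score its own documents independently, so the merge step at the root only has to combine small top-k lists.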
How Google experiments with ranking changes
Goal: It should be easy to do experiments.
  1. Start with a new ranking idea
  2. Generate data to run experiments rapidly using MapReduce, BigTable, etc..
  3. Run off-line and measure results on 1) human-rated query sets of various kinds and 2) random queries, to look at changes to the existing ranking (latency doesn't matter)
  4. Repeat...
  5. Experiment on a small random sample of live traffic (latency matters!)
  6. Re-implement/tune implementation to pre-compute data to make it feasible at full load and incorporate necessary data into the index.
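Step 3 above — comparing a new ranking idea against the baseline on human-rated queries — can be sketched as a simple offline evaluation. The code below uses DCG as the quality metric and treats each ranker as a function from (query, candidate docs) to an ordered list; both the interface and the metric choice are my assumptions for illustration, not a description of Google's actual evaluation pipeline.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def offline_eval(rated_queries, baseline, experiment):
    """Compare two ranking functions on human-rated queries.
    `rated_queries` maps query -> {doc_id: relevance_grade}; each ranker
    maps (query, doc_ids) -> ordered doc_ids. Returns the mean DCG delta
    (positive means the experimental ranker looks better offline)."""
    deltas = []
    for query, grades in rated_queries.items():
        docs = list(grades)
        base_dcg = dcg([grades[d] for d in baseline(query, docs)])
        expt_dcg = dcg([grades[d] for d in experiment(query, docs)])
        deltas.append(expt_dcg - base_dcg)
    return sum(deltas) / len(deltas)
```

Because this runs entirely offline against pre-rated data, latency is irrelevant, which is exactly why the talk defers any latency-sensitive tuning until the live-traffic experiment in step 5.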
Future Challenges
  • Cross-Lingual IR - quality and scalability
  • Retrieval with a mix of private, semi-private, shared and public documents
  • Automatic construction of efficient IR systems for different needs
More WSDM coverage to come. If you are attending the conference and you have notes you'd like to share on the sessions I'd love to hear from you.

Thanks again to Michael! I really enjoyed the notes and I hope there will be more great information to share soon.
