Thursday, February 12

WSDM 2009 Best Papers and other Highlights from Matt Lease

(See previous WSDM 2009 coverage here and here.)

Things are a bit crazy at WSDM and, as usual, Internet connectivity is spotty. However, even in his exhausted state, Matt Lease sent me a few highlights. He's been doing some great research with us here at Amherst and is graduating this summer, in case anyone is interested in that sort of thing.

How many CS PhDs does it take to fix a projector?
159: 9 actively, 150 watching.
After 30 minutes they managed to get it working by plugging directly into the projector, which meant presenters had to use hand signals to request slide changes for the rest of the session -- everyone had their own distinct style of coping with the absurdity of the situation.

Best Paper Awards
Best Paper: Integration of News Content into Web Results by Fernando Diaz (congratulations! A recent CIIR alum brings in another best paper award)
Best Student Paper: The Web Changes Everything: Understanding the Dynamics of Web Content by Eytan Adar, Jon Elsas, Jamie Teevan, and Susan Dumais (congratulations Jon and Eytan!)
Best Late Breaking: Time Will Tell: Leveraging Temporal Expressions in IR by Irem Arikan, Srikanta Bedathur and Klaus Berberich

Interesting Links
A new visual interface for browsing (try computer science)
tags.stanford.edu - help collect user-generated tags and search by tag (see their Firefox plugin)

What's Hot
  • Rendering bags of words with frequency shown by font size
  • Tags & IR (Flickr, Delicious, LibraryThing, ...)
  • Mechanical Turk evaluation / annotation
  • Wikipedia (resource)
  • Open Directory Project (categorization)
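The first item on the list -- rendering bags of words with frequency shown by font size -- is simple enough to sketch. Here's a minimal illustration, assuming log scaling between a minimum and maximum point size (the function name and defaults are my own, not from any particular tag-cloud library):

```python
import math
from collections import Counter

def font_sizes(counts, min_pt=10, max_pt=40):
    """Map word frequencies to font sizes, log-scaled so that a handful
    of very frequent words don't dwarf everything else."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        return {w: max_pt for w in counts}
    span = math.log(hi) - math.log(lo)
    return {w: round(min_pt + (max_pt - min_pt) * (math.log(c) - math.log(lo)) / span)
            for w, c in counts.items()}

counts = Counter("web search web data search search web web".split())
sizes = font_sizes(counts)  # most frequent word gets max_pt, rarest gets min_pt
```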
Thanks Matt!

Wednesday, February 11

Jeffrey Dean WSDM Keynote: Building a Large Scale Search Engine

If you missed them, be sure to check out Workshop Day and Matt Lease's notes, including the best paper awards.

Jeff Dean gave a keynote, Building Large Scale Information Retrieval Systems, at WSDM 2009. My labmate, Michael Bendersky, attended the talk and was nice enough to share his notes; here they are, slightly edited.

Update on Mar 10: Jaap Kamps noted that the slides have now been posted online.

The video is now available on VideoLectures.

Change in scale, 1999 - 2009
  • 100x documents
  • 10,000x queries
  • update latency: 10,000x faster
  • query latency: <1s -> <0.2s, a 5x decrease
Design the search engine for 10x growth; rewrite it at 100x growth. Next, he outlined how Google's crawling and indexing have evolved since the late 90s. Here are a few highlights.

In the late 90s
  • Batch crawling system that stopped when it had 'enough' pages
  • Batch indexing process stitched together with Unix tools, prone to machine failures and inconsistencies
  • The original '97 index format was a simple byte-aligned system that encoded field and presentation information for word 'hits' (word occurrences). It required a lot of disk access.
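The byte-aligned hit format is a nice excuse for a concrete sketch. This is purely illustrative Python -- the bit widths, flag names, and two-byte layout below are my own invention, not the actual '97 encoding:

```python
def encode_hit(position: int, is_title: bool, capitalized: bool) -> bytes:
    """Pack one word occurrence ('hit') into exactly two bytes:
    1 bit title flag, 1 bit capitalization flag, 14 bits word position."""
    if position >= 1 << 14:
        position = (1 << 14) - 1  # clamp positions that overflow 14 bits
    packed = (int(is_title) << 15) | (int(capitalized) << 14) | position
    return packed.to_bytes(2, "big")

def decode_hit(raw: bytes) -> dict:
    """Unpack a two-byte hit back into its fields."""
    packed = int.from_bytes(raw, "big")
    return {
        "is_title": bool(packed >> 15),
        "capitalized": bool((packed >> 14) & 1),
        "position": packed & ((1 << 14) - 1),
    }

hit = encode_hit(position=42, is_title=True, capitalized=False)
# Byte alignment makes decoding cheap (no bit-level scanning across bytes),
# at the cost of a fixed 2 bytes per hit.
```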
Soon After
  • Moved to a new block-based variable length index format that used skip tables for words with large numbers of occurrences. This reduced index size by 30% and was faster to decode.
  • Added cache servers for both results and document snippets
  • In early 2001 they moved to an in-memory index where index servers (along with doc servers, cache servers, etc...) talked directly to front-end web servers. (See system diagram on slides)
  • The index was partitioned by document rather than by term.
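To make the "variable length" idea concrete, here's a sketch of classic variable-byte compression over gap-encoded docids -- a standard IR technique, though not necessarily the exact block format Google used:

```python
def vbyte_encode(nums):
    """Variable-byte encode non-negative ints: 7 data bits per byte,
    with the high bit set on the final byte of each number."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)  # terminator byte
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # terminator: finish the current number
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

# Store gaps between sorted docids instead of raw ids: gaps are small,
# so most postings fit in a single byte.
docids = [5, 7, 30, 1000, 1003]
gaps = [docids[0]] + [b - a for a, b in zip(docids, docids[1:])]
compressed = vbyte_encode(gaps)

# Decompress and rebuild the original docids with a running sum.
restored, total = [], 0
for g in vbyte_decode(compressed):
    total += g
    restored.append(total)
assert restored == docids
```

The skip tables mentioned in the talk sit on top of a format like this: for very long posting lists, a small side table of (docid, byte offset) pairs lets the decoder jump over blocks it doesn't need.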
Recent and current
  • In-house design from the ground up: rack design, PC-class motherboards, Linux, and in-house software (GFS, BigTable, etc...)
  • Indexing built using MapReduce framework
  • In 2004 they moved to a hierarchical system for serving indexes built on top of GFS-based hosted indexes (now only the "root server" handles requests from the web server).
  • 'Fast' index updates
  • In 2007 they added a 'super root' server that communicates with all of the index servers for the vertical collections allowing 'Universal Search'
How Google experiments with ranking changes
Goal: It should be easy to do experiments.
  1. Start with a new ranking idea
  2. Generate data to run experiments rapidly using MapReduce, BigTable, etc..
  3. Run off-line and measure results on 1) human-rated query sets of various kinds and 2) on random queries to look at changes to existing ranking (latency doesn't matter)
  4. repeat...
  5. Experiment on a small random sample of live traffic (latency matters!)
  6. Re-implement and tune the implementation to pre-compute data, making it feasible at full load, and incorporate the necessary data into the index.
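The off-line measurement in step 3 can be sketched with a toy example. The DCG metric, ranker names, and data below are my own assumptions for illustration; the talk doesn't specify which metrics or rated sets Google actually uses:

```python
import math

def dcg(ranked_docs, ratings, k=5):
    """Discounted cumulative gain over the top-k results: higher-rated
    documents count more, and earlier positions count more."""
    return sum(ratings.get(doc, 0) / math.log2(i + 2)
               for i, doc in enumerate(ranked_docs[:k]))

# Toy human-rated query set: query -> {doc: graded relevance rating}
rated_queries = {"wsdm 2009": {"d1": 3, "d2": 1, "d3": 0}}

def baseline_ranker(query):       # stand-in for the current ranking
    return ["d3", "d1", "d2"]

def experimental_ranker(query):   # stand-in for the new ranking idea
    return ["d1", "d2", "d3"]

for query, ratings in rated_queries.items():
    old = dcg(baseline_ranker(query), ratings)
    new = dcg(experimental_ranker(query), ratings)
    print(f"{query}: baseline={old:.3f} experiment={new:.3f}")
```

Because this runs entirely off-line against stored ratings, latency doesn't matter yet; speed only becomes a constraint at step 5, when the change faces live traffic.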
Future Challenges
  • Cross-Lingual IR - quality and scalability
  • Retrieval with a mix of private, semi-private, shared and public documents
  • Automatic Construction of efficient IR systems for different needs
More WSDM coverage to come. If you are attending the conference and you have notes you'd like to share on the sessions I'd love to hear from you.

Thanks again to Michael! I really enjoyed the notes and I hope there will be more great information to share soon.

Monday, February 9

WSDM 2009 Workshop Day

Update: See also the ongoing coverage: the Jeff Dean keynote and the best paper awards and other highlights.

Today is workshop day at Web Search and Data Mining 2009 in Barcelona. I'm not attending, but there are several attendees from the lab who I've recruited for coverage. WSDM is a young conference, but the quality of the papers is very high; it already rivals the other major conferences: SIGIR, CIKM, and WWW. WSDM has a good blend of academic and industry-focused research. There are two workshops being held today:

ESAIR 2009 - The Second Workshop on Exploiting Semantic Annotations in Information Retrieval. The link above has a copy of the proceedings with all of the papers.

WSCD09 - Workshop on Web Search Click Data 2009
From our lab, Michael Bendersky will be presenting work analyzing long queries, Analysis of Long Queries in a Large Scale Search Log. Workshop participants were granted access to a one-month sample of 15 million queries from Microsoft's search engine logs to analyze.

More to come...