Monday, July 7

Susan Dumais SIGIR 2014 ACM Athena Award Lecture

Sue Dumais
 - Introduced by Marti Hearst
http://research.microsoft.com/en-us/um/people/sdumais/

Putting the searcher back into search

The Changing IR landscape
 - from an arcane skill for librarians and computer geeks to the fabric of everyone's lives

 - How should the evaluation metrics be enriched to characterize the diversity of modern information systems?

How far we have come....
 - The web was a nascent thing 20 years ago.  
 - In June '94, there were 2.7k websites (13.5% were .com)
 - Mosaic was one year old
 - Search in 1994: 17th SIGIR;  text categorization, relevance feedback and query expansion
 - TREC was 2.5 years old (several hundred thousand newswire and Federal Register documents)
 - TREC 2 and 3, the first work on learning to rank
 - Lycos debuted (Michael "Fuzzy" Mauldin); # web pages: 54k (first 128 characters of each)
   --> grew to 400k pages, then tens of millions
   --> the rise of Infoseek, AltaVista
 - Behavioral logs: #queries/day: 1.5k

Today, search is everywhere
 - Trillions of webpages
 - Billions of searches and clicks per day
 - A core part of everyday life; the most popular activity on the web. 
 - We should be proud, but... search can be so much more.

Search still fails 20-25% of the time.  And you often invest far more effort than you should.  And once you find an item, there is little support for doing anything further with it. 
- Requires both great results and great experiences (understanding users and whether they are satisfied)

Where are the Searchers in Search?
 - A search box to results
 - But, queries don't fall from the sky in an IID fashion; they come from people trying to perform tasks at a particular place and time. 
 - Search is not the end goal; people search because they are trying to accomplish something. 

Evaluation
 - Cranfield style test collections
 - "A ruthless abstraction of the user .."
 - There is still tremendous variability across topics. 
 - What's missing?
  --> Characterization of queries/tasks
        -- How are they selected?  What can we generalize to?
  --> We do not tend to have searcher-centered metrics
  --> Rich models of searchers
  --> Presentation and interaction


Evaluating search systems

Kinds of behavioral data
 - Lab studies  (detailed instrumentation and interaction)
 - Panel studies (in the wild; 100s to 1000s; special search client)
 - Log studies (millions of people; in the wild, unlabeled) - provides what, not why

Observational study
 - look at how people interact with an existing system; build a model of behavior.

Experimental studies
 - compare existing systems; goal: decide if one approach is better than another

Surprises in (Early) web search logs
 - Early log analysis...
  --> Excite logs in 1997, 1999
  --> Silverstein et al. 1998, 2002
  --> web search != library search
  --> Queries are very short, 2.4 words
  --> Lots of people search about sex
  --> Searches are related (sessions)

Queries not equally likely
 - Excite 1999; ~2.5 million queries
 - the top 250 queries account for 10% of all queries (see the sketch after this list)
 - almost a million occurred once
 - top 10: sex, yahoo, chat, horoscope, pokemon, hotmail, games, mp3, weather, ebay
 - tail: acm98; win2k compliance; gold coast newspaper
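
The head/tail skew above is easy to reproduce on any query log. A minimal Python sketch, assuming the log is simply an iterable of raw query strings (not the actual Excite analysis):

```python
from collections import Counter

def head_tail_stats(queries, head_size=250):
    """Summarize how skewed a query log is: share of traffic covered by the
    most frequent queries vs. the long tail of one-off queries."""
    counts = Counter(q.strip().lower() for q in queries)
    total = sum(counts.values())
    head = counts.most_common(head_size)
    head_share = sum(c for _, c in head) / total     # ~10% for top 250 in Excite '99
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "distinct_queries": len(counts),
        "total_queries": total,
        "head_share": head_share,
        "queries_seen_once": singletons,             # close to a million in Excite '99
    }
```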

Queries vary over time and task
 - Periodicities, trends, events
 - trends: e.g., Tesla; repeated patterns: e.g., pizza on Saturday (see the day-of-week sketch below)
 - What's relevant to the queries changes over time (World Cup) -- What's going on now!
 - Task/Individual - 60% of queries occur in a session
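
A toy illustration of spotting the "pizza on Saturday" kind of weekly pattern, assuming a hypothetical list of ISO-8601 timestamps for one query's impressions:

```python
from collections import Counter
from datetime import datetime

def weekday_profile(timestamps):
    """Day-of-week share of a single query's volume, from ISO-8601 timestamps."""
    counts = Counter(datetime.fromisoformat(ts).strftime("%a") for ts in timestamps)
    total = sum(counts.values()) or 1
    return {day: counts.get(day, 0) / total
            for day in ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]}

# e.g. weekday_profile(["2014-07-05T18:03:00", "2014-07-05T19:10:00", "2014-07-07T12:00:00"])
# -> a profile peaking on Saturday for a "pizza"-like query
```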

What can logs tell us?
 - query frequency
 - patterns
 - click behavior

Experiments are the lifeblood of web systems (a minimal significance-test sketch follows this list)
 - for every imaginable system variation (ranking, snippets, fonts, latency)
 - if I have 100M dollars to spend, what is important?
 - Basic questions:  What do you want to evaluate?  What are the metrics?
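
For illustration, a minimal two-proportion z-test on click-through rate, about the simplest possible online A/B comparison (real search experiments use far richer metrics, guardrails, and interleaving designs; the numbers in the comment are made up):

```python
from math import sqrt
from statistics import NormalDist

def ab_ctr_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Two-sided two-proportion z-test comparing CTR of control (A) and treatment (B)."""
    p_a = clicks_a / impressions_a
    p_b = clicks_b / impressions_b
    p_pool = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / impressions_a + 1 / impressions_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value

# ab_ctr_test(4_800, 100_000, 5_050, 100_000) -> z ≈ 2.6, p ≈ 0.01
```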

Uses of behavioral logs
 - Often surprising insights about how people interact with search systems
 - Suggest experiments

How do you go from 2.4 words to great results? 
 -> Lots of log data driving important features (query suggestion, autocompletion); see the sketch below
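
A toy sketch of frequency-ranked autocompletion from a query log (hypothetical data; real systems use tries, personalization, and much more):

```python
from collections import Counter

def build_suggester(query_log):
    """Return a suggest(prefix) function that ranks past queries by frequency."""
    counts = Counter(q.strip().lower() for q in query_log)

    def suggest(prefix, k=5):
        prefix = prefix.strip().lower()
        matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
        return [q for q, _ in sorted(matches, key=lambda qc: -qc[1])[:k]]

    return suggest

suggest = build_suggester(["world cup", "world cup schedule", "weather", "world cup"])
print(suggest("wor"))   # ['world cup', 'world cup schedule']
```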

What can't they tell us?
 - Behavior can mean many things
 - Difficult to explore new systems

Web search != library search
 - Traditional "information needs" do not describe web searcher behavior
 - Broder 2002, from AltaVista logs (the navigational / informational / transactional taxonomy)
 - They ran a pop-up survey, Jun-Nov 2001

Desktop search != web search
 - desktop search, circa 2000
 - Stuff I've Seen
 - Example searches:  recent email from Fedor that contained a link to his new demo; query: Fedor
   a PDF of a SIGIR paper on context and ranking sent a month ago; query: SIGIR
 - Deployed 6 versions of the system
 -> Queries: very short;  Date was by FAR the most common sort order
 -> People seldom switch from the default, but here they switched from best match to Date; e.g., "the information from James" -- people remember a rough time
 -> People didn't care about the exact type of file; they cared that it was an image. 
 -> More re-finding than finding, more metadata than best match driven
 -> People remember attributes, seldom the details, only the general topic
 --> Rich client-side interface; every time we move into a new area, it has characteristics that are very different from previous generations of search

Contextual Retrieval
 - One size does not fit all
  --> SIGIR  (who's asking, where are they, what have they done in the past)
 - Queries are difficult to interpret in isolation
 - SIGIR - information retrieval vs. special inspector general for iraq reconstruction
 - A single ranking severely limits the potential because different people have different notions of relevance

Potential for Personalization
http://research.microsoft.com/en-us/um/people/horvitz/teevan_dumais_horvitz_tochi_2010.pdf
 - Framework to quantify the variation of relevance for the same query across individuals (Teevan et al., ToCHI 2010)
 - Regardless of how you measure it, there is tremendous potential, and it varies widely across queries (see the toy sketch below)
 - 46% potential increase in ranking quality
 - 70% if we take into account individual notions of relevance
 - Need to model individuals in different ways
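
A toy illustration of the group-vs-individual gap (not the ToCHI 2010 protocol): given explicit per-user gains for one query, compare the best single ranking for the whole group against ideal per-user rankings using DCG.

```python
from math import log2

def dcg(gains):
    """Discounted cumulative gain of a list of gains in rank order."""
    return sum(g / log2(i + 2) for i, g in enumerate(gains))

def personalization_potential(judgments):
    """judgments: {user: {doc: gain}} for a single query (made-up format)."""
    docs = {d for per_user in judgments.values() for d in per_user}
    # best single ranking for the group: order docs by total gain across users
    group_order = sorted(docs, key=lambda d: -sum(j.get(d, 0) for j in judgments.values()))
    group_dcg = sum(dcg([j.get(d, 0) for d in group_order]) for j in judgments.values())
    # ideal personalized rankings: each user's own gains in descending order
    ideal_dcg = sum(dcg(sorted(j.values(), reverse=True)) for j in judgments.values())
    return group_dcg, ideal_dcg   # the gap is the potential for personalization
```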

Personal navigation
 - Teevan et al. SIGIR 2007, Tyler and Teevan WSDM 2010
 - Re-finding in web search; 33% are queries you've issued before
 - 39% of clicks are on things you've visited before
 - "Personal" navigation queries
 --> 15% of queries 
 --> simple 12-line algorithm (sketched roughly below)
 --> If you issued a query twice and clicked on only one link each time, you are 95% likely to do it again
 - Led to successful online A/B experiments
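
A rough sketch of the personal-navigation rule as described (made-up data format, not the deployed algorithm): if a user has issued a query at least twice and clicked exactly the same single result each time, predict that result and promote it.

```python
from collections import defaultdict

def personal_navigation_targets(history):
    """history: iterable of (user, query, clicked_urls) tuples from past sessions."""
    seen = defaultdict(list)                      # (user, query) -> list of click sets
    for user, query, clicks in history:
        seen[(user, query)].append(frozenset(clicks))
    targets = {}
    for key, click_sets in seen.items():
        urls = set().union(*click_sets)
        if (len(click_sets) >= 2 and len(urls) == 1
                and all(len(c) == 1 for c in click_sets)):
            targets[key] = next(iter(urls))       # ~95% likely to be clicked again
    return targets

# personal_navigation_targets([("u1", "sigir", {"sigir.org"}),
#                              ("u1", "sigir", {"sigir.org"})])
# -> {("u1", "sigir"): "sigir.org"}
```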

Adaptive Ranking
Bennett et al. SIGIR 2012
 - Queries don't occur in isolation
 - 60% of sessions contain multiple queries
 - 50% of search time occurs in sessions lasting 30+ mins (infrequent, but important)
 - 15% of tasks continue across sessions

User Model
 - specific queries, URLs, topic distributions
 - Session (short) +25%
 - Historic (long) +45%
 - combinations: +65-75%

- By the third query in a session, just pay attention to what is happening now (see the blending sketch below)
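
A minimal sketch of blending short-term (session) and long-term (historic) user models as topic distributions and re-scoring results against the blend; the topic names and interpolation weight are invented for illustration.

```python
def blend_user_model(session_topics, historic_topics, w_session=0.7):
    """Interpolate two topic distributions; weight the session more as it grows."""
    topics = set(session_topics) | set(historic_topics)
    return {t: w_session * session_topics.get(t, 0.0)
               + (1 - w_session) * historic_topics.get(t, 0.0)
            for t in topics}

def rerank(results, user_model):
    """Sort results by dot-product similarity between their topics and the user model."""
    score = lambda r: sum(p * user_model.get(t, 0.0) for t, p in r["topics"].items())
    return sorted(results, key=score, reverse=True)

# model = blend_user_model({"sports": 0.8, "travel": 0.2}, {"computing": 0.6, "sports": 0.4})
# rerank([{"url": "a", "topics": {"sports": 0.9}},
#         {"url": "b", "topics": {"computing": 0.9}}], model)   # "a" ranks first
```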

Summary
 - We have complementary methods to understand and model searchers
 - Especially important in new search domains and in accommodating the variability we see across people and tasks in the real world

Future
 - Growing importance of spatio-temporal context (here, now)
 - Richer representations and dialogs
  --> e.g. knowledge graphs
 - More proactive search (especially in mobile)
 - Tighter coupling of digital and physical worlds 
 - Computational platforms that couple human and algorithmic components
 - If search doesn't work for people, it doesn't work; Let's make sure it does!!!

We need to extend our evaluation methodologies to handle the diversity of searchers, tasks, and interactivity.

Disclaimer: The views expressed here are purely mine and do not reflect those of Google in any way.