- Introduced by Marti Hearst
Putting the searcher back into search
The Changing IR landscape
- from an arcane skill for librarians and computer geeks to the fabric of everyone's lives
- How should the evaluation metrics be enriched to characterize the diversity of modern information systems?
How far we have come....
- The web was a nascent thing 20 years ago.
- In June '94, there were 2.7k websites (13.5% were .com)
- Mosaic was one year old
- Search in 1994: 17th SIGIR; text categorization, relevance feedback and query expansion
- TREC was 2.5 years old (several hundred k newswire; federal register)
- TREC 2 and 3, the first work on learning to rank
- When Lycos debuted (Michael "Fuzzy" Mauldin), it indexed 54k web pages (first 128 characters of each)
--> grew to 400k pages, then to 10s of millions
--> The rise of Infoseek, AltaVista
- Behavioral logs: #queries/day: 1.5k
Today, search is everywhere
- Trillions of webpages
- Billions of searches and clicks per day
- A core part of everyday life; the most popular activity on the web.
- We should be proud, but... search can be so much more.
Search still fails 20-25% of the time, and you often invest far more effort than you should. And once you find an item, there is no opportunity to do anything further with it.
- Requires both great results and great experiences (understanding users and whether they are satisfied)
Where are the Searchers in Search?
- A search box to results
- But, queries don't fall from the sky in an IID fashion; they come from people trying to perform tasks at a particular place and time.
- Search is not the end goal; people search because they are trying to accomplish something.
- Cranfield style test collections
- "A ruthless abstraction of the user .."
- There is still tremendous variability across topics.
- What's missing?
--> Characterization of queries/tasks
--> How are they selected? How far can we generalize?
--> We do not tend to have searcher-centered metrics
--> Rich models of searchers
--> Presentation and interaction
Evaluating search systems
Kinds of behavioral data
- Lab studies (detailed instrumentation and interaction)
- Panel studies (in the wild; 100s to 1000s; special search client)
- Log studies (millions of people; in the wild, unlabeled) - provides what, not why
- look at how people interact with an existing system; build a model of behavior.
- compare existing systems; goal: decide if one approach is better than another
Surprises in (Early) web search logs
- Early log analysis...
--> Excite logs in 1997, 1999
--> Silverstein et al. 1998, 2002
--> web search != library search
--> Queries are very short, 2.4 words
--> Lots of people search about sex
--> Searches are related (sessions)
Queries not equally likely
- Excite 1999: ~2.5 million queries
- top 250 queries account for 10% of query volume
- almost a million occurred only once (see the sketch after this list)
- top 10: sex, yahoo, chat, horoscope, pokemon, hotmail, games, mp3, weather, ebay
- tail: acm98; win2k compliance; gold coast newspaper
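A minimal sketch of this kind of head/tail analysis, assuming a hypothetical log file with one query string per line (the filename and format are illustrative):

```python
# Head/tail analysis of a query log: how much volume do the top-k queries
# cover, and how many queries occur exactly once? ("queries.txt" is a
# hypothetical file with one query per line.)
from collections import Counter

def head_tail_stats(queries, k=250):
    counts = Counter(q.strip().lower() for q in queries)
    total = sum(counts.values())
    head_share = sum(c for _, c in counts.most_common(k)) / total
    singletons = sum(1 for c in counts.values() if c == 1)
    return head_share, singletons, counts.most_common(10)

with open("queries.txt") as f:
    head_share, singletons, top10 = head_tail_stats(f)
print(f"top-250 share: {head_share:.1%}; queries seen once: {singletons:,}")
```

On the Excite data above, the top-250 share comes out near 10%, with roughly a million singleton queries.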
Queries vary over time and task
- Periodicities, trends, events
- trends (e.g., Tesla); repeated patterns (e.g., pizza on Saturday)
- What's relevant to a query changes over time (e.g., the World Cup) -- what's going on now!
- Task/Individual - 60% of queries occur in a session
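Session boundaries have to be inferred from raw logs. A minimal sketch, assuming timestamped (time, query) events per user and the conventional 30-minute inactivity cutoff (the cutoff is a common heuristic, not something the talk specifies):

```python
# Split one user's time-ordered query stream into sessions using a
# 30-minute inactivity timeout (a common log-analysis heuristic).
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(events):
    """events: list of (timestamp, query) tuples, sorted by timestamp."""
    sessions, current = [], []
    for ts, query in events:
        if current and ts - current[-1][0] > SESSION_GAP:
            sessions.append(current)   # gap too long: close the session
            current = []
        current.append((ts, query))
    if current:
        sessions.append(current)
    return sessions
```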
What can logs tell us?
- query frequency
- click behavior
- Experiments are the lifeblood of web systems (see the sketch after this list)
- run for every imaginable system variation (ranking, snippets, fonts, latency)
- if I had $100M to spend, what would be important?
- Basic questions: What do you want to evaluate? What are the metrics?
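As a concrete example of the comparison step, here is a minimal sketch that tests whether variant B's click-through rate differs from variant A's using a two-proportion z-test; the counts are made up, and real online experiments involve far more care (triggering, power, multiple comparisons):

```python
# Two-proportion z-test on click-through rates from an A/B experiment.
from math import sqrt, erf

def two_proportion_ztest(clicks_a, n_a, clicks_b, n_b):
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value from the normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = two_proportion_ztest(clicks_a=4100, n_a=100_000,
                            clicks_b=4350, n_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```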
Uses of behavioral logs
- Often surprising insights about how people interact with search systems
- Suggest experiments
How do you go from 2.4 words to great results?
-> Lots of log data drives important features (query suggestion, autocompletion; see the sketch below)
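A toy sketch of the autocompletion case: suggest the most frequent logged queries that extend the typed prefix (the tiny in-memory log is an assumption; production systems use tries and many more signals):

```python
# Log-driven query autocompletion: rank past queries that extend the prefix
# by their frequency in the (toy, assumed) log.
from collections import Counter

log = Counter(["weather seattle", "weather boston", "weather seattle",
               "pokemon cards", "pokemon"])

def autocomplete(prefix, k=3):
    hits = [(q, c) for q, c in log.items() if q.startswith(prefix)]
    return [q for q, _ in sorted(hits, key=lambda qc: -qc[1])[:k]]

print(autocomplete("wea"))  # ['weather seattle', 'weather boston']
```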
What can't logs tell us?
- Behavior can mean many things
- Difficult to explore new systems
Web search != library search
- Traditional "information needs" do not describe web searcher behavior
- Broder 2002, from AltaVista logs
- based on a pop-up survey run June-November 2001
Desktop search != web search
- desktop search, circa 2000
- Stuff I've Seen
- Example searches:
--> a recent email from Fedor that contained a link to his new demo; query: Fedor
--> the PDF of a SIGIR paper on context and ranking, sent a month ago; query: SIGIR
- Deployed 6 versions of the system
-> Queries: very short; Date was by FAR the most common sort order
-> People seldom switch from the default sort, but they did switch from best match to Date; for a need like "the information from James", people remember a rough time
-> People didn't care about the exact type of a file; they cared that it was an image
-> More re-finding than finding, more metadata-driven than best-match-driven
-> People remember attributes and the general topic, seldom the details
--> Rich client-side interface; every time we move into a new search area, it has characteristics very different from earlier generations of search (see the sketch below)
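A toy sketch in the spirit of that finding: filter personal items by remembered attributes (sender, kind, rough time) and sort by date rather than by a relevance score. The Item fields and sample data are illustrative assumptions, not the actual Stuff I've Seen implementation:

```python
# Metadata-driven re-finding: filter by remembered attributes, sort by date.
from dataclasses import dataclass
from datetime import date

@dataclass
class Item:
    title: str
    sender: str
    kind: str    # "email", "pdf", "image", ...
    when: date

def refind(items, sender=None, kind=None, within_days=None,
           today=date(2014, 7, 8)):
    hits = [it for it in items
            if (sender is None or it.sender == sender)
            and (kind is None or it.kind == kind)
            and (within_days is None or (today - it.when).days <= within_days)]
    return sorted(hits, key=lambda it: it.when, reverse=True)  # date, not score

inbox = [Item("new demo link", "Fedor", "email", date(2014, 7, 1)),
         Item("paper on context and ranking", "colleague", "pdf", date(2014, 6, 5))]
print(refind(inbox, sender="Fedor", within_days=14)[0].title)
```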
- One size does not fit all
--> SIGIR (who's asking, where are they, what have they done in the past)
- Queries are difficult to interpret in isolation
- SIGIR - information retrieval vs. Special Inspector General for Iraq Reconstruction
- A single ranking severely limits the potential because different people have different notions of relevance
Potential for Personalization
- Framework to quantify the variation of relevance for the same query across individuals (Teevan et al., ToCHI 2010)
- Regardless of how you measure it, there is tremendous potential; it varies widely across different queries
- 46% potential increase in ranking quality
- 70% if we take into account individual notions of relevance (see the sketch after this list)
- Need to model individuals in different ways
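A minimal sketch of the idea behind the framework: measure the gap between the best single ranking for a group and each person's own ideal ranking, using nDCG. The per-user gains below are made-up judgments for one query; this illustrates the gap, not Teevan et al.'s exact procedure:

```python
# Potential for personalization: best group ranking vs. per-user ideal (nDCG).
from math import log2
from itertools import permutations

def dcg(gains):
    return sum(g / log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, rel):
    ideal = dcg(sorted(rel.values(), reverse=True))
    return dcg([rel[d] for d in ranking]) / ideal if ideal else 0.0

# Assumed per-user relevance for the same query over docs a..d.
users = [{"a": 2, "b": 0, "c": 1, "d": 0},
         {"a": 0, "b": 2, "c": 0, "d": 1},
         {"a": 0, "b": 0, "c": 2, "d": 2}]
docs = ["a", "b", "c", "d"]

best_group = max(permutations(docs),
                 key=lambda r: sum(ndcg(list(r), u) for u in users))
group_score = sum(ndcg(list(best_group), u) for u in users) / len(users)
print(f"group nDCG {group_score:.2f} vs. personalized 1.00")
```

The shortfall of the group ranking relative to 1.0 is the potential gain from personalization.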
- Teevan et al. SIGIR 2007, Tyler and Teevan WSDM 2010
- Re-finding in web search: 33% of queries are ones you've issued before
- 39% of clicks are on results you've visited before
- "Personal" navigation queries
--> 15% of queries
--> a simple 12-line algorithm
--> If you issued a query and clicked on only one link, twice, you are 95% likely to do it again (see the sketch after this list)
- Resulted in successful online A/B experiments
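A minimal sketch of that heuristic, with assumed data structures (the talk describes the rule, not this code):

```python
# "Personal navigation": if the user's last two issues of a query each led to
# exactly the same single click, predict that URL next time.
from collections import defaultdict

history = defaultdict(list)  # (user, query) -> list of click sets

def record(user, query, clicked_urls):
    history[(user, query)].append(set(clicked_urls))

def personal_nav_target(user, query):
    past = history[(user, query)][-2:]
    if len(past) == 2 and past[0] == past[1] and len(past[0]) == 1:
        return next(iter(past[0]))  # per the talk, ~95% likely to repeat
    return None

record("u1", "sigir", {"http://sigir.org"})
record("u1", "sigir", {"http://sigir.org"})
print(personal_nav_target("u1", "sigir"))  # http://sigir.org
```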
Bennett et al. SIGIR 2012
- Queries don't occur in isolation
- 60% of sessions contain multiple queries
- 50% of search time is spent in sessions lasting 30+ minutes (infrequent, but important)
- 15% of tasks continue across sessions
- context features: specific queries, URLs, topic distributions
- Session (short-term) context: +25%
- Historic (long-term) context: +45%
- Combinations: +65-75%
- By the third query in a session, just pay attention to what is happening now (see the sketch below)
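A minimal sketch of blending short- and long-term context for re-ranking, with the session weight ramping up so that by the third query the session dominates. The profile representation and weighting schedule are assumptions, not the paper's model:

```python
# Blend session (short-term) and historic (long-term) topic profiles, then
# re-rank results by their match to the blended profile.

def blend(session_profile, historic_profile, queries_in_session):
    # Session weight: 0.5 at query 1, 0.75 at query 2, 1.0 from query 3 on.
    w = min(1.0, 0.25 + 0.25 * queries_in_session)
    topics = set(session_profile) | set(historic_profile)
    return {t: w * session_profile.get(t, 0.0)
               + (1 - w) * historic_profile.get(t, 0.0) for t in topics}

def rerank(results, profile):
    """results: list of (url, {topic: score}) pairs."""
    return sorted(results,
                  key=lambda r: sum(profile.get(t, 0.0) * s
                                    for t, s in r[1].items()),
                  reverse=True)
```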
- We have complementary methods to understand and model searchers
- Especially important in new search domains and in accommodating the variability we see across people and tasks in the real world
- Growing importance of spatio-temporal context (here, now)
- Richer representations and dialogs
--> e.g. knowledge graphs
- More proactive search (especially in mobile)
- Tighter coupling of digital and physical worlds
- Computational platforms that couple human and algorithmic components
- If search doesn't work for people, it doesn't work; Let's make sure it does!!!
We need to extend our evaluation methodologies to handle the diversity of searchers, tasks, and interactivity.
Disclaimer: The views here expressed are purely mine and do not reflect those of Google in any way.