Thursday, July 23

SIGIR Evaluation Workshop: Evaluating IR in Situ by Susan Dumais


IR systems are developed to help people find information to satisfy their needs.

Success depends on two general components:
- The matching system (content and ranking)
- The user interface and interaction

* Data is a critical resource!

$$ You have won $10 million $$
* Challenge: You have been asked to lead a team to improve a big web search engine. How do you spend it?
Ideas: study what users are doing and how to help them. Work out what you are doing badly and do better. ... You need to understand what "improve" means.

* Content
- ranking, crawling, spam detection
* User experience
- presentation (speed, layout, snippets, etc.)
- Features like spelling correction, related searches, ...
- Richer capabilities to support query articulation, results analysis...

Depends on:
* What are the problems now?
* What are you trying to optimize?
* What are the costs and effect sizes?
* What are the tradeoffs?
* How do various components combine?

Evaluating Search Systems
* Traditional test collections
- fixed docs, queries, relevance judgments, metrics
- goal: compare systems w/respect to the metric (mean, not variance!)
- Search engines do this, but not just this!
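
A minimal sketch of why the "mean, not variance" caveat matters, assuming average precision as the metric. The query IDs, document IDs, and both system runs are hypothetical toy data, not from the talk. The two systems end up with the same mean score but a very different spread across queries.

```python
from statistics import mean, pvariance


def average_precision(ranked_docs, relevant):
    """AP for one query: sum of precision@k at each relevant hit, divided by |relevant|."""
    hits = 0
    precision_sum = 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0


def summarize(run, qrels):
    """Return (mean AP, population variance of per-query AP) for one system run."""
    scores = [average_precision(run.get(q, []), rel) for q, rel in qrels.items()]
    return mean(scores), pvariance(scores)


# Hypothetical relevance judgments and two hypothetical system runs.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
system_a = {"q1": ["d1", "d3", "d2"], "q2": ["d1", "d3", "d2"]}
system_b = {"q1": ["d1", "d2", "d3"], "q2": ["d1", "d2", "d3"]}

for name, run in (("A", system_a), ("B", system_b)):
    map_score, variance = summarize(run, qrels)
    print(f"system {name}: MAP={map_score:.3f} variance={variance:.3f}")
```

A comparison on the mean alone would call these two systems equal, even though system A is perfect on one query and poor on the other.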

* What's missing
- Metrics: user model, average precision, all queries are equal
- Queries: types of queries, history of queries (session and longer; trends!)
- Docs: the "set" of documents (duplicates, site collapsing, diversity, etc.); documents are interdependent
- Selection: the nature and dynamics of queries, documents and users
- Users: individual differences (location, personalization, including re-finding), iteration, and interaction
- Presentation: snippets, speed, features (spelling, query suggestion), the whole page

Kinds of User Data
* User studies
- Lab setting, controlled tasks, detailed instrumentation (including gaze, video), nuanced interpretation of behavior.

* User panels
- In the wild, user-tasks, reasonable instrumentation, can probe for more detail (put out an instrumented system to hundreds or thousands of users and collect more detail)

* Log analysis (in the large)
- in the wild, user-tasks, no explicit feedback, but lots of implicit indicators. The what vs. the why.

User Studies
- The SIS timeline experiment
- Lab setting, small scale (10-100s of users)
- Months for data
- Known tasks and known outcome (labeled data)
- Detailed logging of queries, URLs visited, scrolling, gaze tracking, video
- Can evaluate experimental prototypes
- Challenges - user sample, behavior w/experimenter present or w/ new features
* You have to pick the questions you want to ask very carefully because it's time consuming and expensive.

User Panels

- Curious Browser toolbar: 3 panels of 2000 people each back in 2005
- Link explicit user judgments w/implicit actions
- In the wild they asked relevance about URLs, sessions
- They measured scrolling, cutting and pasting, what URLs people visited.

- Browser toolbar
- Smallish scale (100-1000s of users)
- Weeks for data
- in the wild, search interleaved with other tasks
- You can probe about specific tasks and success/failure (some labeled data)
- Challenges - user sample, drop out, some alteration of behavior

Log Analysis (in the large)
- E.g. Query-click logs
- Search engine vs. Toolbar

Search Engine - details of your services (results, features, etc...)
Toolbar - broader coverage of sites/services, less detail (this is more interesting than the first)

Millions of users and queries
real-time data
In the wild
Benefits: diversity and dynamics of users, queries, tasks, actions
Challenges: logs are very noisy... they tell you what, not why.
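
A minimal sketch of the kind of aggregate a query-click log supports, assuming a simplified record of (user, query, clicked URL or None). The record layout, queries, and URLs are hypothetical. It reports what happened per query (impressions, clicks, click-through rate) and, as noted above, says nothing about why a query went unclicked.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, query, clicked_url or None if no click).
log = [
    ("u1", "weather", "example.com/forecast"),
    ("u2", "weather", None),
    ("u3", "weather", "example.com/forecast"),
    ("u1", "jaguar", None),
    ("u2", "jaguar", "example.com/cars"),
]


def click_stats(records):
    """Aggregate impressions, clicks, and click-through rate per query."""
    counts = defaultdict(lambda: {"impressions": 0, "clicks": 0})
    for _user, query, clicked_url in records:
        counts[query]["impressions"] += 1
        if clicked_url is not None:
            counts[query]["clicks"] += 1
    return {
        query: {**c, "ctr": c["clicks"] / c["impressions"]}
        for query, c in counts.items()
    }


for query, stats in click_stats(log).items():
    print(query, stats)
```

An unclicked result page is ambiguous: the user may have failed, or may have found the answer in a snippet. The log alone cannot separate the two.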

Sharable Resources?
* User studies / Panel studies
- Data collection infrastructure and instruments
- Perhaps data
* Log analysis - Queries, URLs
- Understanding how users interact with existing systems
- What they are doing; Where they are failing, etc..
Implications for retrieval models, lexical resources, interactive systems.
Lemur Query log toolbar.
-- The big issue is: How do you get users?

Operational systems as an experimental platform
- Can generate logs, but more importantly... A/B testing
- Interleave results from different methods
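
A minimal sketch of interleaving results from two rankers, assuming team-draft interleaving as the method (the talk did not name a specific scheme). The function names, document IDs, and click data are hypothetical. Each shown result is credited to the ranker that contributed it, and clicks are then tallied per ranker.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return the merged list and a doc -> team ("A"/"B") map."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    total_docs = len(set(ranking_a) | set(ranking_b))
    interleaved, team = [], {}
    count = {"A": 0, "B": 0}
    while len(interleaved) < total_docs:
        # The team with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] < count["B"]:
            side = "A"
        elif count["B"] < count["A"]:
            side = "B"
        else:
            side = rng.choice(["A", "B"])
        pick = next((d for d in rankings[side] if d not in team), None)
        if pick is None:
            # This ranker has nothing new left, so the other ranker picks instead.
            side = "B" if side == "A" else "A"
            pick = next(d for d in rankings[side] if d not in team)
        team[pick] = side
        count[side] += 1
        interleaved.append(pick)
    return interleaved, team


def credit_clicks(team, clicked_docs):
    """Count clicks credited to each ranker."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins


mixed, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d2", "d4"], random.Random(0))
print(mixed, team, credit_clicks(team, clicked_docs=[mixed[0]]))
```

Aggregated over many queries, the ranker that collects more credited clicks is preferred. Unlike an A/B split, every user sees results from both rankers on the same query.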

* Can we build a Living Laboratory?
- Web (search APIs, but ranking experiments are somewhat limited)
- If there were concrete proposals she thinks the big three would be willing to open this up more.
- UX perhaps more natural

- Other content resources:
- Wikipedia, Twitter, Scholarly publications, ...
Replicability in the face of changing content, users, queries.

Again: The key is getting users. Someone has to maintain the product.

Closing thoughts
- Today's test collections are very limited with respect to user activities
- Can we develop shared resources to address this?

- Selecting your users is important: what variability are you likely to have? You need to sample the space in a way that's useful for you.
- Robust Track: looked at variance. - What behaviors precede users switching?

- Do we have a hundred or so process abstractions that characterize what users are doing? It's hard to predict what someone is going to do in an interactive setting. (Imagine you want to improve snippets. Start in the laboratory setting with eye tracking. Then ask what behaviors you would observe in the logs if you introduced the change in a live system.)

- There is interesting discussion about abstracting the framework and modeling tasks and what is driving the behavior.
