Wednesday, July 21

SIGIR 2010 Keynote: Donna Harman on Cranfield Paradigm

Is the Cranfield Paradigm Outdated?
by Donna Harman, NIST

Cranfield 1 - (1958 - 1960?)
- Missed most of this due to a late bus.

Cranfield 2 - 1962-1966
Goal: learn what makes a good descriptor
new user model: researcher wanting all documents relevant to their question
Documents: 1,400 recent papers in aeronautical engineering

Questions gathered from authors of the papers, asking for the basic problem the paper addressed and also supplemental questions that could have been put to an information service

Full relevance assessments at 5 levels
- complete answer to a question
- high degree of relevance... necessary for the work
- useful as background
- minimal interest, historical interest only
- no interest

Hundreds of manual experiments using different combinations of index terms, specificity, etc.

Metrics used were recall ratio and precision ratio (set retrieval)
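For readers who have not seen the set-based measures, here is a minimal sketch (my own illustration, not code from the talk) of recall ratio and precision ratio computed over an unranked retrieved set:

```python
def recall_ratio(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

def precision_ratio(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

# Toy example with made-up document ids
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}
print(recall_ratio(retrieved, relevant))     # 2/3
print(precision_ratio(retrieved, relevant))  # 2/4
```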

The results showed that we could just use the words in the documents (title and abstract were used)

Cranfield paradigm (defined)
- Faithfully model a real user application, in this case searching appropriate abstracts with real questions
- have enough documents and queries to allow significant testing of results
- build the collection before the experiments in order to prevent human bias and enable re-usability
- define a metric that reflects the real user

Continuation in SMART project
- Mike Keen spent time at Cornell working on new collections
- (a description of SMART Test Collections)
- They found there was only a 30% agreement between questioners and assessors, but there was no significant difference in how the systems ranked.

Continuation in TREC
- In 1990 DARPA asked NIST to build a new test collection for the TIPSTER project
- User model: intelligence analysts
- large numbers of newspaper articles
- TIPSTER Disk 1 and 2 (mixed short and long documents to force people to focus on length normalization and scale up to full-text from abstracts)
- Topics 1-50 were training topics. Topics 51-100 were created by one person

Relevance Judgments
- Used pooling (took the top 100 docs from each run).
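A rough sketch of the pooling step as described (illustrative only; the run names and rankings below are made up): take the top k documents from each submitted run and judge the union.

```python
def build_pool(runs, k=100):
    """runs: dict mapping run name -> ranked list of doc ids (best first).
    Returns the union of each run's top-k documents, i.e. what gets judged."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

# Toy example; real TREC pools combine dozens of runs per topic.
runs = {
    "run_A": ["d3", "d1", "d9", "d4"],
    "run_B": ["d1", "d7", "d3", "d2"],
}
print(sorted(build_pool(runs, k=3)))  # ['d1', 'd3', 'd7', 'd9']
```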

Overlap for 8 years of ad hoc
- The queries from TREC-1 to TREC-8 got progressively narrower, with fewer relevant documents

What is relevant?
- Back to the user model
- A document is relevant if you would use it in a report in some manner
- This means that even if only one sentence is useful, the document is relevant
- "Duplicates" are also relevant, as it would be very difficult to define and remove them

How complete is the relevant set? (TIPSTER)
- some relevant documents are not in pools
- But, lack of bias in pools is crucial so that systems that don't contribute to the pool can be fairly judged

Other relevance issues (TIPSTER)
- Relevance is time- and user-dependent
- learning issues, novelty issues
- user profile issues such as prior knowledge, reason for doing the search, etc.
- TREC picked the broadest definition of relevance for several reasons
- it fit the user model well
- it was well-defined and thus likely to be followed
- thousands of documents must be judged quickly (300 documents per hour; see the back-of-the-envelope sketch after this list)
- (Keep these lessons in mind when using Mechanical Turk)
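As promised above, a back-of-the-envelope on why judging speed matters. Only the 300 documents/hour rate comes from the talk; the pool size and topic count here are my own assumptions:

```python
docs_per_topic = 1500   # assumed unique pooled documents per topic (after overlap)
topics = 50             # assumed size of the topic set
rate = 300              # documents judged per hour (figure quoted in the talk)

hours = docs_per_topic * topics / rate
print(f"about {hours:.0f} assessor-hours")  # about 250 assessor-hours per collection
```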

TREC Genomics Track
- User Model: medical researchers working with MEDLINE and full-text journals

Topics: Started with a user survey looking for questions
- Included topics based on 4 generic topic templates, instantiated from real user requests
System response
- ranked list of up to 1000 passages (pieces of paragraphs)

TREC Legal Track
- Very dependent on the user model. It is modeled after actual legal discovery practice, with topics and relevance judgments done by lawyers
- Documents: 7 million messy XML records on tobacco
- Topics: hypothetical complaints
- Relevance judgments: from pool created by sampling
- Metrics: set retrieval, F at cutoff k (see the sketch after this list)
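Since the Legal track's set-based F at a cutoff k may be unfamiliar, here is a minimal sketch of F1 at k for a fully judged ranked list (my illustration; the actual track estimates precision and recall from sampled judgments):

```python
def f1_at_k(ranking, relevant, k):
    """ranking: list of doc ids (best first); relevant: set of relevant ids."""
    top_k = ranking[:k]
    hits = sum(1 for d in top_k if d in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example with made-up document ids
print(f1_at_k(["d1", "d5", "d2", "d9"], {"d1", "d2", "d3"}, k=3))  # ~0.667
```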

Others: NTCIR, ImageCLEF, INEX - the requirements are all determined by the user model

TREC Web Tracks
- Initially used ad hoc user model, just scaled up to 100 GB
- Then scaled to 426 GB
- judgments unlikely to be complete
- possible bias in relevant documents

Cranfield Paradigm outdated??
- Faithfully model a real user application
- However, we need to think outside the current implementations of Cranfield paradigm to find new user models for the web

User Tasks and Types
- TREC-6, Allan et al. on ranked lists vs. visualization
- Bhavnani, TREC 2001: medical librarians and CS students
- White, Dumais, and Teevan: large-scale log studies looking at how domain experts search, e.g. vocabulary, resources, etc.
- Alonso and Mizzaro SIGIR '09 -- Interesting results on what users find important qualities of result sets
- Lin & Smucker, SIGIR '09: PubMed study
- Using logs to determine goals: Rose and Levinson, WWW 2004 manually classified search goals from the Y! logs
- Others...
- Guo, White, Dumais, Wang & Anderson at RIAO 2010: predicting query performance based on user interaction features

Diversity study using logs
Clough et al., SIGIR '09 poster and work in WSCD '09, studying diversity and ambiguity in the MS log
- Size of the Wikipedia article on the topic and query reformulations indicated diversity
Bendersky & Croft WSCD'09 - work on describing long queries.

How can we apply these lessons?

Ad hoc experiments must continue
- There are many different access needs that are basically traditional ad hoc retrieval: specific tasks, long queries, etc.
- Scores in Robust, etc. are still not good. We know there are "easy" things that could be done to improve results significantly: better term weighting, stemming, relevance feedback, etc.
- However, we need to think more about other information access methods, especially on the web/mobile phone, etc.

ClueWeb 09
- If we are going to do "ad hoc" retrieval, where can we get "enough" of the "right" topics?
- How do we get relevance judgments; is it possible to sample and still be "reusable"?
- Is reusability important; how do we reconcile the fact that users only look at the top (the web user model) with the reusability of a collection?
- search engines only judge the very top

What else should we look at in the web track? Specific subsets of the web.

Retrofit TREC etc. collections

User Simulation
- Lin & Smucker suggested that Cranfield is only one model for user simulation; new test collections could be built for other user models
- We have log studies, plus examples of feature tables from log studies to provide some reality

Cranfield Paradigm not outdated
- We still need to work on ad hoc!
- But, we have to look at new web user models
-- focus on specific web queries where we can contribute (e.g. not 'britney spears')
- We also need to think outside the ranked list mindset; surely that is not all there is!!


  1. Anonymous, 5:06 AM EDT

    Donna Harman, not Harmon

  2. Fixed. Thanks for the correction.

  3. Alonzo and Mizarro SIGIR '09 should be
    Alonso and Mizzaro SIGIR '09 (two corrections)


  4. I appreciate the fixes.

    To keep up with the talk, I go for content over spelling. Sometimes I have time to go back and proofread, but often I don't.

  5. And I appreciate your posts, you're doing a great job. Thanks!