Thursday, July 22

SIGIR 2010 Industry Day: Being Social: Context-aware and Personalized Info. Access

Being Social: Research in Context-aware and Personalized Information Access @ Telefonica
Xavier Amatriain, Karen Church and Josep M. Pujol, Telefónica

Context overload
- the device of the future for information seeking is no longer the desktop
- it is mobile: iPad, mobile phone.
- Mobile phones are "personal"
- Mobile users tend to seek "fresh" content

Where is the nearest florist?
-- this is pretty easy
-- where is that really cool cocktail bar I went to last month? (harder)
-- What about discovery?
-- Interesting things close to me? Events?

Can we improve the search and discovery experience of mobile users using social information?

Social Search Browser - SSB
- Karen Church
- iPhone web application + Facebook app
- displays queries/questions posted by other users in that location
- users can post and interact with queries from others

SSB was a tool for helping and sharing....
A tool for supporting curiosity... an extension of my social network

But!

Crowds are not always wise: predictions are based on large datasets that are sparse and noisy.

User feedback is noisy
- you can trust a rating that says something is excellent, but not necessarily the other way around.

"Trust Us - We're Experts
- "It is really only experts who can reliably account for the decisions"
- The Wisdom of the Few - SIGIR '09

Expert-based CF
An expert = an individual we can trust to have produced thoughtful, consistent, and reliable evaluations (ratings) of items in a given domain.
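A minimal sketch of the expert-based CF idea, based on the public description of the "Wisdom of the Few" approach (SIGIR '09): predict a user's rating as a similarity-weighted average over a small, fixed set of expert raters. The similarity measure, threshold, and data below are illustrative, not the paper's exact choices.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity using the items two raters have in common."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def predict_rating(user_ratings, expert_ratings, item, min_sim=0.1):
    """Predict a user's rating for `item` as a similarity-weighted
    average over the experts who rated that item."""
    num = den = 0.0
    for expert in expert_ratings:
        if item not in expert:
            continue
        sim = cosine_sim(user_ratings, expert)
        if sim < min_sim:
            continue
        num += sim * expert[item]
        den += sim
    return num / den if den else None

# Hypothetical data: ratings are dicts of item -> score.
experts = [{"a": 5, "b": 2, "c": 4}, {"a": 4, "c": 5, "d": 1}]
me = {"a": 5, "b": 1}
print(predict_rating(me, experts, "c"))
```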

Working prototypes
- Music recommendations, mobile geo-located recommendations...

Summary
- Sometimes the experts are better than your direct social network.

SIGIR 2010 Industry Day: Lessons and Challenges from Product Search

Lessons and Challenges from Product Search
Daniel Rose, A9

Different Domains, Different Solutions
- Traditional IR,
- Enterprise search
- Web search
- Product Search
How are the issues different? Let's go back to user goals...

The Goals of Web Search
- "Understanding user goals in web search" paper (WWW 2004). Manually clustered queries until they were stable.
- Done at AltaVista in 2003 (not completely representative queries)
- Most product queries fell into other categories

Why do people search on Amazon?
- When they want to buy something?

Even ignoring the non-buying issues..

The Goals of Product Search
- Depends on where you are in the buying funnel.
-- Top of the funnel: Awareness, then Interest, then Desire, finally Action
St. Elmo Lewis, 1898
- Provide the right tools at the right stage in the process.

[roller coaster]
- toys and games
- sort by average customer review
- sort by price (is actually hard: new vs. used, amazon vs. third-party, etc...)

Different Tools for Different Stages
- Product search shows more fluid movement between searching and browsing behavior (relying on faceted metadata)
- Because of the nature of the search task?
- Because of the interfaces?

What Amazon Queries Look Like
- [which old testament book best represent the chronological structure]
- [shipping rates for amazon]
- [long black underbust corset] - still looking
- vs ISBN number -> about to buy it

(mostly one word, most of them the name of a thing, except "generator")
Top 10 across the US:
(kindle, kindle fire, skyrim, mw3, sonic generations, cars 2)

queries in frequency deciles, by category
US, books, electronics, apparel
--> very diverse: misspellings, miscategorization, all levels of the buying funnel

Context is King
- Some facets for Dresses vs. Digital Cameras
- The problem of facet selection
- Not a one size fits all UI solution for different facet types
- We can interpret your query in a smarter way: for [timberland boots] inside Shoes, timberland is a brand
- [timberland] in Music -> Timbaland (context-dependent spelling correction)

Amazon is a MarketPlace...
- So search must be realtime
-- new products
-- new merchants
-- prices being changed all the time
-- items going in and out of stock all the time

Structured Data: "It's a gift... and a curse"
- Unlike web search, we know the semantics of different bits of text
- We know what fields are important for customers (e.g. brand)
- A large degree of quality control (less adversarial problems)
- We don't have to do sentiment analysis to know if a review is positive/negative

A Curse
- Search engine needs to have both DBMS-like "right answer" behavior and IR-like "best answer" behavior
- Traditional IR mechanisms don't always work well for structured data
-- e.g. naive tf x idf doesn't work well (see BM25F; a rough sketch follows)
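A hedged sketch of the BM25F idea referenced above: combine per-field term frequencies into one field-weighted, length-normalized frequency and apply the BM25 saturation once, rather than scoring each field independently with naive tf x idf. Field names, weights, and parameters here are illustrative (real BM25F typically also uses per-field b parameters).

```python
import math

def bm25f_score(query_terms, doc_fields, field_weights, avg_len,
                idf, k1=1.2, b=0.75):
    """doc_fields: {field: list of tokens}; field_weights: {field: weight};
    avg_len: {field: average field length in the collection};
    idf: {term: idf value}."""
    score = 0.0
    for term in query_terms:
        # Field-weighted, length-normalized pseudo term frequency.
        tf_tilde = 0.0
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            norm = 1.0 + b * (len(tokens) / avg_len[field] - 1.0)
            tf_tilde += field_weights[field] * tf / norm
        # A single BM25-style saturation applied to the combined frequency.
        score += idf.get(term, 0.0) * tf_tilde / (k1 + tf_tilde)
    return score

# Illustrative call: a short, heavily weighted brand field next to a
# huge "search inside the book" style field.
doc = {"brand": ["timberland"], "body": ["boots"] * 3 + ["filler"] * 997}
weights = {"brand": 10.0, "body": 1.0}
avg_len = {"brand": 2.0, "body": 1000.0}
print(bm25f_score(["timberland", "boots"], doc, weights, avg_len,
                  idf={"timberland": 3.0, "boots": 1.5}))
```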

What happens when one of the fields is orders of magnitude bigger than the others?
-- Search inside the book vs. brand name
- What happens when you don't have all the fields all the time? (missing data)
-- ratings, reviews correlate with user satisfaction, but it may not be there

Search Inside the Book
 - how often do you want to surface full-text matches vs. filter them out
 - (example query:  [byte-aligned compression])

Using Behavioral Data
- Powerful source of information for any search engine
- When is using behavioural data an invasion of privacy (or just plain creepy), and when is it better for users?
- Customers of a business seem more comfortable with that business learning from past behavior.

Interpreting Behavioral Signals
Example: Are search result clicks good or bad?
- How many clicks are best?
-- 1: the customer found what they were looking for right away
-- many: they are comparison shopping and looking around at multiple items
-- zero: the search results page already contained all the information necessary
Also, some items are inherently "click attractive", e.g. a book with a sexy cover

- "Why is the web so hard... to evaluate" (from snippet evaluation at Yahoo!) 2004

Evaluating Product Search Relevance
Common argument
-- Customers go to a shopping site to buy stuff
-- If a search engine change leads to customers buying more stuff, they must have had their search need met more effectively.
-- Therefore, relevance can be measured by how much customers buy.
What's wrong with this argument?
-- Besides ignoring the rest of the buying funnel, it assumes the customer is already ready to buy.

The A/B Test Mystery
- Compare ranking algorithms A and B
- Assign half of the users to A and half to B
- At the end, the avg. revenue is higher for A than for B.
-> Algorithm A could be better than B, or Algorithm A could simply be recommending higher-priced items than B
-> Algorithm A could be recommending completely unrelated, but very popular, items.

So How to do Evaluation?
 - A/B tests, automated metrics, editorial relevance assessments (possibly crowdsourced).
 - Use all of them!

Lessons from IR
One idea: Generalizing the buying funnel
- The information seeking funnel
- Wandering: no information seeking goal in mind
- Exploring: have a general goal, but no plan for achieving it
- Seeking: have started to identify information needs that must be satisfied, but the needs are open-ended
- Asking: have a very specific information need corresponding to a closed-class question
- Published in: "The Information Seeking Funnel", Information-Seeking Support Systems workshop, 2008.

Summary
- Start thinking about how to meet user needs before the user knows she has a need
- Offer different interaction mechanisms for different parts of the information seeking process
- Let the type of content influence the way search works
- Design for realtime
- Interpret behavioral data carefully
- Exploit structure when you have it
- Exploit context when you have it

(My Thoughts and Questions)
 - The world is not only Amazon.  What about linking the products to external sources, like consumer reports, dpreview and other sites?
  --> Amazon enhanced Wikipedia (e.g. Orson Scott Card)
 - Social, how is amazon incorporating social search?
 --> delicate balancing act with Facebook and other sources


 - Do you try and leverage mentions of products on book review sites? or within other books?
 - I recently went to barnes and noble and saw the new Orson Scott Card book, one of my favorite authors.  Why didn't Amazon surface that to me? (support for subscribing to authors)  Or, "buy the new top picks from this month's Cook's Illustrated"...
 - From my perspective, the recommendation quality of Amazon has decreased over time despite more of my data.  Does this reflect a shift in emphasis?






Microsoft Releases Learning to Rank Datasets

Microsoft Research announced that it is releasing a new MS LTR dataset:
"We release two large scale datasets for research on learning to rank: MSLR-WEB30K with more than 30,000 queries, and a random sampling of it, MSLR-WEB10K, with 10,000 queries."

136 features have been extracted for each query-url pair.
It is a retired dataset. What makes it quite interesting is that the features have been released; you can see the feature list.

See also the Y! LTR datasets.

SIGIR 2010 Industry Day: Machine Learning in Search Quality at Yandex

Machine Learning in Search Quality at Yandex
Ilya Segalovich, Yandex

Russian Search Market
- Yandex has 60+% market share
- It's all about close attention to the details of search

A Yandex overview
- started in 1997
- No. 7 search engine in the world by number of queries
- 150 million queries per day

Variety of Markets
- 15 countries with a Cyrillic alphabet
- 77 regions in Russia
-> different cultures, standards of living, average incomes; for example: Moscow vs. Magadan
-> large semi-autonomous ethnic groups (Tatar, Chechen, Bashkir)
-> neighbouring bilingual markets

Geo-specific queries
- Relevant result sets vary significantly across regions and countries

pFound
- a probabilistic measure of user satisfaction
- the optimization goal at Yandex since 2007
- Similar to ERR, Chapelle 2009 --> hopefully someone can fill in the exact formula (a hedged reconstruction is sketched after this list)
- pFound, pBreak, pRel
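Since the exact formula wasn't captured, here is a hedged reconstruction of a pFound/ERR-style cascade metric based on the published descriptions (Chapelle et al., 2009 for ERR): the user scans down the ranking, is satisfied at position i with probability pRel(i), and abandons after any position with probability pBreak. The pBreak constant and relevance values below are just placeholders.

```python
def pfound(p_rel, p_break=0.15):
    """Cascade-style user satisfaction: sum over positions of
    P(user reaches position i) * P(result i satisfies the user)."""
    p_look = 1.0          # probability the user looks at position 1
    score = 0.0
    for p in p_rel:       # p_rel: per-position relevance probabilities
        score += p_look * p
        # The user continues only if not satisfied and not giving up.
        p_look *= (1.0 - p) * (1.0 - p_break)
    return score

print(pfound([0.4, 0.1, 0.7]))   # e.g. a three-result ranking
```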

Geo-specific Ranking
query -> query + user's region
- may need to build specific formulas for some countries/regions because of variance and missing or sparse features in some of them.

Alternatives in Regionalization
- separate local indices or a unified index with geo-coded pages
- one query or region specific query
- query based local intent detection vs. results based local intent detection
- single ranking function vs. co-ranking and re-ranking of local results
- train one formula or train many formulas on local pools

Why use MLR?
Machine learning as a conveyor
- Some query classes require specific ranking
- many features

MatrixNet
A learning method
- boosted decision trees: "oblivious" trees (a sketch follows this block)
- optimize for pFound
- solve regression tasks, train classifiers
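A hedged sketch of what "oblivious" trees mean: every node at a given depth tests the same feature and threshold, so a depth-d tree reduces to a 2^d-entry lookup table and the boosted ensemble is a sum of such tables. The features, splits, and leaf values below are invented; this is not MatrixNet itself.

```python
def eval_oblivious_tree(features, splits, leaf_values):
    """splits: list of (feature_index, threshold), one per tree level.
    leaf_values: list of 2**len(splits) leaf outputs."""
    index = 0
    for feature_index, threshold in splits:
        # Each level contributes one bit of the leaf index.
        index = (index << 1) | (features[feature_index] > threshold)
    return leaf_values[index]

def eval_ensemble(features, trees):
    """Boosted ensemble = sum of tree outputs."""
    return sum(eval_oblivious_tree(features, s, v) for s, v in trees)

# Illustrative two-tree ensemble over a 3-feature document vector.
trees = [
    ([(0, 0.5), (2, 10.0)], [0.0, 0.1, 0.3, 0.9]),
    ([(1, 2.0)], [-0.2, 0.4]),
]
print(eval_ensemble([0.7, 3.0, 4.2], trees))
```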

Complexity of ranking formulas
20 bytes - 2006
14 kb - 2008
220 kb - 2009
120 MB - 2010

A sequence of more and more complex rankers (sketched below)
- pruning with the static rank (static features)
- use of simple dynamic features (such as BM25)
- a complex formula that uses all the available features
- potentially up to millions of matrices/trees for the very top documents
- see Cambazoglu et al., 2010, on early-exit optimization
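A minimal sketch of that cascade, with hypothetical stage sizes and scoring callables standing in for the real static rank, the cheap BM25 stage, and the full model:

```python
def cascade_rank(docs, static_rank, bm25, full_model,
                 keep1=100_000, keep2=1_000):
    """Apply progressively more expensive rankers to progressively
    fewer documents, as in early-exit / cascade ranking."""
    # Stage 1: prune with a precomputed static (query-independent) rank.
    stage1 = sorted(docs, key=static_rank, reverse=True)[:keep1]
    # Stage 2: cheap dynamic features such as BM25.
    stage2 = sorted(stage1, key=bm25, reverse=True)[:keep2]
    # Stage 3: the full, expensive model on the survivors only.
    return sorted(stage2, key=full_model, reverse=True)

# Illustrative usage with made-up scoring functions over (doc_id, features).
docs = [("d%d" % i, {"static": i % 7, "bm25": (i * 13) % 5, "ml": i % 3})
        for i in range(50)]
top = cascade_rank(docs,
                   static_rank=lambda d: d[1]["static"],
                   bm25=lambda d: d[1]["bm25"],
                   full_model=lambda d: d[1]["ml"],
                   keep1=30, keep2=10)
print([d[0] for d in top[:3]])
```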

Geo-dependent queries: pFound
- a big jump in quality in 2009
- 3x more local results than the #2 competitor in Russia

Lessons
- MLR is the only way to do regional search: it gives us the ability to tune many geo-specific models at the same time.

Challenges
Complexity of the models is increasing rapidly
-> don't fit into memory!

MLR in its current setting does not fit time-sensitive queries well
-> features of the fresh content are very sparse and temporal

Opacity of the results of MLR
- the downside of ML

Number of features grows faster than the number of judgments
-> hard to train the ranker

Learning from clicks and user behavior is hard
Tens of GB of data per day!

Yandex and IR
- Participation and Support
- Yandex MLR contest

SIGIR 2010 Industry Day: Query Understanding at Bing

Query Understanding at Bing
Jan Pedersen, Bing

Standard IR assumptions
- Queries are well-formed expressions of intent
- Best effort response to the query as given
Reality: queries contain errors
- 10% of queries are misspelled
- incorrect use of terms (large vocabulary gap)

Users will reformulate
- if results do not meet information need
Reality: If you don't understand what's wrong, you can't reformulate. You miss good content and go down dead ends.

- Take the query, understand what is being asked, and modify the query to get better results

Problem Definitions
- Best effort retrieval
-- Find the most relevant results for the user query
-- Query segmentation
-- Stemming and synonym expansion
-- Term deletion

Automated Query Reformulation
- Modify the user query to produce more relevant results for the inferred intent
-- spell correction
-- term deletion
-- This takes more liberty with the user's intent

Spelling correction
Example: blare house
- corrected to "blair house". There is a "recourse link" to back out because the query was changed.

Stemming
- restaurant -> restaurants
- sf -> san francisco

Abbreviations
-> un jobs -> united nations (may already be there in anchor text)
- utilize co-click patterns to find un/united nations for that page
- it is especially important for long queries, tail queries
- not so good: federated news results for the query. Is the same query interpretation being used consistently? The news vertical did not perform the expansion, and that is a problem.

Term Relaxation
[what is a diploid chromosome] -> "what is a" is not important for matching; it introduces noise

[where can I get an iPhone 4] -> where is an important part of the query. Removing "where" misses the whole point of the query

[bowker's test of change tutorial] -> "test of symmetry" is the correct terminology. How do you know the user's terminology is incorrect? If you relax the query to "bowker's test" you get better results

Key Concepts
- Win/Loss ratios
-- wins are queries whose results improve
-- losses are queries whose results degrade
- Related to precision
-- but not all valid reformulations change results

- Pre vs. Post result analysis
-- Query alternatives generated pre-results
-- Blending decisions are post results

Query Evaluation
"Matching" level 0/l1/l2 -> inverted index, matching and ranking. Reduce billions to hundreds of thousands of pages. Much of the loss can occur here because it never made it into the candidate set. Assume that the other layers that use ML, etc... will bubble the correct results to the top.

"Merge" layer L3 -> the blending with multiple queries will be brought together

Federation layer L4 -> top layer coordinating activity

An important component is the query planner in L3, which performs annotation and rewriting.

Matching and Ranking
L0/L1 - 10^10 docs. L0: Boolean set operations; L1: IR score (a linear function over simple features like BM25 -- simple and fast, but not very accurate)

L2 reranking - 10^5 docs - ML heavy lifting: 1500 features, proximity

L3 reranking - 10^3 - federation and blending

L4 -> 10^1

Learning to Rank Using Gradient Descent (for the L2 layer)
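"Learning to Rank using Gradient Descent" is the title of the RankNet paper (Burges et al., 2005); the talk doesn't spell out how closely Bing's L2 ranker follows it. A hedged sketch of the core idea, minimizing a pairwise logistic loss over (more relevant, less relevant) document pairs, here with a simple linear scorer instead of a neural net:

```python
import math, random

def train_pairwise(pairs, num_features, lr=0.1, epochs=20):
    """pairs: list of (features_better, features_worse) vectors.
    Learns weights w so that w . better > w . worse."""
    w = [0.0] * num_features
    for _ in range(epochs):
        random.shuffle(pairs)
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            s = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of the logistic pairwise loss log(1 + exp(-s)).
            g = -1.0 / (1.0 + math.exp(s))
            w = [wi - lr * g * di for wi, di in zip(w, diff)]
    return w

# Toy example: feature 0 is the one that matters.
pairs = [([1.0, 0.2], [0.1, 0.9]), ([0.8, 0.5], [0.2, 0.4])]
print(train_pairwise(pairs, 2))
```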

Query Annotation

NLP Query annotation
- offline analysis
- Think of the annotations as a parse tree

Ambiguity preserving
- multiple interpretations

Backend independent
- shared

Structure and Attributes
- Syntax and semantics (how to handle leaf nodes in the tree)

Query Planning

[{un | "united nations"} jobs] -> l3-merge(l2-rank([un jobs]), l2-rank(["united nations" jobs]))
OR
[{un | "united nations"} jobs] -> l3-cascade(threshold, l2-rank([un jobs]), l2-rank(["united nations" jobs]))
-- the second is used when the rewrite is less certain; it is conditional

Design Considerations
Efficiency
- one user query may generate multiple backend queries that are merged in L3
- Some queries are cheaper than others
-- query reduction can improve performance

Relevance
- L3 merging has maximal information, but is costly

Multiple query plan strategies
- Depending on query analysis confidence

Query Analysis Models

Noisy Channel Model
argmax_rewrite { P(rewrite | query) } = argmax_rewrite { P(rewrite) P(query | rewrite) }

-- Bayes inversion

- example: spelling
-- language model: likelihood of the correction
-- translation model: likelihood of the error occurring
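A minimal sketch of the noisy channel model for spelling, under the assumption of a toy unigram language model and an edit-distance-based error model (the real Bing models are far richer):

```python
def correct(query, candidates, lm_prob, error_prob):
    """Return argmax over candidates of P(candidate) * P(query | candidate)."""
    return max(candidates,
               key=lambda c: lm_prob.get(c, 1e-12) * error_prob(query, c))

# Toy language model (how likely the rewrite is on its own) and a toy
# error model that decays with edit distance.
lm = {"blair house": 1e-5, "blare house": 1e-9}

def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def error_model(observed, intended):
    return 0.1 ** edit_distance(observed, intended)

print(correct("blare house", ["blair house", "blare house"], lm, error_model))
```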

Language Models
Based on Large-scale text mining
-- unigrams and N-grams (to favor common, previously seen word sequences that make sense)
-- Probability of query term sequence
-- favor queries seen before
-- avoid nonsensical combinations

1T n-gram resource
-- see MS ngram work here in SIGIR

Translation Models
- training sets of aligned pairs (misspelling/correction; surface/stemmed form)

Query log analysis
-- session reformulation
-- co-clicks -> associated queries
-- manual annotation

(missed the references, but see: Wei et al, Pang et al., Craswell)

Summary
- 60-70% of queries are reformulated
- Can radically improve results

Trade-off between relevance and efficiency
- rewrites can be costly
- win/loss ratio is the key

Especially important for tail queries
- no metadata to guide matching and ranking

SIGIR 2010 Industry Day: Search Flavours at Google

Search Flavours: Recent updates and Trends
Yossi Matias
Director of Israel R&D Center, Google

Solution for the search problem: imitate a person

Wish list
- knows everything
- language agnostic
- always up to date
- context sensitive
- understands me
- Good sense of timing
- Good sense of scope
- Smart about interaction

(Suggest answers to questions I didn't ask or didn't ask accurately)
In short, things we expect when we interact with experts or friends. This is subtle.

Demo of things
- auto suggest of weather; an intelligent guess at what the user will ask
- flight information for ua 101
- weather in the suggestion
- This is new because the user does not have a chance to finish the question
- How do we interpret the feedback when users don't give any explicit feedback (except maybe to stop typing)?

- Being local [restaurant] (implicit context)
- world cup (now is a general answer, but a week or two ago it was very different)
- new forms of information: user generated content in real-time, Twitter
- [whale]: it turns out there was a whale jumping onto a ship
- Google trends shows hot topics

Greater Depth With Real-Time
- Example of an earthquake.
- Two minutes after an earthquake, tweets were surfacing in the results before a formal announcement


How?
-- quick slide showing a chart, which he's not going into

Social Circle Personalization
- if someone I know blogs about something or posts a picture, surface it

Understanding: What does Change mean?
- change = to adjust (adjust the brightness) , or convert, or switch all depending on the context

Paul McCartney Concert
- uploading real-time video from the concert
- A few may be good, but we don't want 300 clips all from the same concert

Web translation
- language agnostic
- NY Times translated into Chinese
- translated search
- automatic captioning (translation of an Obama speech to add Arabic captions)

Search by voice... any Voice
- People are starting to use it.
- How do you do it for any person, any language?
- The combination of voice search and translation is almost like science fiction
- This is a significant technology worth paying attention to

Search by Sight
- Google Goggles
- Mobile is important for contextual understanding (location)
- Phones are starting to take on behavior of smart agents
- 10 or 20 results are not useful on a smartphone; "I'm feeling lucky" is important

The power of data
- 1.6 billion Internet users
- A billion searches a day on Google worldwide
- He started working in ML and data mining
- From a research perspective there is a massive benefit of working with it

Trendonomics

Timeliness
- how to leverage trends in data, such as user searches, to derive insights
- Trends over time, location, etc.
- Identify outbreaks of flu: find queries that correlate with CDC reports
- Google could predict the outbreak two weeks ahead of the CDC: a heads up on something happening now.
- Nowcasting: forecasting the present based on information from the past
- Hal Varian: Predict economic indicators before they were published by the Govt.

Real Estate
- Using statistical models to provide up-to-the-minute information on where we are on economic indicators for sectors of real estate.
- It doesn't always work, but it's helpful

2010 World Cup - new Search
- popularity of David Villa, etc...
- South Africa, and sponsors getting attention

Researching Search Trends Time-Series
- Forecasting. Seasonality is a common case. Many queries have strong seasonal components (yearly/ weekly cycles)
- we can use time-series prediction models to forecast
- (e.g. skiing, sports)

- Define notions of how predictable and regular the search queries are
- About half the search queries are predictable in a 12-month-ahead forecast, with a mean absolute prediction error of 12% on average (a sketch of this kind of forecast follows)
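A hedged sketch of the simplest forecast of this kind: a seasonal-naive model (predict each month with the same month one year earlier) evaluated with mean absolute percentage error. The data is invented, and Google's actual time-series models are certainly more sophisticated.

```python
def seasonal_naive_forecast(history, horizon, period=12):
    """Forecast the next `horizon` points by repeating the last full season."""
    season = history[-period:]
    return [season[i % period] for i in range(horizon)]

def mape(actual, predicted):
    """Mean absolute percentage error, i.e. the 'mean abs prediction err' above."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

# Invented 24 months of a seasonal query (e.g. [skiing]) plus the next 12.
history = [10, 12, 30, 80, 90, 40, 15, 10, 9, 11, 20, 60,
           11, 13, 33, 85, 95, 42, 16, 11, 10, 12, 22, 64]
actual_next_year = [12, 14, 35, 90, 100, 45, 17, 12, 11, 13, 24, 70]
forecast = seasonal_naive_forecast(history, 12)
print(mape(actual_next_year, forecast))
```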

Health, Food & Drink, and .. are quite seasonal.

- Categories are more predictable than individual queries

Deviation from modeled prediction
- US automotive industry, forecasting: August 08 - July 09
- Maintenance and parts were ahead of forecast; new sales were below

See papers (a big long list...)
...
What can search predict? many publications by Hal Varian

There is no API, but it is possible to download. They are encouraging collaboration with researchers.

Big themes of the talk:
- real-time is expected ('local'), mobile access

SIGIR 2010 Best Paper Award Winners

The best paper awards were awarded last night at the banquet.

Best Paper
Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. White, J. Huang
In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, including trail recommendation systems that display trails on search result pages.
Best Student Paper
A comparison of general vs personalized affective models for the prediction of topical relevance, I. Arapakis, K. Athanasakos, J. Jose
The main goal is to determine whether the behavioural differences of users have an impact on the models' ability to determine topical relevance and if, by personalising them, we can improve their accuracy. For modelling relevance we extract a set of features from the facial expression data and classify them using Support Vector Machines. Our initial evaluation indicates that accounting for individual differences and applying personalisation introduces, in most cases, a noticeable improvement in the models' performance.

SIGIR Industry Day: Baidu on Future Search

Future Search: From Information Retrieval to Information Enabled Commerce
William Chang, Baidu

Two commerce revolutions
- 1995 the first web search engines (ebay, amazon, etc...)
- China miracle

Early History of IEC
- Early shippers: created corporations, but more importantly, a futures market
- Commerce: coming together to trade: trading goods and information
- Local: Yellow pages created in 1886
- Local classified ads in papers
- Mail order: Sears catalogue in 1888 for farming supplies (enabled by efficient postal service)
- Credit cards: consumer production and data mining
- Development of "advertising science": print, radio, tv

I.E.C in our Daily Lives
- Restaurant menus
- Zagat, Michelin
- Shopping guides, supermarket aisles

Technology and Internet
- Walmart: real-time transaction tracking and inventory management; scale and speed
- Amazon: user generated reviews and recommendations, common business platform
- eBay
- Craigslist

Search Engines
- Y! Directory
- Lycos Crawler, Altavista big index, Excite HotBot
- Infoseek (1996-1999) where he worked
- OR queries
- Phrase inference and query rewriting
- Banner ads tied to search keywords
- real-time addurl
- anti-spam (adversarial IR)
- hyperlink voting and anchor text indexing
- log analysis and query suggestion
- Goto.com / Overture paid placement
- Google ad platform: AdWord, AdSense

Search as Media

Working definition of a media company (1997)
"A media company's business is to help other businesses build brands, and a brand is the total loyalty of the company's customers. A 'new media company' does this by leveraging the interactive nature of the Internet to enable users to communicate with one another..."
China Economics

China Background
- reality: only 15% of Internet users earn 5000/year
- inflation at 5%, with spurts of hyper-inflation
- education and personal aspiration: virtually no illiteracy, but there is a problem with brain drain (about a million of the best and brightest left and never came back)
- competition is fierce in school and in work
- gender equality: one child policy
- entrepreneurial spirit

The Economy
- GDP is growing 10% annually
- Despite a tradition of honoring "old brands" there are few new domestic brands and little marketing know-how
- Domestic commerce is still nascent, lacking IEC tools (no yellow pages or directories that work)

The Prize
- Highly developed Internet in user and usage count: 420 million users, 85% broadband, the average spends 20/week on the internet
- Sitra (sp?) the expedia of China
- Micropayments are made via phone bills: even children use it to buy games online

The Money
- Half the Internet population is under 25
- Tencent QQ is an IM client used by everyone; its virtual currency has real economics
- Online games: Shanda, Giant etc:
- Taobao/Alibaba already 1% of GDP, dominates B2C goods
- Baidu web search dominates B2C services (health, education... help on cramming)
- China mobile: everyone uses it, and for almost everything
- Ctrip: integration of online, mobile, offline services

Baidu
- Aladdin: Open Search Platform (allows webmasters to submit query and content pairs)
-- rich results that form an application

- iKnow (2005) an open Q&A platform: the largest in the world
-- has many partner websites, all with a Q&A panel on their website

- Ark: Open product database

- Map++ (embed yellow page like information on a map)

Baidu Aladdin:
Travel
- On the result page, there is a full panel with airline reservations
Housing, Shopping

A few more ideas:
- The average Chinese worker spends 2-3 hours per day on public transportation. They spend the time playing games or reading "new literature". This is an opportunity for mobile shopping recommendations
- Shopping malls are almost impossible to navigate. There are no directories or ways to find things

Conclusions
- Depends critically on information quality and security: spam
- Users demand quality, but there are still few solid, reliable brands
- There are new novel business models to explore: a trillion dollar opportunity

Wednesday, July 21

SIGIR 2010 Keynote: Donna Harman on Cranfield Paradigm

Is the Cranfield Paradigm Outdated?
by Donna Harman, NIST

Cranfield 1 - (1958 - 1960?)
- Missed most of this due to a late bus.

Cranfield 2 - 1962-1966
Goal: learn what makes a good descriptor
new user model: researcher wanting all documents relevant to their question
Documents: 1400 recent Papers in aeronautical engineering

Questions were gathered from the authors of the papers, asking for the basic problem the paper addressed and also supplemental questions that could have been put to an information service

Full relevance assessments at 5 levels
- complete answer to a question
- high degree of relevance... necessary for the work
- useful as background
- minimal interest, historical interest only
- no interest

Hundreds of manual experiments using different combinations of index terms, specificity, etc., etc.

Metrics used were recall ratio and precision ratio (set retrieval)

The results said we could just use the words in the document (used title and abstract)

Cranfield paradigm (defined)
- Faithfully model a real user application, in this case searching appropriate abstracts with real questions
- have enough documents and queries to allow significant testing on results
- building the collection before the experiments in order to prevent human bias and enable re-usability
- define a metric that reflects the real user

Continuation in SMART project
- Mike Keen spent time at Cornell working on new collections
- (a description of SMART Test Collections)
- They found only 30% agreement between questioners and assessors, but there was no significant difference in how the systems ranked.

Continuation in TREC
- In 1990 DARPA asked NIST to build a new test collection for the TIPSTER project
- User model: intelligence analysts
- large numbers of newspaper articles
- TIPSTER Disk 1 and 2 (mixed short and long documents to force people to focus on length normalization and scale up to full-text from abstracts)
- Topics 1-50 were training topics. Topics 51-100 were created by one person

Relevance Judgments
- Used pooling (took the top 100 docs from each run).
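A minimal sketch of pooling as described above: the judgment pool for a topic is the union of the top-k documents from every submitted run (the run format and depth here are illustrative).

```python
def build_pool(runs, depth=100):
    """runs: {system_name: ranked list of doc ids for one topic}.
    Returns the set of documents to send to assessors."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool

# Toy example with two systems and depth 2.
runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}
print(sorted(build_pool(runs, depth=2)))   # ['d1', 'd2', 'd4']
```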

Overlap for 8 years of Adhoc
- The queries from TREC-1 to TREC-8 got progressively narrower, with fewer relevant documents

What is relevant?
- Back to the user model
- A document is relevant if you would use it in a report in some manner
- This means that even if only one sentence is useful, the document is relevant
- "Duplicates" are also relevant, as it would be very difficult to define and remove them

How complete is the relevant set? (Tipster)
- some relevant documents are not in pools
- But, lack of bias in pools is crucial so that systems that don't contribute to the pool can be fairly judged

Other Relevancy issues (Tipster)
- Relevancy is time and user dependent
- learning issues, novelty issues
- user profiles issues such as prior knowledge, reason for doing search, etc...
- TREC picked the broadest definition of relevancy for several reasons
- it fit the user model well
- it was well-defined and thus likely to be followed
- thousands of documents must be judged quickly (300 documents per hour)
- (Keep these lessons in mind when using Mechanical Turk)

TREC Genomics Track
- User Model: medical researchers working with MDELINE and full-text journals

Topics: Started with a user survey looking for questions
- Included topics based on 4 generic topic type templates and instantiated from real user requests
System response
- ranked list of up to 1000 passages (pieces of paragraphs)

TREC Legal Track
- Very dependent on the user model. It is modeled after actual legal discovery practice, with topics and relevance judgments done by lawyers
- Documents: 7 million messy XML records on tobacco
- Topics: hypothetical complaints
- Relevance judgments: from pool created by sampling
- Metrics: set retrieval, F @ k

Others: NTCIR, ImageCLEF, INEX - the requirements are all determined by the user model

TREC Web Tracks
- Initially used ad hoc user model, just scaled up to 100 GB
- Then scaled to 426 gigabytes
- judgments unlikely to be complete
- possible bias in relevant documents

Cranfield Paradigm outdated??
- Faithfully model a real user application
- However, we need to think outside the current implementations of Cranfield paradigm to find new user models for the web

User Tasks and Types
- TREC-6, Allan et al. on ranked lists vs. visualization
- Bhavnani, TREC 2001: medical librarians and CS students
- White, Dumais, and Teevan: large-scale log studies looking at how domain experts search, e.g. vocabulary, resources, etc.
- Alonso and Mizzaro SIGIR '09 -- Interesting results on what users find important qualities of result sets
- Lin & Smucker, SIGIR '09: PubMed study
- Using logs to determine goals: Rose and Levinson, WWW 2004 manually classified search goals from the Y! logs
- Others...
- Guo, White, Dumais, Wang & Anderson at RIAO 2010: predicting query performance based on user interaction features

Diversity study using logs
Clough et al., SIGIR '09 poster and WSCD '09 work, studying diversity and ambiguity in an MS log
- Size of the Wikipedia article on the topic and query reformulations indicated diversity
Bendersky & Croft WSCD'09 - work on describing long queries.

How can we apply these lessons?

Ad hoc experiments must continue
- There are many different access needs that are basically traditional ad hoc retrieval; specific tasks, long queries, etc...
- Scores in Robust, etc. are still not good. We know there are "easy" things that could be done to improve results significantly: better term weighting, stemming, relevance feedback, etc...
- However, we need to think more about other information access methods, especially on the web/mobile phone, etc.

ClueWeb 09
- If we are going to do "ad hoc" retrieval, where can we get "enough" of the "right" topics?
- How do we get relevance judgments; is it possible to sample and still be "reusable"?
- Is reusability important; how do we reconcile the fact that users only look at the top (the web user model) with the reusability of a collection?
- search engines only judge the very top

What else should we look at in the Web track? Specific subsets of the web.

Retrofit TREC etc. collections

User Simulation
- Lin & Smucker suggested that Cranfield is only one model for user simulation; new test collections could be built for other user models
- We have log studies, plus examples of feature tables from log studies to provide some reality

Cranfield Paradigm not outdated
- We still need to work on ad hoc!
- But, we have to look at new web user models
-- focus on specific web queries where we can contribute (e.g. not 'britney spears')
- We also need to think outside the ranked list mindset; surely that is not all there is!!

Tuesday, July 20

Amit Singhal on the Evolution of Search: Searching without Searches

Engadget has an article covering a presentation (no details provided) given by Amit Singhal on the evolution of search. Most of the interview outlines the evolution of search towards multimedia, real-time search, etc... Most of it has been well covered in the past. One interesting note is that Amit outlines his vision for one possible future direction of search.
Your phone knows about your shopping needs because they're in your to-do list and it knows about your meetings because they're in your schedule. All it needs is your location (which, of course, it has) and some local area information, and it'll ping out a message advising you that you can just pop down the road, buy that wooden stick, and be back in time for your 2PM with Marty from the Synergy Department.
The search engine detected an implicit information need from your to do list that it could satisfy efficiently. It's still a distant dream.

SIGIR 2010 Keynote Address: Refactoring Search by Gary Flake

SIGIR 2010 coverage is starting. You can also follow the coverage on Twitter, #sigir2010. Here are the raw notes from the first keynote address.

Refactoring Search by Gary Flake
- aka Zoomable UIs, IR, and the Uncanny Valley
- Bing search meets Pivot.

- 50 gigs of scans of the Seattle Post-Intelligencer, 600 dpi
- a proof of concept.

- take raw data; combine it with metadata for faceted navigation
- A look at Census data on death. A very novel way of navigating the dataset
- Between search and browsing.

Web Search Retrospective

What's worked well:
- instant answers
- spell correction
- vertical tabs
- query suggestions
- query completion
- grouping results

- The biggest improvement is in overall index scale
- Some improvement in core relevance
- But, this list is actually pretty modest

What hasn't worked as well
- Natural language queries
- richer representations for results
- richer presentation for one result
- clustering (visual or otherwise)

A lack of fluidity is part of the problem

Grokker (RIP, 2009)
- The sexiest search experience that no one was going to use.

Instead of discrete shifts from one query to the next, can we make it a more fluid, interactive process?

Uncanny Valley
- As you increase the sophistication it becomes more pleasing, until it becomes "too real" and then they feel like zombies.

Discrete vs. Continuous Interactions
CGI: stick figures -> The Simpsons -> Toy Story -> Polar Express -> Avatar
UIs: text terminal -> web 1.0 -> Rich client -> over ambitious ajax -> Good Zoomable UIs
Search: Grep -> Altavista -> present day search engines -> Grokker -> ???

Surpassing the uncanny valley is exceedingly difficult because it requires excellence in science, technology, and...

Our dilemmas
- We are already familiar with the dilemma of precision and recall.
- There exists a similar dilemma around scale, fluidity, and complexity.

Zoomable UIs and Similarities to IR

Deep Zoom items
* each tile is an image file
* each level is a set of image files in a folder
* each pyramid is a set of folders with image tiles for each level

Deep Zoom collections
* thumbnails are packed onto shared tiles
* loading 100s of images requires loading few tiles.
* very simple: hierarchical file structure with XML description and metadata (no db)
- The fidelity of the experience is independent of the size of the object.
- The trick is to precompute the pyramid on the backend (a sketch of the layout follows).
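A hedged sketch of how a Deep Zoom-style pyramid is laid out: each level halves the resolution of the one above it until the image fits in a single tile, so the level count is roughly log2 of the largest dimension, and each level is cut into fixed-size tiles. The 256-pixel tile size and the example dimensions are assumptions, not figures from the talk.

```python
import math

def pyramid_levels(width, height):
    """Number of levels needed to shrink the image down to 1x1."""
    return int(math.ceil(math.log2(max(width, height)))) + 1

def tiles_at_level(width, height, level, max_level, tile_size=256):
    """How many tiles the image needs at a given pyramid level."""
    scale = 2 ** (max_level - level)       # how much this level is downscaled
    w = max(1, math.ceil(width / scale))
    h = max(1, math.ceil(height / scale))
    return math.ceil(w / tile_size) * math.ceil(h / tile_size)

w, h = 20000, 14000                        # e.g. one large newspaper scan
max_level = pyramid_levels(w, h) - 1
for level in range(max_level, max_level - 4, -1):
    print(level, tiles_at_level(w, h, level, max_level))
```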


The net outcome is that the user feels in control. "It's like having superhuman powers" to change levels of detail.

Why user control is essential
- they feel empowered to explore
- Actions are more clearly invertible

Lessons from ZUIs to apply to IR
- Preprocess on the backend
- Assume the front end can do a lot
- Build UI around continuous interactions
- Use asynchronous I/O between endpoints
- Use the two in combination to reinforce one another (left versus right brain)

Higher level goals
- Turn the present discrete mode of interaction of search into a continuous dialogue
- Support fluid interactions that are powerful, informative, and fun
- Scale to thousands of items within the user / client interactions

The biggest challenge is in dynamic generation of collections on the backend

Server-side IR problems
- ranking, facet determinations
- cleaning / augmenting bipartite graph

Pivot + search architecture
- Uses the Bing API + thumbnail cache to use Pivot to explore search results
- The UI supports a novel way of analysing a larger corpus of pages across multiple queries.
-- e.g. dpreview.com is prominent across multiple different queries about camera reviews

First: do no harm
- Linear order must be obvious
- First result or instant answer is prominent
- First 4 or so items are easily visible
- Preserve title / url / description format

Next modestly improve
- handle more results: > 50
- basic n-gram extraction
...

Not done
- Document classes as facets
- Document similarity as synthetic facets
- Folksonomies and community tags
- Federation and verticals

Vicious cycle of the web
- easy to create -> more people create
- more stuff created -> harder to find good stuff
....

What's the cure?
We desperately need a mode of interaction where the whole of the data is greater than the sum of the parts.

Wisdom > knowledge > information > data

Q&A
- For facets: word frequency with stop words from abstracts and titles (with just a little cleanup)

Monday, July 19

Headed to SIGIR 2010

I'm leaving for Geneva today to attend SIGIR. I look forward to seeing you there! I will be live-blogging the keynote talks (subject to WiFi availability) and providing other coverage. I will also be tweeting.

Today is tutorial day. The main talks start tomorrow. To get started, here are the best paper nominees from the website.
  • A comparison of general vs personalized affective models for the prediction of topical relevance, I. Arapakis, K. Athanasakos, J. Jose

  • Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. White, J. Huang

  • Caching Search Engine Results over Incremental Indices, F. Junqueira, R. Blanco, E. Bortnikov, R. Lempel, L. Telloli, H. Zaragoza

  • Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell

  • Extending Average Precision to Graded Relevance Judgments, S. Robertson, E. Kanoulas, E. Yilmaz

  • Information Based Model for ad hoc information retrieval, S. Clinchant, E. Gaussier

  • Multi-style language model for web scale information retrieval, K. Wang, J. Gao, X. Li

  • Properties of Optimally Weighted Data Fusion in CBMIR, P. Wilkins, A. Smeaton