Tuesday, February 3

Google Research Entity Annotations of the KBA Stream Corpus (FAKBA1)

I'm happy to announce that Google Research is releasing the largest collection of entity-linked data ever made publicly available. The dataset can be used for a wide variety of information retrieval and information extraction tasks.  

The Freebase Annotations of the TREC KBA Stream Corpus 2014 (FAKBA1) contains over 9.4 billion entity annotations from over 496 million documents. More details, including a link to download the data, are available at:

This is an important release because entity linking can be an expensive process that is difficult for researchers to perform at scale.  

The KBA Stream Corpus was designed to help track and filter important updates about entities as they change over time.   The goal of KBA was to recommend edits to Wikipedia editors based on incoming streams of news and social media.  One of the tasks in the track is the "Cumulative Citation Recommendation" (CCR) task, whose goal is to recommend cite-worthy articles to editors.  There is also an extraction task, Streaming Slot Filling, which suggests changes to an entity profile (similar to updating a Wikipedia infobox). 

In order to facilitate research in this field, we annotated all of the English documents from the TREC KBA Stream Corpus 2014 (http://trec-kba.org/kba-stream-corpus-2014.shtml) with entity links to Freebase. The entity links are resolved automatically, and are imperfect. For each named entity recognized we provide: the mention text, begin and end byte offsets, Freebase MID, and confidence scores. The dataset includes manual annotations of the TREC KBA CCR 2014 entity queries (in TSV format) that I performed. 
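The exact file layout isn't shown here, but given the fields listed above (mention text, byte offsets, Freebase MID, and confidence), a reader for the annotations might look like the following sketch. The tab-separated layout, the field order, and the inclusion of a document ID are my assumptions, not the official format spec.

```python
from dataclasses import dataclass

@dataclass
class EntityAnnotation:
    doc_id: str        # stream document identifier (assumed field)
    mention: str       # surface text of the mention
    begin: int         # begin byte offset in the document
    end: int           # end byte offset
    mid: str           # Freebase machine ID, e.g. "/m/02mjmr"
    confidence: float  # automatic linker confidence score

def parse_annotation(line: str) -> EntityAnnotation:
    """Parse one tab-separated annotation record (assumed field order)."""
    doc_id, mention, begin, end, mid, conf = line.rstrip("\n").split("\t")
    return EntityAnnotation(doc_id, mention, int(begin), int(end), mid, float(conf))

# Example with a made-up record:
record = "1326334800-abc123\tBarack Obama\t100\t112\t/m/02mjmr\t0.97"
ann = parse_annotation(record)
print(ann.mid, ann.confidence)  # → /m/02mjmr 0.97
```

Since the links are resolved automatically and are imperfect, filtering on the confidence field before use is probably worthwhile.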

FAKBA1 has 394,051,027 documents with at least one entity annotated. There are over 9.4 billion entity mentions with links to Freebase. 

Although it's early, the dataset has a variety of possible applications, including:
  • TAC Knowledge Base Population Tasks - The goal is to construct a knowledge base, including tasks such as entity linking.  There is a new Tri-lingual track (Spanish, English, and Chinese) being planned for 2015.
  • TREC Temporal Summarization - A track focused on summarizing major world events.
  • TREC Dynamic Domain - A track focused on high-recall filtering, the KBA annotations could be used with the Local Politics vertical.
I hope the data set has broad applications to many researchers!  

You can also stay up-to-date about future releases by reading: 

Monday, July 7

Susan Dumais SIGIR 2014 ACM Athena Award Lecture

Sue Dumais
 - Introduced by Marti Hearst

Putting the searcher back into search

The Changing IR landscape
 - from an arcane skill for librarians and computer geeks to the fabric of everyone's lives

 - How should the evaluation metrics be enriched to characterize the diversity of modern information systems?

How far we have come....
 - The web was a nascent thing 20 years ago.  
 - In June '94, there were 2.7k websites (13.5% were .com)
 - Mosaic was one year old
 - Search in 1994: 17th SIGIR;  text categorization, relevance feedback and query expansion
 - TREC was 2.5 years old (several hundred k newswire; federal register)
 - TREC 2 and 3, the first work on learning to rank
 - Lycos debuted (Fuzzy Mauldin); # web pages - 54k pages (first 128 characters)
   --> 400k pages, to 10s of millions
   --> The rise of infoseek, altavista
 - Behavioral logs: #queries/day: 1.5k

Today, search is everywhere
 - Trillions of webpages
 - Billions of searches and clicks per day
 - A core part of everyday life; the most popular activity on the web. 
 - We should be proud, but... search can be so much more.

Search still fails 20-25% of the time.  And you often invest way more effort than you should. Once you find an item, there is no opportunity to do anything with it. 
- Requires both great results and great experiences (understanding users and whether they are satisfied)

Where are the Searchers in Search?
 - A search box to results
 - But, queries don't fall from the sky in an IID fashion; they come from people trying to perform tasks at a particular place and time. 
 - Search is not the end goal; people search because they are trying to accomplish something. 

 - Cranfield style test collections
 - "A ruthless abstraction of the user .."
 - There is still tremendous variability across topics. 
 - What's missing?
  --> Characterization of queries/tasks
       -- How are they selected?  How far can we generalize?
  --> We do not tend to have searcher-centered metrics
  --> Rich models of searchers
  --> Presentation and interaction

Evaluating search systems

Kinds of behavioral data
 - Lab studies  (detailed instrumentation and interaction)
 - Panel studies (in the wild; 100s to 1000s; special search client)
 - Log studies (millions of people; in the wild, unlabeled) - provides what, not why

Observational study
 - look at how people interact with an existing system; build a model of behavior.

Experimental studies
 - compare existing systems; goal: decide if one approach is better than another

Surprises in (Early) web search logs
 - Early log analysis...
  --> Excite logs in 1997, 1999
  --> Silverstein et al. 1998, 2002
  --> web search != library search
  --> Queries are very short, 2.4 words
  --> Lots of people search about sex
  --> Searches are related (sessions)

Queries not equally likely
 - excite 1999; ~2.5 million queries
 - top 250 queries are 10% of the queries
 - almost a million occurred once
 - top 10: sex, yahoo, chat, horoscope, pokemon, hotmail, games, mp3, weather, ebay
 - tail: acm98; win2k compliance; gold coast newspaper
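The head/tail skew above (top 250 queries covering 10% of all volume, nearly a million singletons) is easy to measure on any query log. A toy sketch with a made-up log, just to illustrate the computation:

```python
from collections import Counter

def head_volume_fraction(queries, k):
    """Fraction of total query volume covered by the k most frequent queries."""
    counts = Counter(queries)
    top = counts.most_common(k)          # [(query, count), ...] for the head
    return sum(c for _, c in top) / len(queries)

# Toy log: a few head queries repeated, plus a long tail of singletons.
log = ["sex"] * 50 + ["yahoo"] * 30 + ["chat"] * 20 + [f"rare query {i}" for i in range(100)]
print(round(head_volume_fraction(log, 3), 2))  # → 0.5
```

On a real log like the 1999 Excite sample, the same function would reproduce the head/tail statistics quoted above.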

Queries vary over time and task
 - Periodicities, trends, events
 - trends: like Tesla, repeated patterns: pizza on saturday
 - What's relevant to the queries changes over time (World Cup) -- What's going on now!
 - Task/Individual - 60% of queries occur in a session

What can logs tell us?
 - query frequency
 - patterns
 - click behavior

-- Experiments are the lifeblood of web systems
 - for every imaginable system variation (ranking, snippets, fonts, latency)
 - if I have 100M dollars to spend, what is important?
 - Basic questions:  What do you want to evaluate?  What are the metrics?

Uses of behavioral logs
 - Often surprising insights about how people interact with search systems
 - Suggest experiments

How do you go from 2.4 words to great results? 
 -> Lots of log data driving important features (query suggestion, autocompletion)

What can't they tell us?
 - Behavior can mean many things
 - Difficult to explore new systems

Web search != library search
 - Traditional "information needs" do not describe web searcher behavior
 - Broder 2002 from Alta Vista logs
 - They did a pop-up survey from June to November 2001

Desktop search != web search
 - desktop search, circa 2000
 - Stuff I've Seen
 - Example searches:  recent email from Fedor that contained a link to his new demo; query: Fedor
 - pdf of a SIGIR paper on context and ranking sent a month ago; query: SIGIR
 - Deployed 6 versions of the system
 -> Queries: very short;  Date was by FAR the most common sort order
 -> Seldom do people switch from the default, but they did switch from best match to Date; for "the information from James," people remember a rough time. 
 -> People didn't care about the type of file; they cared that it was an image. 
 -> More re-finding than finding, more metadata than best match driven
 -> People remember attributes, seldom the details, only the general topic
 --> Rich client-side interface; every time we go into a new area, it has characteristics that are very different from other generations of search

Contextual Retrieval
 - One size does not fit all
  --> SIGIR  (who's asking, where are they, what they have done in the past)
 - Queries are difficult to interpret in isolation
 - SIGIR - information retrieval vs. special inspector general for iraq reconstruction
 - A single ranking severely limits the potential because different people have different notions of relevance

Potential for Personalization
 - Framework to quantify the variation of relevance for the same query across individuals (Teevan et al., ToCHI 2010)
 - Regardless of how you measure it, there is tremendous potential; it varies widely across different queries
 - 46% potential increase in search ranking
 - 70% if we take into account individual notions of relevance
 - Need to model individuals in different ways

Personal navigation
 - Teevan et al. SIGIR 2007, Tyler and Teevan WSDM 2010
 - Re-finding in web search; 33% are queries you've issued before
 - 39% of clicks are things they've visited before
 - "Personal" navigation queries
 --> 15% of queries 
 --> simple 12 line algorithm
 --> If you issued a query and clicked on only one link twice, you are 95% likely to do it again
 - Resulted in online A/B experiments (successfully)
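The talk doesn't reproduce the 12-line algorithm itself, but the stated rule (same query, same single click, twice) is simple enough to sketch. This is my reconstruction of the idea, not the paper's actual code:

```python
from collections import defaultdict

def personal_navigation(history):
    """history: iterable of (user, query, clicked_urls) search events.
    Predicts a URL for (user, query) pairs where the user issued the query
    at least twice and clicked exactly the same single result each time --
    the 'personal navigation' case described in the talk."""
    seen = defaultdict(list)
    for user, query, clicks in history:
        if len(clicks) == 1:                 # single-click searches only
            seen[(user, query)].append(clicks[0])
    return {key: urls[0] for key, urls in seen.items()
            if len(urls) >= 2 and len(set(urls)) == 1}

history = [
    ("u1", "bofa", ["bankofamerica.com"]),
    ("u1", "bofa", ["bankofamerica.com"]),
    ("u2", "jaguar", ["jaguar.com", "wikipedia.org/wiki/Jaguar"]),
]
print(personal_navigation(history))  # → {('u1', 'bofa'): 'bankofamerica.com'}
```

Per the numbers above, a prediction like this would be right roughly 95% of the time, which is why such a tiny heuristic survived online A/B testing.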

Adaptive Ranking
Bennett et al. SIGIR 2012
 - Queries don't occur in isolation
 - 60% of sessions contain multiple queries
 - 50% of the time occur in sessions that last 30+ mins (infrequent, but important)
 - 15% of tasks continue across sessions

User Model
 - specific queries, URLs, topic distributions
 - Session (short) +25%
 - Historic (long) +45%
 - combinations  - 65-75%
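The notes don't say how the session and historic models are combined; one common way to get such a combination is linear interpolation of the two distributions. A sketch under that assumption, with illustrative weights and topics:

```python
def interpolate(session_model, historic_model, alpha=0.5):
    """Linear interpolation of two topic distributions (dict: topic -> prob).
    alpha weights the short-term session model; the value is illustrative,
    not taken from the paper."""
    topics = set(session_model) | set(historic_model)
    return {t: alpha * session_model.get(t, 0.0)
               + (1 - alpha) * historic_model.get(t, 0.0)
            for t in topics}

session = {"autos": 0.8, "sports": 0.2}      # what the user did this session
historic = {"autos": 0.1, "cooking": 0.9}    # long-term profile
combined = interpolate(session, historic, alpha=0.6)
print(round(combined["autos"], 2))  # → 0.52
```

The relative gains quoted above (+25% session, +45% historic, 65-75% combined) suggest the two signals are complementary rather than redundant.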

- By third query in a session, just pay attention to what is happening now. 

 - We have complementary methods to understand and model searchers
 - Especially important in new search domains and in accommodating the variability we see across people and tasks in the real world

 - More and more importance of spatial-temporal context (here now)
 - Richer representations and dialogs
  --> e.g. knowledge graphs
 - More proactive search (especially in mobile)
 - Tighter coupling of digital and physical worlds 
 - Computational platforms that couple human and algorithmic components
 - If search doesn't work for people, it doesn't work; Let's make sure it does!!!

We need to extend our evaluation methodologies to handle the diversity of searchers, tasks, and interactivity.

Disclaimer: The views here expressed are purely mine and do not reflect those of Google in any way.

Wednesday, August 15

SIGIR 2012 Best Paper Awards

Last night at the SIGIR 2012 banquet, James Allan presented the best paper awards.  This year there were two awards, plus an additional honorable mention!

On with the papers:

Honorable Mention
Robust Ranking Models via Risk-Sensitive Optimization  
Lidan Wang (UMd), Paul N. Bennett (Microsoft), Kevyn Collins-Thompson (MSR)
This paper tackles the issue of robustness, examining how systems that achieve gains overall may still significantly hurt many queries.  They present a framework for optimizing both effectiveness and robustness, and the tradeoff between the two. 

Best Student Paper

Top-k Learning to Rank: Labeling, Ranking and Evaluation
Shuzi Niu (Institute of Computing Technology, CAS), Jiafeng Guo, Yanyan Lan (Chinese Academy of Sciences), Xueqi Cheng (Institute of Computing Technology, CAS)

Best Paper Award

Time-Based Calibration of Effectiveness Measures 
Mark D Smucker (University of Waterloo), Charles L. A. Clarke (University of Waterloo)
"In this paper, we introduce a time-biased gain measure, which explicitly accommodates such aspects of the search process... As its primary benefit, the measure allows us to evaluate system performance in human terms, while maintaining the simplicity and repeatability of system-oriented tests. Overall, we aim to achieve a clearer connection between user-oriented studies and system-oriented tests, allowing us to better transfer insights and outcomes from one to the other."

Monday, August 13

Norbert Fuhr SIGIR 2012 Salton Keynote Speech

Norbert Fuhr presented the Salton Award keynote speech.

James Allan presented Norbert Fuhr with the 10th Salton award.

He published an IR paper in 1984, in Cambridge, England.  The paper was 19 pages long.  Since then, he has authored over 200 papers.
 - foreshadowing learning ranking functions.
 - probabilistic retrieval models
 - retrieval models for interactive retrieval

"Information Retrieval as Engineering Science"

[We have to listen to the old guys, but we don't have to accept it, but this doesn't hold for my talk today]

What is IR?
 - IR is about vagueness and imprecision in information systems

 - User is not able to precisely specify the object he is looking for
   --> "I am looking for a high-end Android smartphone at a reasonable price"
 - Typically, an iterative retrieval process.
- IR is not restricted to unstructured media

 - the person's knowledge about the objects in the database is incomplete / imprecise
 -> limitations in the representation
 -> imprecise object attributes (unreliable metadata, e.g. availability)

IR vs Databases
 -> DB: given a query q, find objects o with o->q
 -> IR: given a query q, find documents d with high values of P(d->q)
 -> DB is a special case of IR! (in a certain sense)

Foundation of DBs
 -> Codd's paper on the relational model was classified as Information Retrieval
 -> The concept of transactions separated the fields.

Fundamental differences between IR and DB is handling the pragmatic level.
DB:  User interactions with the application --> DBMs --> DB
IR: User interacts with the IR system -> over the collection
(separation between the management system and the application)

What IR could learn from DB systems
 Multiple steps of inference  "joins" a->b, b->c
 -> join, links over documents
 -> combine the knowledge across multiple documents

Expressive query language
 -> specify the inference scheme
 -> specify documents parts/aggregates to be retrieved

Data types and vague predicates
 -> not every string is text: times, dates, locations, amounts, person/product names.  [entities]
 -> provide data type-specific comparison predicates (<, set comparison, etc..)
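As a toy illustration of a data-type-specific vague predicate (my example, not Fuhr's): dates could be compared with a graded "about the same time" predicate instead of exact equality, returning a degree of match the ranking can use:

```python
from datetime import date

def about_same_time(d1: date, d2: date, scale_days: float = 30.0) -> float:
    """Vague equality for dates: 1.0 when equal, decaying toward 0 as the
    gap grows; scale_days controls how fast (an illustrative choice)."""
    gap = abs((d1 - d2).days)
    return 1.0 / (1.0 + gap / scale_days)

print(about_same_time(date(2012, 8, 13), date(2012, 8, 13)))                # → 1.0
print(round(about_same_time(date(2012, 8, 13), date(2012, 9, 12)), 2))      # → 0.5
```

The same pattern extends to the other data types mentioned (locations, amounts, names): each gets a comparison predicate that returns a degree of match rather than a boolean.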

IR as Engineering Science
 - Most of us feel that we are engineers.  But, things are not as simple as they might seem.
 -> Example:  An IR person in civil engineering.

4 or 5 types of bridges - Boolean bridge, vector space, language modeling, etc..
 -- build all 5 and see which one stands up.
 -- Test the variants in live search
 -- Users in IR blame themselves when they drive over a collapsing (non-optimal) system
 -- There could be serious implications by choosing a non-optimal system.

Why we need IR engineering
 -> IR is more than web search
Institutions and companies have
 - large varieties of collections
 - broad range of search tasks
[example: searching in the medical domain.  A doctor performs a search, and then waits 30 minutes for an answer.  As engineers, we could work on getting this down to 10 minutes.]

Limitations of current knowledge
 - Moving in Terra Incognita
 - example: Africa.  The western world's knowledge of African geography several hundred years ago; the map was very inaccurate and incomplete.
- At best, interpolation is reasonable.
- Extrapolation lacks theoretic foundation
-> But how to define the boundaries of current knowledge?

Theoretic Models
 -> Probability Ranking Principle
 -> Relevance oriented probabilistic models
 -> IR as uncertain inference
 -> Language Models

Value of Theoretic Models
 -> Deeper insight (scientific interest)
 -> General validity as basis for broad range of applications
 -> Make better predictions (engineer's view)

We should put more focus on the development of theoretic models.
 -> each theory is applicable within a well-defined application range

But, what is the application range?
 -> defined by the underlying assumptions
 -> Are the underlying assumptions of the model valid? For this, we need experiments to validate them.

 - Why vs How experiments

Why -> based on a solid theoretical model.
 -> performed for validating the model assumptions

How -> based on some ad-hoc model
 - focus on the outcome
 - no interest in the underlying assumptions

How experiments
 -> "Improvements That Don't Add Up: Ad-hoc Retrieval Results Since 1998".
 -> Trec-8 adhoc collection, MAP
 -> It's easy to show improvements, but few beat the best official TREC result.
 -> Over 90% of the papers claim improvements that exist due to poor baselines, but do not beat the best TREC results.
-> Improvements don't add up.

Limitations of Empirical Approaches
 -> Is computer science truly scientific? CACM 7/2010

Theoretical vs Experimental

 - why
- explanatory power
- basis for a variety of approaches
 - long standing

 - How?
 - Good results on some collections
 - potential for some further improvements (in limited settings)
 - short lived

Why experimentation
 - Ex:  Binary Independence Retrieval model
 -> terms are distributed independently in the relevant and irrelevant documents
 -> did anyone ever check this?

Looking under the hood
 -> TF-IDF term weights in probabilistic notation.  P(R) for a class of terms.
 -> Plots of relevance vs tf and IDF for trec adhoc and INEX IEEE

Systematic Experimentation
Towards evidence based IR
 -> A large variety of test collections
 -> large number of controlled variables

IR Engineering:  How are results affected by these components?

Controlled variables
 -> language, length, collection size, vocabulary size, domain, genre, structure
 -> length, linguistic structure, application domain, user expertise

What other variables are also important?

Even assuming these are the important variables, we have a high parameter search space.

A plug for evaluatIR.org -> supporting Benchmarking and meta-studies.

Grand IR Theory  vs Empirical Science
 -> Theory alone will not do.

Foundations of IR Engineering
-> Base layer: theory.  Then evidence, and we build the bridge on top of that.

IR Research Supporting Engineering
 1) Theory.  Proofs instead of empirics + heuristics
   - Experiments for validating underlying assumptions

2) Broad Empirical Evidence
 - Strict controlled experimental conditions
 - Repeat experiments with other collections / tasks
 - variables affecting performance
 - metastudies

New IR Applications
 - Dozens of IR applications (see the SWIRL 2012 workshop)
 - Heuristic approaches are valuable for starting and comparison, but they are limited in the generality.
 -> We don't know how far we can generalize the method.

Conclusion: Possible Steps
 -> Encourage theoretic research of the 'why' type, e.g. having a separate conference track for these papers.
 -> Define and enforce rigid evaluation standards to be able to perform metastudies
 -> Setup repositories for standardized benchmarks.


Nick Belkin -> Where do the assumptions underlying the theory come from?  Where do we get evidence?    How would you approach that?
 -> A: Without any hypothesis, the observations are useless.  We need a back and forth between theory and experimentation.

DB and IR
-> Can they be united?  DB is part of IR.  IR is part of DB.   [the issue is bringing the people together]

David Hawking
 -> Have we hit the limit of our engineering capability?  What are the biggest opportunities for significant progress?
A:  We perhaps cannot improve the classical adhoc setting.  We need to know more about the user and their task.  Smartphone example: your phone knows where you are, what time it is, that you're looking for a Chinese restaurant (including opening hours).  We need to study the hard tasks for knowledge workers that integrate more deeply into their applications.

Friday, August 10

Food and Drink Guide to Portland for SIGIR 2012

I'm preparing to leave for SIGIR 2012 in Portland.

If you're planning some sightseeing, or a place to catch up over a beer or meal with friends and colleagues, here are some ideas.

Portland has been made famous by the "Portlandia" series.  It's a quirky, young, outdoorsy, hipster, organic, crunchy kind of town, where "young people go to retire".  It's ground zero for the burgeoning craft beer, coffee, and micro-distilling movements in America.  It has been called "beervana" because of its plethora of outstanding breweries, bars, and pubs.

To cut to the chase, here is my Food Map of Portland on Google maps.  Below are some of my sources and raw research notes.

Note: If you arrive in Portland early and you like food, be sure to check out the Bite of Oregon food festival taking place on Saturday and Sunday.

Portland food links:

Go for a hike

Fine dining Restaurants:
Beast  (think meat!) 
Le Pigeon (think Paris!) and their new place (foie gras profiteroles!) Little Bird (another french bistro)

Coffee Shops
Portland is known for having some of the best coffee in the country.  Here are some of the best places to try a cup.
Stumptown (several locations)
Ristretto Roasters
Coava Coffee
(the business area near the conference is a bit of a coffee & restaurant wasteland, so plan on venturing north into the heart of downtown)

Portland is big into food trucks
Nong's Khao Man Gai, Koi Fusion, Wolf and Bear, Big A$ sandwich, Lardo,  Good Food Here
Many are only open for lunch in downtown, so check their hours.  For late night nosh, check the east side of town (great after a night imbibing at one of the east side bars/pubs)
Mai Pho (lemongrass tofu over rice), Pyro Pizza and Give Pizza a Chance (same owner), and Sugar Cube for dessert.

Top places to eat.
Kenny and Zuke's Deli (think Portland's version of Katz's in NY)
People's Pig [temporarily closed] a food truck serving lunch, famous for its porchetta sandwich (see Serious Eats coverage)
Pok Pok - Thai street food (get the drinking vinegar) - be prepared to wait, there is always a long line (think 1 to 1.5 hours.  Go across the street and wait at the whisky soda lounge), famous for its fish sauce wings.  There is a new restaurant, Ping in downtown from the same owners, which was just named one of GQ's ten best new restaurants.  It's reasonably priced, casual food without crazy lines.
Beaker and Flask (restaurant of the year by Willamette Week)
Ken's Artisan Pizza
Olympic Provisions - great lunch places known for salumi
The Meadow - Artisan chocolates and food
Bakeshop - artisan croissants and bakery
Salt and Straw - ice cream
Coco Donut -- the locals prefer it to the more touristy Voodoo Donuts

Fine dining
Le Pigeon
Little Bird Bistro
Toro Bravo 
Beast (James Beard nominated (winner?) chef Naomi Pomeroy makes killer meat.  I have a cooking crush on her)

PSU Portland farmer's market (on Saturday)

Beer & Spirits

Distillery Row

Clear Creek Distillery

Top pubs / Beer Bars
Bailey's Taproom (an icon)
Horse Brass (an icon)
Belmont Station
Beer Mongers
Green Dragon

Brewpubs to visit
Hopworks urban brewery  (classic portland - eco-brewpub -- bikes, beer, and great food)
Roots Organic brewpub

See the Portland entries in 2012 top breweries in the world: http://www.ratebeer.com/RateBeerBest/bestbrewers_012012.asp
Hair of the dog (the gold standard of portland breweries - best brewery in portland -- also has great food.  A must visit!)
Upright Brewing Company (hot new up-and-comer - just beer, in a hard-to-find location; only hard-core beer geeks need apply)
Deschutes (large, popular standby)
Cascade Brewing (known for crazy sour beer!)
Breakside Brewery
Hopworks urban brewery
Gigantic Brewing
 - brand new brewery just opened in may.
Rogue brewery is a hometown favorite --> their current Voodoo Doughnut maple bacon beer is really unique

Must try beers
BridgePort IPA
Deschutes Bachelor Bitter
DOA, Hopworks
Rogue Maple Bacon Doughnut beer

Consider a beer walking tour:  cascade brewing -->  green dragon --> hair of the dog

Beaker and Flask (restaurant of the year by Willamette Week, awesome cocktails!)
Clyde Common (great food + drinks, casual, great for a group in downtown)
Rum Club

Thursday, April 12

Amazon CloudSearch, Elastic Search as a Service

The search division of Amazon, A9, today announced the release of CloudSearch.  Amazon CTO Werner Vogels announced it on his blog, All Things Distributed.  The AWS blog also has a new post on the announcement.

For the details and pricing, there is also the official CloudSearch details page.

CloudSearch is a fully managed search service based on Amazon's search infrastructure that provides near-realtime, faceted, scalable search.  The index is stored in memory for fast search and updates.

Dynamic Scaling
What makes the A9 offering particularly interesting is its ability to dynamically scale.  The architecture of A9's search system, with shards and replicas, is a common and well-understood model.  What makes Amazon's offering unique is the ability to easily scale your search cluster.  A9 will automatically add (and remove) search instances and index partitions as the index grows and shrinks.  It will also dynamically add and remove replicas to respond to changes in search request traffic.  The exact technical details are not yet clearly described.

Right now, there is a limit of 50 search instances.  An extra large search instance can handle approximately 8 million 1K documents.  It appears the assumption is that the documents are quite small (e.g. product documents).  To put it in perspective, a rough rule of thumb for web documents is approximately 10K.  Given this, it translates into roughly 800k web documents per server * 50 servers = 40 million web documents.  This is not for building large-scale web search, yet.  However, it should be more than enough for most enterprise e-commerce and site-search applications.
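That back-of-the-envelope estimate can be written out explicitly. A quick sketch using the rough numbers above (all of them rules of thumb from the text, not published specs):

```python
def max_web_docs(bytes_per_instance=8_000_000 * 1_000,  # ~8M 1K docs per XL instance
                 bytes_per_web_doc=10_000,              # ~10K per typical web document
                 instances=50):                         # current instance limit
    """Rough capacity estimate: how many ~10K web documents fit in the service."""
    docs_per_instance = bytes_per_instance // bytes_per_web_doc
    return docs_per_instance * instances

print(max_web_docs())  # → 40000000
```

Changing any of the three assumptions (larger instances, smaller documents, a raised instance limit) moves the ceiling proportionally.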

The real value added by the search engine is in the ranking of results.

The control over the search index ranking is rudimentary, with a few basic knobs.  You can add stopwords, perform stemming, and add synonyms.  This is very basic stuff.  How you might make more interesting (and important) IR ranking changes is vague.  From the article:
Rank expressions are mathematical functions that you can use to change how search results are ranked. By default, documents are ranked by a text relevance score that takes into account the proximity of the search terms and the frequency of those terms within a document. You can use rank expressions to include other factors in the ranking. For example, if you have a numeric field in your domain called 'popularity,' you can define a rank expression that combines popularity with the default text relevance score to rank relevant popular documents higher in your search results.
This indicates that it is possible to boost documents.  However, it is unclear how the underlying text search works in order to boost individual important fields (e.g. name, description).
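A rank expression of the kind the quote describes boils down to a score blend. A hypothetical sketch of the idea (the weights and field names are made up for illustration, not CloudSearch defaults):

```python
def rerank(results, w_text=0.7, w_pop=0.3):
    """results: list of dicts with 'text_relevance' and 'popularity' scores in [0, 1].
    Combines the default text relevance score with a popularity field,
    mirroring the rank-expression example in the quote; weights are illustrative."""
    def score(doc):
        return w_text * doc["text_relevance"] + w_pop * doc["popularity"]
    return sorted(results, key=score, reverse=True)

docs = [
    {"id": "a", "text_relevance": 0.9, "popularity": 0.1},
    {"id": "b", "text_relevance": 0.7, "popularity": 0.9},
]
print([d["id"] for d in rerank(docs)])  # → ['b', 'a']
```

Note how the popular-but-slightly-less-relevant document wins under this blend; tuning those weights well is exactly the kind of ranking work the service leaves vague.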

For more details on the more advanced query processing needed to make search work in practice, read the post: Query Rewriting in Search Engines from Hugh Williams at eBay.  In order to employ these methods, you need log data, which brings me to my next point.

Missing Pieces
A key missing component is a usage-driven framework to improve ranking using queries, clicks, and other user behavior signals -- a feedback mechanism to change ranking based on analysis (ideally automatic).

Overall, the most compelling aspect of this is the dynamic scaling.  It gives people a simple platform that scales transparently for many enterprise search and e-commerce applications.

Tuesday, November 8

Notes on Strata 2011: Entities, Relationships, and Semantics: the State of Structured Search


I didn't attend the talk, but I watched the video and took down notes on it for future reference.

Andrew Hogue (Google NY)
 - worked on Google Squared
 - QA on Google, NER, local search
 - (extraction is never perfect) even with a clean db like Freebase, coverage isn't good: 20/200 dog breeds
 - if you try to build a search engine on top of the incomplete db, users hit the limit, fall off the cliff and get frustrated
 - Tried to build user models of what people like (for Google+).  Do you like Tom Hanks? BIG? In the real world.
   (Coincidentally, Google just rolled out Google+ Pages that represent entity pages)
    --> if the universe isn't complete, people, entities, then they get frustrated
    --> 1) get a bigger db.  2) fall back gracefully to a world of strings (hybrid systems)

Breck Baldwin (Alias-i)
 - go hunt down my blog post (from March 8, '09, on how to approach new NLP projects)
 - the biggest problem is the NLP system in the head vs. reality
 - three steps: 1) take some data and annotate it.  10 examples.  Forces fights earlier.  #1 best thing.  2) build simple prototypes; info flow is hard.  3) an eval metric that maps to the business need

Evan Sandhaus (NY Times)
 - on the semantic web (3.0) 
 - the semantic web is a complex implementation of good, simple ideas
 - get your toe wet with a few areas: 1) linked data, and 2) semantic markup
 - 1) linked data - all articles get categorized from a controlled vocabulary (strong IDs tied to all docs). BUT - no context to what those IDs mean, e.g. Barack Obama is the president of the United States, Kansas City is the capital...  You need to link the external data to add new understanding.
   -- e.g. find all articles in A1, P1 that mention presidents of the United States
   -- e.g. find all articles that occur near park slope brooklyn
 2) semantic markup (RDFa, microformats, rich snippets).  They use the rNews vocab as part of schema.org.
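The "link external data to add new understanding" idea is essentially a join between article tags and an external fact set. A toy sketch (all data made up) of the "find all articles that mention presidents of the United States" example:

```python
# Articles tagged with controlled-vocabulary entity IDs (toy data).
articles = {
    "a1": {"person/barack_obama", "place/kansas_city"},
    "a2": {"person/julia_child"},
    "a3": {"person/george_washington"},
}

# External linked data: which IDs denote US presidents (toy fact set;
# in practice this would come from a source like DBpedia or Freebase).
us_presidents = {"person/barack_obama", "person/george_washington"}

# The "join": articles whose tags intersect the external set.
matches = sorted(a for a, tags in articles.items() if tags & us_presidents)
print(matches)  # → ['a1', 'a3']
```

The controlled vocabulary alone can't answer this query; it's the intersection with the external data that adds the new understanding.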

Wlodek Zadrozny (IBM, Watson)
 - what are the open problems in QA
 - Trying to detect relations that occur in the candidate passages that are retrieved (in relevance to the question)
 - Then scores and ranks the candidate answers.  Some of it in RDF data.  Confidences are important because wrong answers are penalized.

keys to success: 1) data, 2) methodology, testing often.  The data: 1. QA answer sets from historic archives (200k QA pairs), 2. collection data sources, and 3. test (trace) data (7k experiments, 20-700 MB per experiment; lots of error analysis).
 - medical, legal, education

Q: NYT R&D.  The trend around NLP.  Certain things graduate on reliability.  What will these be over the next decade?
  -- Andrew.  The most interesting thing is QA.  Surface answers to direct questions.  (harvard college vs lebron james college)
  -- statistical approaches to language, (when do we have a good parse, vs. we don't know)
  -- Breck - classifiers are getting robust on sentiment, topic classification. breakthroughs in highly customized systems.  finely tuned to a domain in ways that bring lots of value.

Query vs. Document centric
  -- reason across documents at a meta-level.  What can you do when you have great meta-data? (we have hand-checked, clean, data)
  -- in Watson, an alternative to high-quality hand curated data is to augment existing sources with data from the web
     (see Statistical Source Expansion for Question Answering from Nico Schlaefer at CIKM 2011)

QA on the open web
 - Problem - not enough information from users.  People don't ask full NLP questions (30 to 1)

- Is there an answer?  (Google wins by giving people documents and presenting many possible answers)

Evan - the real-time metadata is needed for the website.  They use a rule-based information extraction system which suggests terms they might want to suggest.  Then the librarians review the producers' tags.  

Breck - Recall is hard.  In NER and others.

Overall Summary
 - Wlodek - QA depends on having the data: 1) training/test data, 2) sources, and 3) system tests
 - Evan - Structured data is valuable to get out there, rNews and schema.org.  Publishers should publish it!  It will be a game changer.
 - Breck - 1) annotate your data before you do it. 2) have an eval metric, and 3) lingpipe is free, so use it.
 - Andrew - (involved in schema.org, freebase).  Share your data.  Get it out there.  And -- Ask longer queries!