Tuesday, November 8

Notes on Strata 2011: Entities, Relationships, and Semantics: the State of Structured Search

I didn't attend the talk, but I watched the video and took down notes on it for future reference.

Andrew Hogue (Google NY)
 - worked on Google Squared
 - QA on Google, NER, local search
 - extraction is never perfect.  Even with a clean db like Freebase, coverage isn't good, e.g. 20/200 dog breeds
 - if you try to build a search engine on top of an incomplete db, users hit the limit, fall off the cliff and get frustrated
 - Tried to build user models of what people like (for Google+).  Do you like Tom Hanks, Big?  Things in the real world.
   (Coincidentally, Google just rolled out Google+ Pages that represent entity pages)
    --> if the universe of people and entities isn't complete, users get frustrated
    --> 1) get a bigger db.  2) fall back gracefully to a world of strings (hybrid systems)
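
Not from the talk: a minimal sketch of what "fall back gracefully to a world of strings" could look like.  The entity table, the documents, and the names `hybrid_search` and `keyword_search` are all made up for illustration.  The idea is just to try the structured lookup first and drop to plain keyword matching when the entity isn't in the db.

```python
from dataclasses import dataclass


@dataclass
class Result:
    source: str  # "entity" if answered from the db, "strings" if from keyword search
    items: list


# Hypothetical, deliberately incomplete entity db (only two dog breeds).
ENTITY_DB = {
    "labrador retriever": {"type": "dog breed"},
    "beagle": {"type": "dog breed"},
}

# Hypothetical document collection used by the string fallback.
DOCUMENTS = [
    "The Beagle is a small scent hound.",
    "The Basenji is a breed of hunting dog.",
]


def keyword_search(query, documents):
    """Return documents containing every query term (case-insensitive substring match)."""
    terms = query.lower().split()
    return [doc for doc in documents if all(term in doc.lower() for term in terms)]


def hybrid_search(query):
    """Answer from the entity db when possible, otherwise fall back to strings."""
    entity = ENTITY_DB.get(query.strip().lower())
    if entity is not None:
        return Result("entity", [entity])
    return Result("strings", keyword_search(query, DOCUMENTS))


known = hybrid_search("Beagle")     # served from the entity db
unknown = hybrid_search("basenji")  # not in the db, served from documents
```

The user who asks about a breed missing from the db still gets documents back instead of an empty page.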

Breck Baldwin (Alias-i)
 - go hunt down my blog post (on March 8 '09 on how to approach new NLP projects)
 - the biggest problem is the NLP system in the head vs. reality
 - three steps: 1) take some data and annotate it; even 10 examples forces the fights to happen earlier (the #1 best thing to do).  2) build simple prototypes; info flow is hard.  3) have an eval metric that maps to the business need
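
Not from the talk: a minimal sketch of step 3, assuming the eval metric is precision/recall/F1 over a handful of hand-annotated entities.  Breck only said the metric should map to the business need, so the choice of metric and the example annotations here are my assumptions.

```python
def precision_recall_f1(gold, predicted):
    """Compare predicted annotations against hand-annotated gold annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: (mention, type) pairs for a few example sentences.
gold = {
    ("Tom Hanks", "PERSON"),
    ("Google", "ORG"),
    ("New York", "LOC"),
    ("IBM", "ORG"),
}
predicted = {
    ("Tom Hanks", "PERSON"),
    ("Google", "ORG"),
    ("Big", "ORG"),  # a wrong prediction
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Here 2 of 3 predictions are right (precision 2/3) and 2 of 4 gold entities are found (recall 1/2).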

Evan Sandhaus (NY Times)
 - on the semantic web (3.0)
 - the semantic web is a complex implementation of good, simple ideas
 - get your toes wet with a few areas: 1) linked data, and 2) semantic markup
 - 1) linked data - all articles get categorized from a controlled vocabulary (strong ids tied to all docs). BUT - there is no context for what those IDs mean, e.g. Barack Obama is the president of the United States.  Kansas city is the capital...  You need to link to external data to add new understanding.
   -- e.g. find all articles in A1, P1 that mention presidents of the United States
   -- e.g. find all articles that occur near Park Slope, Brooklyn
 - 2) semantic markup (RDFa, microformats, rich snippets).  They use the rNews vocab as part of schema.org.
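
Not from the talk: a minimal sketch of the linked data point, assuming articles carry controlled-vocabulary IDs and some external source says what those IDs mean.  The IDs, headlines, predicates, and the `articles_about` function are all hypothetical.

```python
# Hypothetical articles, each tagged with controlled-vocabulary IDs.
ARTICLES = [
    {"headline": "Example headline one", "page": "A1", "tags": {"nyt:obama"}},
    {"headline": "Example headline two", "page": "B4", "tags": {"nyt:obama", "nyt:park_slope"}},
    {"headline": "Example headline three", "page": "A1", "tags": {"nyt:kansas_city"}},
]

# Hypothetical external linked data: ID -> set of (predicate, object) facts.
EXTERNAL_FACTS = {
    "nyt:obama": {("office_held", "President of the United States")},
    "nyt:park_slope": {("located_in", "Brooklyn")},
}


def articles_about(articles, facts, predicate, obj, page=None):
    """Find articles tagged with any ID for which the external fact holds."""
    matching_ids = {
        entity_id for entity_id, triples in facts.items() if (predicate, obj) in triples
    }
    return [
        article
        for article in articles
        if article["tags"] & matching_ids and (page is None or article["page"] == page)
    ]


front_page_presidents = articles_about(
    ARTICLES, EXTERNAL_FACTS, "office_held", "President of the United States", page="A1"
)
```

The IDs alone can't answer "articles about presidents"; the join against the external facts is what adds that understanding.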

Wlodek Zadrozny (IBM, Watson)
 - what are the open problems in QA
 - Trying to detect relations that occur in the retrieved candidate passages (relevant to the question)
 - Then scores and ranks the candidate answers.  Some of it is in RDF data.  Confidences are important because wrong answers are penalized.
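
Not from the talk, and not Watson's actual method: a minimal sketch of why confidences matter when wrong answers are penalized.  I'm assuming a simple payoff where a right answer earns `reward` and a wrong one costs `penalty`; answering is then only worth it when confidence p satisfies p * reward - (1 - p) * penalty > 0, i.e. p > penalty / (reward + penalty).  The function names and candidate answers are hypothetical.

```python
def break_even_confidence(reward, penalty):
    """Confidence above which answering has positive expected value."""
    return penalty / (reward + penalty)


def choose_answer(candidates, reward=1.0, penalty=1.0):
    """Pick the top-ranked candidate, or None (abstain) if it is too uncertain.

    candidates: list of (answer, confidence) pairs with confidence in [0, 1].
    """
    if not candidates:
        return None
    answer, confidence = max(candidates, key=lambda pair: pair[1])
    if confidence > break_even_confidence(reward, penalty):
        return answer
    return None


confident = choose_answer([("answer A", 0.8), ("answer B", 0.15)])  # answers
unsure = choose_answer([("answer A", 0.4), ("answer B", 0.3)])      # abstains
```

With equal reward and penalty the break-even point is 0.5, so the second call abstains rather than risk the penalty.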

keys to success: 1) data, 2) methodology, testing often.  The data: 1. QA answer sets from historic archives (200k QA pairs), 2. a collection of data sources, and 3. test (trace) data (7k experiments, 20-700 MB per experiment).  Lots of error analysis.
 - medical, legal, education

Q: NYT R&D.  The trend around NLP.  Certain things graduate on reliability.  What will these be over the next decade?
  -- Andrew.  The most interesting thing is QA.  Surface answers to direct questions.  (Harvard college vs LeBron James college)
  -- statistical approaches to language, (when do we have a good parse, vs. we don't know)
  -- Breck - classifiers are getting robust on sentiment, topic classification. breakthroughs in highly customized systems.  finely tuned to a domain in ways that bring lots of value.

Query vs. Document centric
  -- reason across documents at a meta-level.  What can you do when you have great meta-data? (we have hand-checked, clean data)
  -- in Watson, an alternative to high-quality hand curated data is to augment existing sources with data from the web
     (see Statistical Source Expansion for Question Answering from Nico Schlaefer at CIKM 2011)

QA on the open web
 - Problem - not enough information from users.  People don't ask full NLP questions (30 to 1)

 - Is there an answer?  (Google wins by giving people documents and presenting many possible answers)

Evan - the real-time metadata is needed for the website.  They use a rule-based information extraction system which suggests terms they might want to use.  Then the librarians review the producers' tags.

Breck - Recall is hard.  In NER and others.

Overall Summary
 - Wlodek - QA depends on having the data: 1) training/test data, 2) sources, and 3) system tests
 - Evan - Structured data is valuable to get out there, rNews and schema.org.  Publishers should publish it!  It will be a game changer.
 - Breck - 1) annotate your data before you do it. 2) have an eval metric, and 3) LingPipe is free, so use it.
 - Andrew - (involved in schema.org, Freebase).  Share your data.  Get it out there.  And -- Ask longer queries!