Tuesday, November 8

Notes on Strata 2011: Entities, Relationships, and Semantics: the State of Structured Search

I didn't attend the talk, but I watched the video and took down notes on it for future reference.

Andrew Hogue (Google NY)
 - worked on Google Squared
 - QA on Google, NER, local search
 - Extraction is never perfect.  Even with a clean db such as Freebase, coverage isn't good: 20/200 dog breeds
 - if you try to build a search engine on top of an incomplete db, users hit the limit, fall off the cliff, and get frustrated
 - Tried to build user models of what people like in the real world (for Google+).  Do you like Tom Hanks?  BIG?
   (Coincidentally, Google just rolled out Google+ Pages that represent entity pages)
    --> if the universe (of people, entities) isn't complete, then users get frustrated
    --> 1) get a bigger db.  2) fall back gracefully to a world of strings (hybrid systems)
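
The "fall back gracefully" idea above can be sketched roughly as follows. This is only an illustration, not anything shown in the talk: the `ENTITY_DB` and `DOCUMENTS` contents and the `hybrid_search` function are all hypothetical, and plain substring matching stands in for a real string-based search engine.

```python
# Hypothetical toy data: a tiny, incomplete entity database and a plain-text collection.
ENTITY_DB = {
    "beagle": {"type": "dog breed"},
    "labrador retriever": {"type": "dog breed"},
}

DOCUMENTS = [
    "Notes on the Norwegian Lundehund, a dog breed missing from the database.",
    "Beagle owners often praise the breed's sense of smell.",
]


def hybrid_search(query):
    """Return a structured result if the entity is known, else fall back to strings."""
    key = query.strip().lower()
    entity = ENTITY_DB.get(key)
    if entity is not None:
        # The structured path: the query matched a known entity.
        return {"mode": "entity", "result": entity}
    # The fallback path: the "world of strings", here plain substring matching.
    hits = [doc for doc in DOCUMENTS if key in doc.lower()]
    return {"mode": "strings", "result": hits}


print(hybrid_search("Beagle")["mode"])
print(hybrid_search("Norwegian Lundehund")["mode"])
```

The point of the design is that a query for an entity outside the db still returns something useful instead of an empty page.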

Breck Baldwin (Alias-i)
 - go hunt down my blog post (on march 8 '09 on how to approach new NLP projects)
 - the biggest problem is the NLP system in the head vs. reality
 - three steps: 1) take some data and annotate it (10 examples); this forces fights earlier and is the #1 best thing.  2) build simple prototypes; info flow is hard.  3) pick an eval metric that maps to the business need
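
As a rough sketch of steps 1 and 3, a handful of hand annotations can be scored against system output like this. The talk did not name a specific metric; precision, recall, and F1 are assumed here as common choices for extraction tasks, and the entity sets are made-up examples.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted items against hand-annotated gold items (set comparison)."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical annotations for one document: what a human marked vs. what the system found.
gold_entities = {"Tom Hanks", "Google", "New York"}
predicted_entities = {"Tom Hanks", "Google", "Big"}

p, r, f = precision_recall_f1(gold_entities, predicted_entities)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Which of the three numbers matters most depends on the business need: a system that must not miss entities cares about recall, one that must not show wrong ones cares about precision.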

Evan Sandhaus (NY Times)
 - on the semantic web (3.0) 
 - the semantic web is a complex implementation of good, simple ideas
 - get your toes wet with a few areas: 1) linked data, and 2) semantic markup
 - 1) linked data - all articles get categorized from a controlled vocabulary (strong IDs tied to all docs). BUT - there is no context for what those IDs mean, e.g. that Barack Obama is the president of the United States, or that Kansas City is the capital...  You need to link to external data to add new understanding.
   -- e.g. find all articles in A1, P1 that mention presidents of the United States
   -- e.g. find all articles that occur near park slope brooklyn
 - 2) semantic markup (RDFa, microformats, rich snippets).  They use the rNews vocab as part of schema.org.
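
The linked data point above (tags alone carry no meaning until joined with external facts) can be illustrated with a small sketch. Everything here is hypothetical: the article ids, the tag ids, the `EXTERNAL_FACTS` table, and the `articles_mentioning` helper are invented for illustration and are not the NY Times vocabulary or API.

```python
# Hypothetical article tags: article id -> set of controlled-vocabulary ids.
ARTICLE_TAGS = {
    "article-1": {"person:barack_obama", "place:park_slope"},
    "article-2": {"person:lebron_james"},
    "article-3": {"person:george_w_bush"},
}

# Hypothetical external linked data: vocabulary id -> set of roles.
EXTERNAL_FACTS = {
    "person:barack_obama": {"president_of_the_united_states"},
    "person:george_w_bush": {"president_of_the_united_states"},
    "person:lebron_james": {"basketball_player"},
}


def articles_mentioning(role):
    """Find articles tagged with any entity that the external data links to `role`."""
    ids_with_role = {eid for eid, roles in EXTERNAL_FACTS.items() if role in roles}
    return sorted(aid for aid, tags in ARTICLE_TAGS.items() if tags & ids_with_role)


print(articles_mentioning("president_of_the_united_states"))
```

The tags by themselves could not answer the "presidents" query; the join with the external table is what adds the new understanding.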

Wlodek Zadrozny (IBM, Watson)
 - what are the open problems in QA
 - Trying to detect relations that occur in the candidate passages that are retrieved (in relevance to the question)
 - Then scores and ranks the candidate answers.  Some of it in RDF data.  Confidences are important because wrong answers are penalized.
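
Why confidences matter when wrong answers are penalized can be shown with a small expected-value sketch. This is not Watson's actual decision logic, which the notes do not describe; the `reward` and `penalty` values and both function names are assumptions made for illustration.

```python
def should_answer(confidence, reward=1.0, penalty=1.0):
    """Answer only if the expected value of answering is positive."""
    expected_value = confidence * reward - (1.0 - confidence) * penalty
    return expected_value > 0.0


def pick_answer(candidates, reward=1.0, penalty=1.0):
    """Take the top-ranked (answer, confidence) pair, or abstain by returning None."""
    if not candidates:
        return None
    answer, confidence = max(candidates, key=lambda pair: pair[1])
    return answer if should_answer(confidence, reward, penalty) else None


# Hypothetical scored candidates for two questions.
print(pick_answer([("A", 0.8), ("B", 0.15)]))
print(pick_answer([("A", 0.4), ("B", 0.3)]))
```

With equal reward and penalty the break-even confidence is 0.5; in general it is penalty / (reward + penalty), so a harsher penalty means abstaining more often.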

 - keys to success: 1) data, 2) methodology, testing often.  Data: 1. QA answer sets from historic archives (200k QA pairs), 2. collected data sources, and 3. test (trace) data (7k experiments, 20-700 MB per experiment).  Lots of error analysis.
 - medical, legal, education

Q: NYT R&D.  The trend around NLP.  Certain things graduate on reliability.  What will these be over the next decade?
  -- Andrew.  The most interesting thing is QA.  Surface answers to direct questions.  (harvard college vs lebron james college)
  -- statistical approaches to language, (when do we have a good parse, vs. we don't know)
  -- Breck - classifiers are getting robust on sentiment, topic classification. breakthroughs in highly customized systems.  finely tuned to a domain in ways that bring lots of value.

Query vs. Document centric
  -- reason across documents at a meta-level.  What can you do when you have great meta-data? (we have hand-checked, clean, data)
  -- in Watson, an alternative to high-quality hand curated data is to augment existing sources with data from the web
     (see Statistical Source Expansion for Question Answering from Nico Schlaefer at CIKM 2011)

QA on the open web
 - Problem - not enough information from users.  People don't ask full natural-language questions (30 to 1)

- Is there an answer?  (Google wins by giving people documents and presenting many possible answers)

Evan - the real-time metadata is needed for the website.  They use a rule-based information extraction system that proposes terms the producers might want to use as tags.  Then the librarians review the producers' tags.

Breck - Recall is hard.  In NER and others.

Overall Summary
 - Wlodek - QA depends on having the data: 1) training/test data, 2) sources, and 3) system tests
 - Evan - Structured data is valuable to get out there, rNews and schema.org.  Publishers should publish it!  It will be a game changer.
 - Breck - 1) annotate your data before you do it. 2) have an eval metric, and 3) LingPipe is free, so use it.
 - Andrew - (involved in schema.org, freebase).  Share your data.  Get it out there.  And -- Ask longer queries!

