Wednesday, September 16

Yahoo! Key Scientific Challenges Coverage III: Web Information Management

Today continues the series (part I: search, part II: machine learning) of Henry's notes from the Yahoo! Key Scientific Challenges summit. This installment covers Brian Cooper's talk on challenges in "Web Information Management," which deals with structured data, unstructured data, and making structure out of unstructured data.

Information extraction
  • Goal: from unstructured -> structured

  • How?
    - site-specific
    - format-specific
    - domain/category-centric

  • Goal: system and process for fast and easy building of domain-centric operations.
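
To make the "site-specific" approach above concrete, here is a minimal sketch of turning an unstructured page into a structured record. Everything in it is an assumption for illustration and not from the talk: the host name `example-stores.test`, the field names, and the regular expressions are all hypothetical, and a real extractor would likely need something sturdier than regexes over raw HTML.

```python
import re

# Hypothetical per-site extraction rules: one regex per field.
# The host name and patterns are made up for illustration.
SITE_RULES = {
    "example-stores.test": {
        "name": re.compile(r"<h1>(.*?)</h1>"),
        "phone": re.compile(r"Phone:\s*([\d-]+)"),
    },
}


def extract(site, page):
    """Apply the rules registered for `site` to the raw `page` text.

    Returns a dict with one entry per field; a field whose pattern
    does not match is set to None.
    """
    record = {}
    for field, pattern in SITE_RULES[site].items():
        match = pattern.search(page)
        record[field] = match.group(1) if match else None
    return record


page = "<h1>Acme Hardware</h1><p>Phone: 555-0100</p>"
print(extract("example-stores.test", page))
```

The rules table is the part that has to be rebuilt for every site, which is presumably why the notes list format-specific and domain-centric approaches as alternatives.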

List extraction versus entity extraction
  • exploits structural regularities/proxies for nested concepts
    - lists, records, attributes
  • create business directories for store locations
  • pulling useful tidbits of info from around the web, dereferencing them, and then presenting them to the user
  • scalability is important
    - get rid of some complex features
  • speed
  • adaptive allocation for reduced server load
  • tagging
    - relying on tags is messy
  • photo albums online allow for quick searching
  • image labeling
    - could use the ESP Game, but that relies on users actually playing it
    - OR let people tag as normal and then offline...
  • detect similar photos
  • overlap tags
  • collaborative tagging
(See also PSOX and information extraction using community efforts).
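
One possible reading of the "detect similar photos, overlap tags" bullets can be sketched as follows. This is a guess at the idea, not the method from the talk: it assumes similarity is measured by Jaccard overlap of the tags themselves (the talk may well have meant visual similarity), and the function names and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections (0.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest_tags(photo_tags, other_photos, threshold=0.5):
    """Suggest tags for one photo from photos with overlapping tags.

    `other_photos` is an iterable of tag collections. Any photo whose
    tag overlap with `photo_tags` reaches `threshold` contributes the
    tags that the target photo is missing.
    """
    own = set(photo_tags)
    suggestions = set()
    for tags in other_photos:
        if jaccard(own, tags) >= threshold:
            suggestions |= set(tags) - own
    return sorted(suggestions)


print(suggest_tags({"beach", "sunset"},
                   [{"beach", "sunset", "hawaii"}, {"cat"}]))
```

Because this runs over tags people have already entered, it could be done offline in a batch, with no one needing to play a labeling game.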

