Thursday, September 17

Hadoop 0.20.1 released

Hadoop 0.20.1 is finally here. Get it while it's hot! If you want, you can read the full release notes. This is the release to use if you are setting up a new cluster. It's also worth upgrading older pre-0.20.x clusters to this release.

Hadoop 0.20.x is very different from previous releases. The configuration and APIs have been overhauled. As previously mentioned, there is the new TFile storage format.

Look for imminent releases of PIG 0.4 and the Cloudera distribution CDH2 0.20.x, with Hive and PIG support.

Wednesday, September 16

Yahoo! Key Scientific Challenges Coverage III: Web Information Management

Today continues the series (part I: search, part II: machine learning) of Henry's notes from the Yahoo! Key Scientific Challenges summit. Today we are covering Brian Cooper's talk on challenges in "Web Information Management", which deals with structured data, unstructured data, and extracting structure from unstructured data.

Information extraction
  • Goal: from unstructured -> structured

  • How?
    - site-specific
    - format-specific
    - domain/category-centric

  • Goal: system and process for fast and easy building of domain-centric operations.
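To make the "format-specific" approach in the list above concrete, here is a minimal sketch in Python. It is not from the talk: the class name ListItemExtractor, the function extract_records, and the assumption that each list item holds one record with fields separated by "|" are all hypothetical choices made for illustration. It uses only the standard library's html.parser and assumes well-formed, non-nested list items.

```python
from html.parser import HTMLParser


class ListItemExtractor(HTMLParser):
    """Collects the text of every <li> element (hypothetical helper).

    Assumes list items are explicitly closed and not nested.
    """

    def __init__(self):
        super().__init__()
        self.items = []
        self._buf = None  # text chunks of the <li> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buf is not None:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._buf).split())
            if text:
                self.items.append(text)
            self._buf = None


def extract_records(html, sep="|"):
    """Turn each list item into a record (a list of fields).

    The separator is an assumption about the page format, which is
    exactly what makes this extractor format-specific.
    """
    parser = ListItemExtractor()
    parser.feed(html)
    parser.close()
    return [[field.strip() for field in item.split(sep)]
            for item in parser.items]


page = (
    "<ul>"
    "<li>Acme Hardware | 12 Main St | Springfield</li>"
    "<li>Bolt Depot | 9 Oak Ave | Shelbyville</li>"
    "</ul>"
)
records = extract_records(page)
```

An extractor like this only works on pages that share the assumed layout, which is presumably why the notes list site-specific and domain-centric approaches as alternatives.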

List extraction versus entity extraction
  • exploits structured regularities/proxies to nested concepts
    - lists, records, attributes
  • create business directories for store locations
  • pulling useful tidbits of info from around the web, dereferencing them, and then presenting them to the user
  • scalability is important
    - get rid of some complex features
  • speed
  • adaptive allocation for reduced server load
  • tagging
    - relying on these is messy
  • photo albums online allow for quick searching
  • image labeling
    - could use ESP, but relies on users playing the game
    - OR let people tag as normal and then offline...
  • detect similar photos
  • overlap tags
  • collaborative tagging
(See also PSOX and information extraction using community efforts).
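The "detect similar photos" and "overlap tags" bullets above can be illustrated with a small sketch. The notes do not say how overlap was measured in the talk; Jaccard similarity (shared tags divided by all tags) is swapped in here as one common choice. The function names, the sample data, and the 0.5 threshold are hypothetical.

```python
from itertools import combinations


def jaccard(tags_a, tags_b):
    """Jaccard similarity of two tag collections: |A & B| / |A | B|."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged photos carry no evidence of similarity
    return len(a & b) / len(a | b)


def similar_photos(photo_tags, threshold=0.5):
    """Return (photo, photo, score) for pairs whose tag overlap is high.

    photo_tags maps a photo identifier to its tags. The threshold is an
    arbitrary illustrative value, not one taken from the talk.
    """
    pairs = []
    for (p, p_tags), (q, q_tags) in combinations(sorted(photo_tags.items()), 2):
        score = jaccard(p_tags, q_tags)
        if score >= threshold:
            pairs.append((p, q, score))
    # Highest overlap first.
    return sorted(pairs, key=lambda pair: -pair[2])


photo_tags = {
    "a.jpg": {"beach", "sunset", "sea"},
    "b.jpg": {"beach", "sea", "sand"},
    "c.jpg": {"city", "night"},
}
matches = similar_photos(photo_tags)
```

This compares every pair of photos, so it is quadratic in the number of photos. At the scale the notes have in mind, some form of indexing or offline batch processing would be needed.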