Hadoop 0.20.1 is finally here. Get it while it's hot! If you want, you can read the full release notes. This is the release to use if you are setting up a new cluster. It's also worth upgrading older pre-0.20.x clusters to this release.
Hadoop 0.20.x is very different from previous releases. The configuration and APIs have been overhauled. As previously mentioned, there is the new TFile storage format.
Look for the imminent release of Pig 0.4 and of the Cloudera distribution CDH2 0.20.x with Hive and Pig support.
Wednesday, September 16
This post continues the series (part I: search, part II: machine learning) of Henry's notes from the Yahoo! Key Scientific Challenges summit. Today we are covering Brian Cooper's talk on challenges in "Web Information Management," which deals with structured data, unstructured data, and extracting structure from unstructured data.
- Goal: from unstructured -> structured
- Goal: system and process for fast and easy building of domain-centric operations.
- exploits structured regularities/proxies to nested concepts
- lists, records, attributes
- create business directories for store locations
- pulling useful tidbits of info from around the web, dereferencing them, and then presenting them to the user
- scalability is important
- get rid of some complex features
- adaptive allocation for reduced server load
- relying on these is messy
- photo albums online allow for quick searching
- image labeling
- could use ESP, but relies on users playing the game
- OR let people tag as normal and then offline...
- detect similar photos
- overlap tags
- collaborative tagging
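
The offline tagging idea in the last few bullets (let people tag as normal, then detect similar photos and overlap their tags) can be sketched roughly as below. This is only an illustration, not the method from the talk: the feature vectors, the cosine similarity measure, the `threshold` and `min_votes` parameters, and the function names are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_tags(photos, threshold=0.9, min_votes=2):
    """Suggest new tags for each photo from the tags of similar photos.

    photos: dict mapping photo id -> (feature_vector, set_of_user_tags).
    A tag is suggested when at least `min_votes` similar photos carry it
    and the photo does not already have it. Both knobs are hypothetical.
    """
    suggestions = {}
    for pid, (vec, tags) in photos.items():
        votes = Counter()
        for other_id, (other_vec, other_tags) in photos.items():
            # Compare against every other photo (quadratic; fine for a sketch).
            if other_id != pid and cosine(vec, other_vec) >= threshold:
                votes.update(other_tags)
        suggestions[pid] = {
            tag for tag, count in votes.items()
            if count >= min_votes and tag not in tags
        }
    return suggestions


# Toy data: three visually similar beach photos and one unrelated photo.
photos = {
    "a": ([1.0, 0.0], {"beach", "sunset"}),
    "b": ([0.9, 0.1], {"beach"}),
    "c": ([1.0, 0.05], {"beach", "sunset"}),
    "d": ([0.0, 1.0], {"city"}),
}

result = suggest_tags(photos)
```

Here photo "b" picks up "sunset" because two similar photos already carry that tag, while the unrelated photo "d" gets no suggestions. A real system would need a proper image similarity measure and an index instead of the all-pairs loop.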