Presented by Owen O'Malley and Eric Baldeschwieler (Yahoo!)
Owen gave a brief overview of the history of Hadoop and the Hadoop ecosystem.
Yahoo! Hadoop distribution
The big news is that on stage Eric announced the release of the Yahoo! distribution of Hadoop, the same distribution Yahoo! uses internally. It's not new software, but a re-packaging of publicly available releases that have been tested on Yahoo!'s own clusters. They're starting with Hadoop 0.20 as an alpha release, and the distribution will continue to grow and stabilize from there.
Yahoo! now has dozens of clusters totaling 25,000+ nodes. The largest single cluster has about 4,000 nodes.
Contains the content and metadata that powers Yahoo! search. Two runs of the job, on successive cluster generations, were compared:

Run 1:
- 70 hours runtime
- 300 TB shuffled
- 200 TB output

Run 2:
- 73 hours runtime
- 490 TB shuffled
- 280 TB output
- 55%+ more hardware

First cluster:
- 2,000 nodes
- 6 PB raw disk
- 16 TB RAM
- 16k CPUs

Second cluster:
- 4,000 nodes
- 16 PB disk
- 64 TB RAM
- 32k CPUs (40% faster CPUs)
Major features coming to Hadoop
- Backwards compatibility (0.21 will make the last big API changes)
- Append, sync, and flush support
- Scheduling - the Capacity and Fair Share schedulers
- Continuous integration - easier to build and test Hadoop distributions
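On the scheduling point: the Capacity Scheduler in the 0.20 line is configured by declaring named queues and giving each a share of the cluster. A minimal sketch is below; the queue names (`default`, `research`) and the 70/30 split are illustrative, and the exact property names should be checked against the Capacity Scheduler documentation for the release you run.

```
<!-- capacity-scheduler.xml: assign each declared queue a percentage of cluster capacity -->
<configuration>
  <property>
    <name>mapred.capacity-scheduler.queue.default.capacity</name>
    <value>70</value>
  </property>
  <property>
    <name>mapred.capacity-scheduler.queue.research.capacity</name>
    <value>30</value>
  </property>
</configuration>
```

The queues themselves are declared separately (e.g. via `mapred.queue.names` in `mapred-site.xml`), and capacity unused by one queue can be borrowed by others until the owner needs it back.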
Pig - "Make pigs fly"
- Support for SQL and metadata
- Column oriented storage access layer (a new column-oriented storage view, not services like HBase)
- Multi-query optimizations
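To illustrate the multi-query optimization: when one Pig Latin script stores several results derived from the same input, Pig can share the underlying scan instead of reading the data once per output. A hypothetical sketch (the `logs` input and field names are made up for illustration):

```
-- One LOAD feeding two STOREs: with multi-query optimization,
-- both aggregates are computed in a shared pass over 'logs'.
A = LOAD 'logs' AS (user:chararray, time:long, query:chararray);

B = GROUP A BY user;
C = FOREACH B GENERATE group, COUNT(A);
STORE C INTO 'counts_by_user';

D = GROUP A BY query;
E = FOREACH D GENERATE group, COUNT(A);
STORE E INTO 'counts_by_query';
```

Without the optimization, each STORE would trigger its own MapReduce pipeline reading `logs` from scratch.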
Oozie is a new workflow and scheduling system
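Oozie workflows are defined declaratively as a graph of actions in an XML file. A minimal sketch of the shape, assuming a single map-reduce step (the action name, parameters like `${jobTracker}`, and paths are placeholders, not from the talk):

```
<workflow-app xmlns="uri:oozie:workflow:0.1" name="example-wf">
  <start to="mr-step"/>
  <action name="mr-step">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <!-- control flow: where to go on success or failure -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Real workflows chain multiple actions (MapReduce, Pig, and so on) through these `ok`/`error` transitions.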
There was a question about cluster management. Eric recommended Chukwa and Ganglia as open-source tools for large-scale cluster monitoring and management.