Wednesday, June 10

Hadoop Summit: Cloudera on the Hadoop community

Christophe Bisciglia gave an overview of the growth in the community, downloads, etc. He outlined some of the work Cloudera is doing to make things easier for developers:

Sqoop is a database import tool for Hadoop: it inspects tables, autogenerates code, extracts the data into HDFS, and imports it into Hive for analysis.
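A minimal sketch of that flow, assuming a Sqoop-style command line (the connection string, database, and table name are made up, and exact flags vary by Sqoop version):

```shell
# Illustrative only: pull a MySQL table into HDFS and register it in Hive.
# db.example.com, sales, and orders are hypothetical.
sqoop --connect jdbc:mysql://db.example.com/sales \
      --table orders \
      --hive-import
```

The `--hive-import` step is what closes the loop: the generated Hive table definition means the data is queryable immediately after the import finishes.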

They also worked on Hadoop support with Amazon's Elastic Block Storage (EBS). See their blog post.

They are looking forward to Hadoop 0.20 and just announced beta packages for it. They are also packaging Yahoo!'s raw release for Debian/Red Hat.

Peter Skomoroch created a cool demo called "Trending Topics" in one week. The Wikipedia data is stored in EBS, and the MySQL data is imported via Sqoop into Hive for analysis/R&D.
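As a sketch of the kind of Hive analysis that setup enables (the table and column names here are hypothetical, not from the demo):

```shell
# Illustrative top-N query over an imported page-view table.
# 'pageviews', 'page', and 'views' are hypothetical names.
hive -e "
  SELECT page, SUM(views) AS total
  FROM pageviews
  GROUP BY page
  ORDER BY total DESC
  LIMIT 10;
"
```

Because Sqoop lands the data as Hive tables, this kind of ad-hoc SQL is available without writing any MapReduce code.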

Cloudera is hosting Hadoop Summit East in NYC on October 2nd. Submissions close on July 31st. Early registration is open.

Most of the Cloudera training is available online for free. Also be sure to check out Cloudera's blog.

Hadoop Summit Coverage: State of Hadoop

Presented by Owen O'Malley and Eric Baldeschwieler (Yahoo!)

Owen gave a brief overview of the history of Hadoop and the Hadoop ecosystem.

Yahoo! Hadoop distribution

The big news is that on stage Eric announced the release of the Yahoo! distribution, the distribution that Yahoo! uses internally. It's not new software, but a re-packaging of publicly available releases that have been tested internally on Yahoo!'s clusters. They're starting with Hadoop 0.20 as an alpha release, and the distribution will continue to grow and stabilize.

Yahoo! now has dozens of clusters totaling 25,000+ nodes. The largest cluster has about 4,000 nodes.

It contains the content and metadata that power Yahoo! Search.

Job stats
In 2008
  • 70 hours runtime
  • 300 TB shuffling
  • 200 TB output
In 2009
  • 73 hours
  • 490 TB shuffling
  • 280 TB output
  • 55%+ hardware
Cluster stats
In 2008
  • 2000 nodes
  • 6 PB raw disk
  • 16 TB RAM
  • 16k CPUs
In 2009
  • 4000 nodes
  • 16 PB disk
  • 64 TB RAM
  • 32k CPUs (40% faster cpus)
Major features coming to Hadoop
  • Backwards compatibility (0.21 will make the last big API changes)
  • Append, sync, and flush support
  • Scheduling - Capacity and Fairshare
  • Continuous integration - easier to build and test Hadoop distributions
Pig - "Make pigs fly"
  • Support for SQL and metadata
  • Column oriented storage access layer (a new column-oriented storage view, not services like HBase)
  • Multi-query optimizations
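To illustrate the multi-query idea, here is a hypothetical Pig Latin script (relation names, fields, and output paths are all made up): with multi-query execution, Pig shares the single LOAD of the input across both output pipelines instead of scanning the data once per STORE.

```shell
# Hypothetical Pig script; 'logs', the field names, and the output
# paths are illustrative.
pig <<'EOF'
raw = LOAD 'logs' AS (user:chararray, url:chararray);

-- two pipelines that consume the same input
g1 = GROUP raw BY user;
by_user = FOREACH g1 GENERATE group, COUNT(raw);

g2 = GROUP raw BY url;
by_url = FOREACH g2 GENERATE group, COUNT(raw);

-- with multi-query optimization, both STOREs are planned
-- off a single scan of 'logs'
STORE by_user INTO 'counts_by_user';
STORE by_url  INTO 'counts_by_url';
EOF
```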
Oozie is a new workflow and scheduling system

There was a question about cluster management. Eric recommended Chukwa and Ganglia as open-source tools for large-scale cluster management.

Tuesday, June 9

Hadoop event coverage

I'm in Santa Clara for the ScaleCamp event, the Hadoop Summit, and Hadoop training. I'll be blogging my notes from the conferences starting later tonight.