Saturday, August 29

Hadoop: Major Platform Upgrades Coming Soon

The Hadoop world is evolving rapidly. Tom White has a presentation called "Hadoop Futures", available on SlideShare, that outlines some of the next major directions.

There are some important changes to keep your eye on. In the next month we will see major releases that will change the Hadoop landscape.

First up is the Hadoop 0.20.1 release. Although it is a point release, it is an important one: it will (likely) be used as the basis for the next Yahoo! and Cloudera distributions. Hadoop 0.20 was released in June, but it has not been widely adopted while some of the bugs were being worked out. Hadoop 0.20.x has a new API that will be used going forward. The upcoming point release has a lot of fixes and features, including the new TFile format. The new version is critical because it opens the way for releases from the sub-projects.

The 0.20.1 release paves the way for the Pig 0.5.0 release. Pig 0.4.0 and 0.5.0 will have significant performance and other improvements that have been developed over the past months. One key change is that 0.5.0 will likely include the Pig SQL support that is now in the trunk.

The release of HBase 0.20 is getting very close. Great presentations on the new release were given at the recent HBase User Group Meeting at StumbleUpon. Again, one of the key new features is a new HFile format, based on the TFile format that will be in the Hadoop 0.20.1 release.

In the very near future we will also see a bug-fix release of the Avro serialization system, Avro 1.0.1.

In short, by the middle to end of September we should see the adoption of a new and radically improved Hadoop platform. We can ditch the aging 0.18.x platform. We will finally be able to use the new scheduling systems and the simplified API, and to take advantage of significant performance and reliability improvements.