Thursday, May 24

HBase: Powerset's BigTable

Jim Kellerman, a senior engineer at Powerset, has started HBase, an open source implementation of Google's BigTable.

From the HBase wiki:
Design (and subsequently implement) a structured storage system as similar to Google's Bigtable as possible for the Hadoop environment. Both Google's Google File System and Hadoop's HDFS provide a mechanism to reliably store large amounts of data. However, there is not really a mechanism for organizing the data and accessing only the parts that are of interest to a particular application... Bigtable (and Hbase) provide a means for organizing and efficiently accessing these large data sets.
Current status:

As of this writing, there is just shy of 9000 lines of code in "src/contrib/hbase/src/java/org/apache/hadoop/hbase/" directory on the Hadoop SVN trunk.

There are also about 2500 lines of test cases.

All of the single-machine operations (safe-committing, merging, splitting, versioning, flushing, compacting, log-recovery) are complete, have been tested, and seem to work great.

The multi-machine stuff (the HMaster, the HRegionServer, and the HClient) are in the process of being debugged. And work is in progress to create scripts that will launch the HMaster and HRegionServer on a Hadoop cluster.
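
To make "organizing and efficiently accessing" a bit more concrete, here is a minimal single-machine sketch of the data model the Bigtable paper describes: a sorted map from row key, column, and timestamp to a value. This is not HBase's API. The class name, method names, and sample row keys are all made up for illustration, and none of the real machinery (logging, flushing, compaction, region splitting) is modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration of a Bigtable-style sorted, versioned map.
// Not HBase code; every name here is an assumption made for this sketch.
final class VersionedTableSketch {
    // row key -> column name -> timestamp -> value, each level kept sorted
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    // Store one version of a cell.
    void put(String row, String column, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column, c -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Newest version of a cell, or null if the cell does not exist.
    String get(String row, String column) {
        return get(row, column, Long.MAX_VALUE);
    }

    // Newest version written at or before asOf, or null if there is none.
    String get(String row, String column, long asOf) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null) return null;
        NavigableMap<Long, String> versions = columns.get(column);
        if (versions == null) return null;
        Map.Entry<Long, String> entry = versions.floorEntry(asOf);
        return entry == null ? null : entry.getValue();
    }

    // Row keys in [startRow, endRow), in sorted order. Because rows are
    // sorted, a scan touches only the slice the caller asked for.
    List<String> scanRows(String startRow, String endRow) {
        return new ArrayList<>(rows.subMap(startRow, endRow).keySet());
    }

    public static void main(String[] args) {
        VersionedTableSketch table = new VersionedTableSketch();
        table.put("com.example/a", "contents:", 1L, "v1");
        table.put("com.example/a", "contents:", 2L, "v2");
        table.put("com.example/b", "contents:", 1L, "b1");
        table.put("org.example/c", "contents:", 1L, "c1");

        System.out.println(table.get("com.example/a", "contents:"));     // v2
        System.out.println(table.get("com.example/a", "contents:", 1L)); // v1
        System.out.println(table.scanRows("com.example/a", "org"));      // [com.example/a, com.example/b]
    }
}
```

Keeping rows sorted by key is what lets an application read only the range it cares about, and keeping several timestamped values per cell is the "versioning" mentioned in the status list above.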

Jim became a committer on Hadoop (Doug Cutting and Yahoo's open source Map-Reduce framework) last week - congratulations!

Hopefully, it won't be long before HBase and the related Google clones mature and we have robust, open source Java implementations of much-needed infrastructure: GFS (HDFS), Map-Reduce (Hadoop), Sawzall (Pig, see my previous discussion), and BigTable (HBase). And we can thank Yahoo, Powerset, and the other supporters. Keep up the good work!