Tuesday, July 7

Hadoop Data Serialization Battle: Avro, Protocol Buffers, Thrift

My current research project involves processing large text corpora using Hadoop. I'll have more to say about the details in a future post. Right now, I want to focus on reading/writing custom data types in Hadoop.

For example, say you want to create an inverted index with a "Posting" data type that has a term (text), a docId (long), and an array of occurrences (ints). One standard way is to write a class that implements Hadoop's Writable interface (WritableComparable for keys; see the Serialization section of Tom White's book). However, writing these classes by hand is tedious, error-prone, and a maintenance hassle, and it has to be done for every non-trivial type you want to read or write. I'd like to avoid that if I can. Let's talk about the other options.
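Just to illustrate the boilerplate, here is a rough sketch (my own, not from any real codebase) of what a hand-rolled Writable for the Posting type might look like:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical hand-written Writable for the Posting example above.
    public class Posting implements Writable {
        private Text term = new Text();
        private long docId;
        private int[] occurrences = new int[0];

        // Serialize the fields in a fixed order.
        public void write(DataOutput out) throws IOException {
            term.write(out);
            out.writeLong(docId);
            out.writeInt(occurrences.length);
            for (int occ : occurrences) {
                out.writeInt(occ);
            }
        }

        // Deserialize the fields in exactly the same order as write().
        public void readFields(DataInput in) throws IOException {
            term.readFields(in);
            docId = in.readLong();
            int n = in.readInt();
            occurrences = new int[n];
            for (int i = 0; i < n; i++) {
                occurrences[i] = in.readInt();
            }
        }
    }

And that's before adding getters, equals/hashCode, and (for keys) compareTo. Multiply by every type in your pipeline and the appeal of a schema-based alternative is obvious.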

The first two obvious choices are Facebook's Apache Thrift (JIRA patch code) and Google Protocol Buffers (JIRA patch). Both provide an Interface Definition Language (IDL) that is not language specific, which is nice. They plug into Hadoop via the Serializer package. You can also read Tom White's blog post from last year on the topic.
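For a sense of scale, the same Posting type as a Protocol Buffers IDL definition (my own sketch, in the proto2 syntax of the day; the Thrift IDL version is nearly identical) is just a few lines:

    // Hypothetical .proto definition of the Posting type.
    message Posting {
        required string term = 1;
        required int64 doc_id = 2;
        repeated int32 occurrences = 3;
    }

The compiler then generates the (de)serialization code for you, in whichever supported languages you need.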

Of course, there is also Streaming for non-Java jobs, which serializes everything as strings over stdin and stdout. According to Yahoo!, it introduces a 20-100% overhead in job execution time. It's great for rapid prototyping, but maybe not so much for terabyte-scale data processing!

Then of course, there is the new kid on the block: Avro. Avro is a serialization and RPC framework created by Doug Cutting (Hadoop's creator) after a hard look at Protocol Buffers and Thrift. It defines a data schema that is stored beside the data (not combined with it). The schemas are defined in JSON, so they are easy to parse in many languages. One big difference is that Avro doesn't require generated code stubs, although you have the option to generate them for statically typed languages.
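To make that concrete, here is one way (my own sketch; the field names are mine) to express the Posting type as an Avro schema:

    {"type": "record",
     "name": "Posting",
     "fields": [
         {"name": "term", "type": "string"},
         {"name": "docId", "type": "long"},
         {"name": "occurrences", "type": {"type": "array", "items": "int"}}
     ]}

Since the schema is plain JSON, a dynamic language can read and write Posting records directly from it, with no compile step.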

As it happens, Doug just asked people to try out the 1.0 release candidate. It looks promising; there have also been several recent posts (one, two) by Johan Oskarsson of Last.fm.

You should also check out the data serialization benchmark. The results show that Avro is already quite competitive (except for timeCreate, which is odd).

In the meantime, comment and let me know your thoughts and experiences.

4 comments:

  1. Anonymous, 10:45 PM EDT

    The right solution is to use generics. Don't be fooled by those frameworks. It's bullshit.

  2. Avro is a very interesting project to me, since it takes away some concerns about dynamic (de)serialization in various programming languages.
    The open source "Coord" project has used gSOAP, but I will use Avro instead in the near future.

  3. Do you have any sample code on how to use Avro? I can't find any.

  4. I think an amazing solution is a combination of the three frameworks:

    protocol buffers: message serialization format and code generation

    thrift: service generation, plus a JBoss Remoting or Spring Remoting friendly Thrift stack (this needs some changes to the Thrift source)

    avro: amazing dynamic typing; it can serialize or deserialize a normal POJO
