My current research project involves processing large text corpora using
Hadoop. I'll have more to say about the details in a future post. Right now, I want to focus on reading/writing custom data types in Hadoop.
For example, say you want to create an inverted index with a "Posting" data type that has a term (text), docId (long), and a array of occurrences (ints). One standard way is to write a class that implements Hadoop's Writable (and Comparable for keys) interfaces (see the
Serialization section of
Tom White's book). However, writing the classes is tedious, error prone, and a maintenance hassle which is done for every non-trivial value you want to write. If I can, I want to avoid this. Let's talk about other options.
The first two obvious choices are Facebook's
Apache Thrift (
JIRA patch code) and Google
Protocol Buffers (
JIRA patch). Both provide an Interface Definition Language (IDL) that is not language specific, which is nice. They plug into Hadoop via the
Serializer package. You can also read Tom White's
blog post from last year on the topic.
Of course, there is
Streaming for non-Java jobs which serializes everything as a string via stdin and stdout. It introduces a 20-100% overhead (according to Yahoo!) in job execution performance. It's great for rapid prototyping, but maybe not as much for terabyte scale data processing!
Then of course, there is the new kid kid:
Avro. Avro is a serialization and RPC framework created by
Doug Cutting (Hadoop creator) after a hard look at PB and Thrift. It defines a data schema which is stored beside the data (not combined with it). The schemas are defined in JSON, so they are easy to parse in many languages. One big difference is that Avro doesn't generate code stubs; although you have the option to do so for statically typed languages.
As it happens, Doug just asked people to try out the
1.0 release candidate. It looks promising, there were several recent posts (
one,
two) by
Johan Oskarsson from Last.Fm.
You should check out the
data serialization benchmark. The
benchmark results show that Avro is quite competitive already (except the timeCreate, which is odd).
In the meantime, comment and let me know your thoughts and experiences.