Tuesday, July 8

Cool new serialization library: Google Protocol Buffers

Update: Tom White, from Hadoop, has a follow-up on Hadoop's serialization method and the possible application of Google Protocol Buffers.

Originally via Matt Cutts.
Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format...

As the system evolved, it acquired a number of other features and uses:
  • Automatically-generated serialization and deserialization code avoided the need for hand parsing.
  • In addition to being used for short-lived RPC (Remote Procedure Call) requests, people started to use protocol buffers as a handy self-describing format for storing data persistently (for example, in Bigtable).
  • Server RPC interfaces started to be declared as part of protocol files, with the protocol compiler generating stub classes that users could override with actual implementations of the server's interface.
Protocol buffers are now Google's lingua franca for data – at time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.
I'm a big fan of compact and simple serialization formats. We've developed a similar system here at Globalspec for our inter-system communication. If I'm allowed, perhaps I can elaborate more in the future. It's really exciting to take a look at Google's solution to a similar problem. One of the coolest features is that their protocol language generates stubs for multiple languages: C++, Java, and Python.

I wonder if Hadoop/HBase could leverage this as a way to store serialize data in the file system.

I can't wait to try it out, this looks incredibly useful. Thank you Google developers!

Update: I ran across Thrift, which is Facebook's model for cross-language service communication. Thrift is now being spun off as a new Apache incubator project, Apache Thrift.
Thrift allows you to define data types and service interfaces in a simple definition file. Taking that file as input, the compiler generates code to be used to easily build RPC clients and servers that communicate seamlessly across programming languages.
I don't have time for a more complete comparison, but a few differences:
1) Thrift defines services and communication, Google protocol buffer is only a way to serialize data
2) Thrift is a c-like language, Google's protocol buffers is a specialized data language with very simple structures

2 comments:

  1. Hey, I love your blog. One note: Thrift isn't a C-like language at all. It's a data type and procedure types specification language. Its data type specification component is very similar to Google Protocol Buffers. There was at some point talk of using Thrift's serialization system to make essentially the same thing as PB. Or you could build a Thrift-like RPC system on top of PB (as I believe Google has done internally, but not released).

    ReplyDelete
  2. Thanks Brendan.

    I admit that my comparison was thrown together at the last minute when I found Thrift ;-).

    I briefly looked through the tutorial and saw "include" for packages as well as the "c style" typedefs and enums. Also on the service definitions, "A method definition looks like C code." All of this led me to that conclusion.

    Maybe this weekend I'll get to play with them a little more.

    ReplyDelete