Monday, December 28

Dean and Ghemawat Strike Back on MapReduce

Jeff Dean and Sanjay Ghemawat wrote an article for the January edition of CACM, MapReduce: A Flexible Data Processing Tool. In it, they refute the findings of A Comparison of Approaches to Large-Scale Data Analysis. Two of that paper's authors had earlier written a blog post bashing MapReduce: MapReduce: A major step backwards. The post is no longer available, but thankfully Greg had good coverage.

In the article, Dean and Ghemawat address the paper and attempt to debunk its claims, although they provide no benchmarks of their own to back up the rebuttal. In the process, they explain the right way to run M/R jobs efficiently:
  1. Avoid starting a new process for every job; reuse workers (a configuration sketch follows the list).
  2. Shuffle data carefully to avoid O(M*R) disk seeks.
  3. Beware of inefficient text storage formats.
  4. Use natural indices, such as timestamps on files.
  5. Do not merge reducer output.
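
For the first point, here is a minimal sketch of worker reuse on Hadoop, assuming the old mapred API; JobConf.setNumTasksToExecutePerJvm(-1) lets a single task JVM serve many tasks instead of forking a fresh process each time. The class and job name are placeholders:

import org.apache.hadoop.mapred.JobConf;

public class ReuseWorkersExample {
    public static JobConf configure() {
        // Placeholder job class and name, for illustration only.
        JobConf conf = new JobConf(ReuseWorkersExample.class);
        conf.setJobName("reuse-workers-example");
        // -1 means "no limit": each task JVM is reused for any number
        // of tasks, avoiding per-task process startup cost.
        conf.setNumTasksToExecutePerJvm(-1);
        return conf;
    }
}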
They present some good M/R lessons in their refutation: use a binary serialization system like Avro or Protocol Buffers, and store your data in a format that provides efficient access, whether through a natural file structure or a database system like HBase.
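
As a rough illustration of the serialization advice, here is a minimal sketch using Avro's Java API; the PageView schema, field names, and output path are made up for this example:

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    // Hypothetical two-field record schema, declared inline.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
        + "{\"name\":\"url\",\"type\":\"string\"},"
        + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Build one record; in a real job these would come from your data.
        GenericRecord view = new GenericData.Record(schema);
        view.put("url", "http://example.com/");
        view.put("timestamp", System.currentTimeMillis());

        // Write to a binary Avro container file instead of delimited text.
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("pageviews.avro"));
        writer.append(view);
        writer.close();
    }
}

The container file embeds its schema and is splittable, so later MapReduce jobs can read it efficiently without re-parsing text.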

1 comment:

  1. The "MapReduce is backwards" article moved after some time and now has completely disappeared. I suspect it has something to do with Vertica recently partnering with Cloudera and supporting Hadoop connectivity within the Vertica product. Still, doesn't make Stonebraker's writings any less unbiased.