In the article, Dean and Ghemawat address the paper and attempt to debunk its claims, although they lack the benchmarks to back up their rebuttal. In the process, they explain how to run M/R jobs efficiently:
- Avoid starting new processes for each job; reuse workers.
- Shuffle data carefully; avoid O(M*R) disk seeks.
- Beware of text storage formats.
- Use natural indices like timestamps on files.
- Do not merge reducer output.
They present some good M/R lessons in their refutation. Use a binary serialization system such as Avro or Protocol Buffers, and store your data in a format that provides efficient access, whether through a natural file structure or a database system like HBase.
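To make the text-versus-binary point concrete, here is a minimal sketch in Python. It does not use Avro or Protocol Buffers; it swaps in the standard library's `struct` module as a stand-in for a binary format, and the record layout (timestamp, user id, latency) is a made-up example rather than anything from the article. The idea it illustrates: a text format has to be split and parsed field by field, while a fixed-width binary record can be decoded directly and even read at an arbitrary position without scanning.

```python
import struct

# Hypothetical record layout (an assumption for illustration):
# timestamp as int64, user id as int32, latency in ms as float64.
# "<" selects little-endian with no padding, so each record is 20 bytes.
RECORD = struct.Struct("<qid")


def encode_text(records):
    """Serialize records as tab-separated text, one record per line."""
    lines = (f"{ts}\t{uid}\t{lat!r}\n" for ts, uid, lat in records)
    return "".join(lines).encode("utf-8")


def decode_text(blob):
    """Parse tab-separated text: every field is split, then converted."""
    out = []
    for line in blob.decode("utf-8").splitlines():
        ts, uid, lat = line.split("\t")
        out.append((int(ts), int(uid), float(lat)))
    return out


def encode_binary(records):
    """Serialize records as fixed-width binary structs."""
    return b"".join(RECORD.pack(ts, uid, lat) for ts, uid, lat in records)


def decode_binary(blob):
    """Decode all fixed-width records without any string parsing."""
    return list(RECORD.iter_unpack(blob))


def read_nth_binary(blob, n):
    """Jump straight to record n; fixed width makes the offset computable."""
    return RECORD.unpack_from(blob, n * RECORD.size)


records = [(1_262_304_000 + i, i, i * 0.5) for i in range(1000)]
text_blob = encode_text(records)
binary_blob = encode_binary(records)
```

Both encodings round-trip the same data, but only the binary one supports `read_nth_binary` without scanning from the start. Note that binary is not automatically smaller: for short numeric fields like these, the text form can be more compact, so the win here is in parsing cost and access pattern. Real systems like Avro or Protocol Buffers add schemas and variable-length encodings on top of this basic idea.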

1 comment:
The "MapReduce is backwards" article moved after some time and has now completely disappeared. I suspect it has something to do with Vertica recently partnering with Cloudera and supporting Hadoop connectivity within the Vertica product. Still, that doesn't make Stonebraker's writings any less biased.