In the article, Dean and Ghemawat address the paper and attempt to debunk its claims, although they lack the benchmarks to back up their rebuttal. In the process, they explain how to run M/R jobs efficiently:
- Avoid starting processes for each new job; reuse workers.
- Shuffle data carefully; avoid O(M*R) disk seeks.
- Beware of text storage formats.
- Use natural indices like timestamps on files.
- Do not merge reducer output.
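As a rough illustration of the "natural indices" point, here is a minimal Python sketch that picks job inputs by the date encoded in each file name, so files outside the requested range are never opened. The `logs/YYYY-MM-DD.log` naming scheme and the helper names `file_date` and `select_inputs` are assumptions made for this example; they do not come from the article.

```python
from datetime import date
from pathlib import PurePosixPath


def file_date(path):
    """Parse the date encoded in a name such as 'logs/2010-01-02.log'."""
    # Hypothetical naming scheme: the file stem is an ISO date.
    return date.fromisoformat(PurePosixPath(path).stem)


def select_inputs(paths, start, end):
    """Keep only files whose date falls in [start, end], without reading them."""
    return [p for p in paths if start <= file_date(p) <= end]


paths = [
    "logs/2009-12-30.log",
    "logs/2009-12-31.log",
    "logs/2010-01-01.log",
    "logs/2010-01-02.log",
]

# Only the January 2010 files would be handed to the mappers.
selected = select_inputs(paths, date(2010, 1, 1), date(2010, 1, 31))
print(selected)
```

The file layout itself acts as the index: a job over one month touches only that month's files instead of scanning the whole data set.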
Their refutation offers some good M/R lessons. You should use a binary serialization system such as Avro or Protocol Buffers, and store your data in a format that provides efficient access, either through a natural file structure or through a database system like HBase.
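To make the text-versus-binary point concrete, here is a small sketch that writes the same records as tab-separated text and as fixed-width binary. It uses Python's standard `struct` module as a stand-in; Avro and Protocol Buffers add schemas and variable-length encodings that this example does not attempt to reproduce. The record layout (an 8-byte timestamp and a 4-byte count) is an assumption chosen for illustration.

```python
import struct

# Hypothetical record layout: little-endian 8-byte timestamp, 4-byte unsigned count.
RECORD = struct.Struct("<qI")


def encode_text(records):
    """Serialize records as tab-separated lines of decimal digits."""
    return "".join(f"{ts}\t{n}\n" for ts, n in records).encode("ascii")


def decode_text(data):
    """Split every line and convert every field from digits back to an integer."""
    return [
        tuple(int(field) for field in line.split("\t"))
        for line in data.decode("ascii").splitlines()
    ]


def encode_binary(records):
    """Serialize records as fixed-width binary, 12 bytes each."""
    return b"".join(RECORD.pack(ts, n) for ts, n in records)


def decode_binary(data):
    """Unpack fixed-width records directly, with no string parsing."""
    return list(RECORD.iter_unpack(data))


records = [(1262304000 + i, i * 7) for i in range(1000)]
text_blob = encode_text(records)
binary_blob = encode_binary(records)

print(len(text_blob), len(binary_blob))
```

Both decoders return the same records, but the text path has to split lines and parse digits for every field, which is the per-record overhead a binary format avoids.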