Wednesday, June 17

Future of Hadoop and Trends

I'm writing up my notes from last week's Hadoop Summit. Here are my notes from the Futures Panel as well as some other thoughts on Hadoop's evolution.

Futures Panel
Core by Sanjay Radia
  • Focus improving backwards compatibility (with 0.21 release)
  • APIs and configuration improvements – API cleanup slotted for 0.21
  • Near-term: Product file system
  • Beyond 2009: support for wide area and federated file systems. Live applications and online saving of data.
  • Performance – Write latency improvements, especially due to bad disks
  • Throughput – support for 5K nodes and 20K clients
  • Append and Flush support (allow applications to know when data has been written to disk)
  • Faster restart of Namenode server (it currently can take up to 15 minutes to load logs into memory and resume state)
Map/Reduce – Owen O'Malley, Y!
  • Move HDFS and Map/Reduce out (split projects)
  • 10 minute test and build process (goal: 0 warnings)
  • Stable interfaces
  • Commit the major shuffle phase re-write used for terabyte sort (low latency)
Shedulers – Y! uses Capacity scheduler, Facebook uses Fair Share. In 0.21 they will add support for dynamic priorities.

Oozie – A server based workflow engine for M/R jobs. It supports a directed acyclic graph (DAG) of actions.

AVRO – Doug Cutting, Y!
Yet another serialization system
  • Single system for both data serialization and RPC
  • Support Dynamic data types (self-describing data)
  • High-performance bulk data transfer in RPC
vs. Procol Buffers And Thrift
  • Both require code generation, which is awkward for scripting systems
  • Data is tagged (making it larger)
  • Use numeric field identifiers which is retro (not good)
The language for Schema (IDL) is JSON.

Hive and Hadoop @ Facebook by Joydeep Sarma
- See Hive Wiki. Lot of query language improvements, metadata enhancements (schema evolution), and performance (add indexes)

  • Performance improvements (latency). It is now 1.2-1.3x normal M/R jobs
  • Add a SQL interface (90% done). Make UDFs work in both.
  • Add metadata for grid (hierarchical data sets)
  • New storage access layer that is column-oriented. This allows fast projection and early row filtering. The new storage layer will provide great utility to HBase and other projects as well.
One area of active discussion is the fragmentation over different approaches to similar problems, particularly between Facebook and Yahoo!. Yahoo! is adding support for SQL to PIG which competes directly with Facebook's HIVE system. FB and Y! developed their own work flow systems, use different scheduling systems, and serialization frameworks (AVRO vs Thrift), and the list goes on. In the short-term this is creating a complicated landscape for new users. However, hopefully the competition and diversity of approaches will result in better products.

Other Trends
One clear trend was adoption by users outside of the core Java community. There was an analogy that writing pure M/R is like writing in assembly and the trend is to write applications in the newer high-level languages. The development of these new languages and tools is leading to increased Hadoop adoption by data warehousing specialists, analysts, and even non-programmers. Even for programmers, there is a trend away from pure M/R in Java to using less verbose languages.

Streaming is a utility that allows you to write MapReduce jobs with any executable or script. It has a text input and output model. This serialization overhead leads to a 20-100% overhead in your job (although the average is usually 20-30%). This lets you write your code in Ruby or Python.

PIG is a high-level declarative language with support for user-defined functions in multiple different languages. It's widely used at Y! for analytics in advertising and search log analysis. It will soon support a subset of SQL to make it easier to use and be faster as a result of further performance optimizations. Again, user functions are often being defined in Python and Ruby.

HIVE is a SQL layer for M/R data processing. It's used at Facebook for their data warehousing and analytics.

JAQL is a declarative language from IBM Research that lets programmers work at multiple levels of abstraction (not just high-level like PIG or low-level map-reduce). It provides a detailed query plan of what will run that can be modified and changed to suit your needs

Cascading - Allows complex data flows and processing. This is one of the new hot frameworks that got a lot of talk at the conference. I admit I'm not too familiar with it, but I'm interested in learning more.

There was an emphasis on rapid iteration and prototyping applications on big data quickly. These frameworks emphasize ease of use over performance. Performance is less of a concern when you have a large cluster (or Amazon EC2) and can throw massive amounts of hardware at the problem.

Monday, June 15

Large Graph Analysis at Google with Pregel

Today, on the Research blog Google talked about Pregel, the large-scale graph analysis system they use on the web graph and other graphs. Details are still emerging, but read the blog post for details.