Thursday, October 27

CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms

Experiences Evolving a New Analytical Platform: What Works and What's Missing
Jeff Hammerbacher, Cloudera

Built the infrastructure team at Facebook, 0 to 2 PB of data

Take the infrastructure and make it available as open source.

The true challenges in the task of data mining: creating a data set with the relevant and accurate information, and determining the appropriate analysis techniques.

Exploratory data processing (IBM)

Taught the data science course at Berkeley earlier this year

1) Store all your organization's data in one place
  - data first, questions later
  - store first, structure later

Engineers are constrained when you force them to stop and model the data, which is constantly evolving.
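
To illustrate "store first, structure later", here is a minimal sketch of applying structure at read time rather than at write time. The event log, its field names, and the `read_with_schema` helper are all hypothetical and not from the talk; the point is only that records are stored raw and each question picks the fields it needs.

```python
import io
import json

# Hypothetical raw event log: one JSON object per line, stored as-is.
# The records deliberately do not share a fixed set of fields.
RAW_LOG = io.StringIO(
    '{"user": "a", "action": "click", "ms": 120}\n'
    '{"user": "b", "action": "view"}\n'
    '{"user": "a", "action": "click", "ms": 80, "referrer": "home"}\n'
)


def read_with_schema(lines, fields, defaults=None):
    """Project each raw record onto the fields a particular question needs."""
    defaults = defaults or {}
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f, defaults.get(f)) for f in fields}


# The "schema" is chosen at read time, once the question is known.
clicks = [
    row
    for row in read_with_schema(RAW_LOG, ["user", "ms"], {"ms": 0})
    if row["ms"] > 0
]

# Aggregate total milliseconds per user over the projected rows.
total_ms_by_user = {}
for row in clicks:
    total_ms_by_user[row["user"]] = total_ms_by_user.get(row["user"], 0) + row["ms"]

print(total_ms_by_user)
```

New fields (such as `referrer` here) can appear in the raw data without any change to what was stored earlier; only readers that care about them need to ask for them.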

Raw storage: about $0.03 / GB ($67 for a 2 TB disk). Single HDFS instance > 50 PB on commodity hardware in one data center.
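
As a sanity check on the raw storage price, the disk figure quoted above ($67 for a 2 TB disk) can be converted to a per-GB cost. This is a rough sketch that assumes decimal units (1 TB = 1000 GB) and counts disk cost only; replication, servers, and power are ignored.

```python
# Figures taken from the notes: $67 for a 2 TB disk.
disk_price_usd = 67.0
disk_capacity_gb = 2 * 1000  # assumption: decimal terabytes (1 TB = 1000 GB)

cost_per_gb = disk_price_usd / disk_capacity_gb

# Raw disk cost for 50 PB at that price (assumption: 1 PB = 1,000,000 GB).
gb_per_pb = 1000 * 1000
raw_cost_50pb = cost_per_gb * 50 * gb_per_pb

print(f"cost per GB: ${cost_per_gb:.4f}")
print(f"raw disk cost for 50 PB: ${raw_cost_50pb:,.0f}")
```

At these prices the disks work out to roughly three cents per GB.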

 Enable everyone to party on the data.  Use files because developers are not analysts.

Like the LAMP stack, there is a coherent analytical data management stack.

Better underlying abstractions

Platform - Substrate
 - Commodity servers (a big warehouse)
  -- Open Compute Project (Facebook open source)
 - Open source OS
  -- Linux
 - Open source config management
  -- Puppet, Chef
 - Coordination service
  -- ZooKeeper

Platform - Storage
 - Distributed, schemaless storage
  -- HDFS, Ceph (UCSC), MapR
 - Append-only table storage and metadata
  -- Avro, RCFile, HCatalog (also: Thrift, Protocol Buffers)
 - Mutable table storage and metadata
  -- HBase

 - Cluster resource management
  -- YARN (inter-job scheduling, like Grid Engine, for data-intensive computing)
  -- Mesos
 - Processing frameworks
  -- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel
 - High-level interfaces
  -- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive
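
For readers unfamiliar with the MapReduce model named in the list above, here is a single-process sketch of its three steps (map, group by key, reduce) as a word count. This is plain Python for illustration only; it is not Hadoop's API, and the function names and sample documents are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


# Hypothetical input: each string stands in for one input split.
documents = ["store first structure later", "data first questions later"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```

In a real cluster the map and reduce calls run on many machines and the grouping step moves data over the network; the higher-level interfaces listed above generate jobs of this shape so that users do not write them by hand.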

 - Tool access
 - Data ingest
  -- Sqoop, Flume
(Document ingest is still an area that needs work. There are crawlers, but they're still immature.)

Fat servers with fat pipes
2U, 24 GB RAM, 12 drives (bigger nodes)
OS support for isolation (VMs have downsides)
Linux containers
 -- Google contributed initial patches, used for Borg
Local file system improvements
 -- btrfs

 - scalaql

