Thursday, October 27

CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms


Experiences Evolving a New Analytical Platform: What Works and What's Missing
Jeff Hammerbacher, Cloudera

Built the infrastructure team at Facebook, which went from 0 to 2 PB of data

Take the infrastructure and make it available as open source.

Philosophy
The true challenges in data mining: creating a data set with relevant and accurate information, and determining the appropriate analysis techniques.

Exploratory data processing (IBM)

Taught the data science course at Berkeley earlier this year

1) Store all your organization's data in one place
  - data first, questions later
  - store first, structure later
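
A minimal sketch of "store first, structure later", assuming the raw data is kept as JSON lines. The sample records, the field names, and the read_with_schema helper are all hypothetical and not from the talk; the point is only that the fields are chosen when the data is read, not when it is written.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON object per line).
RAW_LOG = "\n".join([
    '{"user": "a", "action": "click", "ts": 1}',
    '{"user": "b", "action": "view", "ts": 2, "page": "/home"}',
    '{"user": "a", "action": "click", "ts": 3}',
])

def read_with_schema(raw, fields):
    """Apply structure at read time: keep only the requested fields,
    filling in None where a record lacks one."""
    for line in raw.splitlines():
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# The question (clicks per user) is decided after the data was stored.
clicks = {}
for row in read_with_schema(RAW_LOG, ["user", "action"]):
    if row["action"] == "click":
        clicks[row["user"]] = clicks.get(row["user"], 0) + 1

print(clicks)
```

Records with extra or missing fields do not break the reader, which is what lets the data keep evolving without a modeling step up front.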

Engineers are constrained when you force them to stop and model the data, which is constantly evolving.

Raw storage: about $0.03 / GB ($67 for a 2 TB disk). A single HDFS instance can exceed 50 PB on commodity hardware in one data center.
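
As a rough check of the storage figures, here is the arithmetic on the quoted disk price. It assumes decimal units (1 TB = 1000 GB) and counts raw capacity only, ignoring replication, servers, power and networking, so it is a lower bound and not a cluster cost estimate.

```python
# Figures from the notes: a 2 TB disk for $67, and a 50 PB cluster.
DISK_PRICE_USD = 67
DISK_SIZE_GB = 2 * 1000             # assumption: decimal units, 1 TB = 1000 GB
CLUSTER_SIZE_GB = 50 * 1000 * 1000  # 50 PB expressed in GB

cost_per_gb = DISK_PRICE_USD / DISK_SIZE_GB
disks_needed = CLUSTER_SIZE_GB // DISK_SIZE_GB  # raw capacity, no replication
raw_disk_cost = disks_needed * DISK_PRICE_USD

print(f"${cost_per_gb:.4f} per GB")
print(f"{disks_needed} disks, ${raw_disk_cost:,} in raw disk")
```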

 Enable everyone to party on the data.  Use files because developers are not analysts.

Like the LAMP stack, there is a coherent analytical data management stack.

Better underlying abstractions

Platform - Substrate
 - Commodity servers (a big warehouse)
  -- Open Compute Project (open sourced by Facebook)
 - Open source OS
  -- Linux
 - Open source config management
  -- Puppet, Chef
 - Coordination service
  -- ZooKeeper

Platform - Storage
 - Distributed, schemaless storage
  -- HDFS, Ceph (UCSC), MapR
 - Append-only table storage and metadata
  -- Avro, RCFile, HCatalog (also: Thrift, Protocol Buffers)
 - Mutable table storage and metadata
  -- HBase

Platform - Compute
 - Cluster resource management
  -- YARN (inter-job scheduling, like Grid Engine, for data-intensive computing)
  -- Mesos
 - Processing frameworks
  -- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel
 - High-level interfaces
  -- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive
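
For readers unfamiliar with the first processing framework listed, here is a single-process sketch of the MapReduce programming model (map, shuffle, reduce) as a word count. It is plain Python and not Hadoop's API; the function names and sample documents are made up for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input records.
documents = ["store first structure later", "data first questions later"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(counts)
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network. The high-level interfaces listed above exist so that users do not have to write these phases by hand.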

Platform - Integration
 - Tool access
 - Data ingest
  -- Sqoop, Flume
(Document ingest is still an area that needs work. There are crawlers, but they're still immature.)

Trends
 - Fat servers with fat pipes
  -- 2U, 24 GB RAM, 12 drives (bigger nodes)
 - OS support for isolation (VMs have downsides)
 - Linux containers
  -- Google contributed the initial patches, used for Borg
 - Local file system improvements
  -- btrfs

language
 - ScalaQL
