Thursday, October 27
CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms
Experiences Evolving a New Analytical Platform: What Works and What's Missing
Jeff Hammerbacher, Cloudera
Built the infrastructure team at Facebook as its data grew from 0 to 2 PB
Take the infrastructure and make it available as open source.
Philosophy
The true challenges in the task of data mining: creating a data set with relevant and accurate information, and determining the appropriate analysis techniques
Exploratory data processing (IBM)
Taught the data science course at Berkeley earlier this year
1) Store all your organization's data in one place
- data first, questions later
- store first, structure later
Engineers are constrained when you force them to stop and model the data, which is constantly evolving.
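To make "store first, structure later" concrete, here is a minimal sketch of applying structure at read time. The event records, field names, and the `read_with_schema` helper are all hypothetical, made up for illustration; nothing here comes from the talk or from any Hadoop API.

```python
import json

# Hypothetical raw event log: one JSON object per line, stored exactly as it arrived.
# Records do not share a fixed set of fields.
RAW_LINES = [
    '{"user": "alice", "action": "click", "ts": 1319700000}',
    '{"user": "bob", "action": "view", "ts": 1319700005, "page": "/home"}',
    '{"user": "alice", "action": "view", "page": "/about"}',
]


def read_with_schema(lines, fields):
    """Project each raw record onto `fields` at read time; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(name) for name in fields)


# The question is asked later: which pages did each user view?
views = [
    (user, page)
    for user, action, page in read_with_schema(RAW_LINES, ("user", "action", "page"))
    if action == "view"
]
print(views)
```

Nothing had to be modeled before the data was stored; a different question would just pass a different field list to the same raw lines.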
Raw storage: about $0.04 / GB ($67 for a 2 TB disk). A single HDFS instance can hold > 50 PB on commodity hardware in one data center
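Working from the quoted disk price, a quick back-of-the-envelope calculation gives the per-GB cost and the raw disk cost of a 50 PB store. This is only a sketch: decimal units and the replication factor of 3 are my assumptions (3 is the HDFS default, but the talk did not state it), and servers, power, and networking are ignored.

```python
DISK_PRICE_USD = 67.0       # price of one disk, as quoted in the notes
DISK_CAPACITY_GB = 2000.0   # 2 TB in decimal units (assumption)
REPLICATION = 3             # assumed replication factor, not stated in the talk

cost_per_gb = DISK_PRICE_USD / DISK_CAPACITY_GB


def raw_disk_cost(petabytes, replication=1):
    """Disk-only cost in USD for storing `petabytes` of data `replication` times."""
    gigabytes = petabytes * 1_000_000
    return gigabytes * replication * cost_per_gb


print(f"${cost_per_gb:.4f} per GB")
print(f"50 PB, single copy: ${raw_disk_cost(50):,.0f}")
print(f"50 PB, {REPLICATION} copies: ${raw_disk_cost(50, REPLICATION):,.0f}")
```

Under these assumptions the disks for a single copy of 50 PB come to roughly $1.7 million, which is the kind of number that makes "store everything" plausible.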
Enable everyone to party on the data. Use files because developers are not analysts.
Like the LAMP stack, there is now a coherent stack for analytical data management
Better underlying abstractions
Platform - Substrate
- commodity servers (a big warehouse)
-- Open Compute Project (open-sourced by Facebook)
- open source OS
-- Linux
- Open source config management
-- Puppet, Chef
- Coordination service
-- ZooKeeper
Platform - Storage
- Distributed, schemaless storage
-- HDFS, Ceph (UCSC), MapR
- Append-only table storage and metadata
-- Avro, RCFile, HCatalog (Also: Thrift, Protocol Buffers)
- Mutable table storage and metadata
-- HBase
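The difference between the two kinds of table storage listed above can be sketched with two toy in-memory classes. Both classes and their method names are hypothetical; they are not the Avro, RCFile, or HBase APIs, just an illustration of the access patterns.

```python
class AppendOnlyTable:
    """Rows can be added and scanned, never updated; the schema is kept as metadata."""

    def __init__(self, schema):
        self.schema = tuple(schema)
        self._rows = []

    def append(self, **fields):
        unknown = set(fields) - set(self.schema)
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        # Fields missing from the record are stored as None.
        self._rows.append(tuple(fields.get(name) for name in self.schema))

    def scan(self):
        return list(self._rows)


class MutableTable:
    """Rows are addressed by key and can be updated in place."""

    def __init__(self):
        self._rows = {}

    def put(self, key, **fields):
        self._rows.setdefault(key, {}).update(fields)

    def get(self, key):
        return self._rows.get(key)


events = AppendOnlyTable(schema=("user", "action"))
events.append(user="alice", action="click")
events.append(user="alice", action="view")

profiles = MutableTable()
profiles.put("alice", city="Berkeley")
profiles.put("alice", city="Glasgow")  # overwrites the earlier value

print(events.scan())
print(profiles.get("alice"))
```

The append-only table suits logs that are written once and scanned in bulk; the mutable table suits records that are looked up and changed by key.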
Platform - Compute
- Cluster resource management
-- YARN (inter-job scheduling, like Grid Engine, for data-intensive computing)
-- Mesos
- Processing Frameworks
-- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel
- High level interfaces
-- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive
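Of the processing frameworks listed above, MapReduce is the easiest to show in a few lines. The sketch below runs the map, shuffle, and reduce steps in a single Python process on a word-count job; it illustrates the programming model only and is not the Hadoop API. The function names and sample documents are my own.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["data first questions later", "store first structure later"])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the high-level interfaces such as Pig and Hive generate jobs of this shape so that users do not write them by hand.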
Platform - Integration
- Tool access
- Data ingest
-- Sqoop, Flume
(Document ingest is still an area that needs work. There are crawlers, but they're still immature)
Trends
fat servers with fat pipes
2U, 24 GB RAM, 12 drives (bigger nodes)
OS support for isolation (VMs have downsides)
Linux containers
-- Google contributed initial patches, used for Borg
Local file system improvements
-- btrfs
language
- ScalaQL