Tuesday, April 27

Social Graph Storage and Analysis at Twitter

A quick pointer to "NoSQL at Twitter", a recent presentation by Kevin Weil on data storage and crunching at Twitter.

Here are a few highlights of the different systems he describes:
  • Scribe - a distributed event logging framework open sourced by Facebook. The log data is then stored in Hadoop and analyzed using Pig. See the related Elephant-Bird project for reading/writing Protocol Buffer data in Hadoop.

  • HBase - used for offline analytics on data imported from online data sources. For example, the data is used to power the Hawkwind people search system.

  • Scaling Twitter with Cassandra - Cassandra is used as a real-time store for tweets.

  • FlockDB - a simplified distributed graph store (compared with Neo4j) for storing social network data. It is an online system for computing "who follows whom" set operations, built on top of the Gizzard framework for sharding and replicating data stores across a cluster.
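
To make the "who follows whom" set operations a bit more concrete, here is a minimal in-memory sketch in Python. It is not FlockDB's API or storage layout: the FollowGraph class, its method names, and the sample users are all made up for illustration, and it leaves out sharding and replication entirely. It only shows the kind of queries (mutual follows, common followers) that reduce to set intersection and difference.

```python
from collections import defaultdict


class FollowGraph:
    """Toy in-memory follow graph (hypothetical; not FlockDB's API)."""

    def __init__(self):
        # Each edge is stored in both directions so that
        # "who does X follow" and "who follows X" are both one lookup.
        self.following = defaultdict(set)  # user -> users they follow
        self.followers = defaultdict(set)  # user -> users who follow them

    def follow(self, source, target):
        """Record that `source` follows `target`."""
        self.following[source].add(target)
        self.followers[target].add(source)

    def mutual_follows(self, user):
        """Users who follow `user` and are followed back (intersection)."""
        return self.following[user] & self.followers[user]

    def common_followers(self, a, b):
        """Users who follow both `a` and `b` (intersection)."""
        return self.followers[a] & self.followers[b]

    def followers_only_of(self, a, b):
        """Users who follow `a` but not `b` (difference)."""
        return self.followers[a] - self.followers[b]


graph = FollowGraph()
edges = [
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "alice"),
    ("carol", "bob"),
    ("dave", "alice"),
]
for source, target in edges:
    graph.follow(source, target)

print(sorted(graph.mutual_follows("alice")))             # ['bob']
print(sorted(graph.common_followers("alice", "bob")))    # ['carol']
print(sorted(graph.followers_only_of("alice", "bob")))   # ['bob', 'dave']
```

Keeping both directions of every edge doubles the writes but makes each query a plain set operation on two lookups, which is the property the sketch is meant to show.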

Overall, it's great to see Twitter contributing to the open source ecosystem. Many of their distributed systems use Scala, which is quite interesting.
