Wednesday, August 12

The Google File System Evolved: Real-Time User Applications

The ACM has an interview with Sean Quinlan on the evolution of the Google File System.

They talk about the issues Google dealt with as GFS has evolved, with an emphasis on the move to a distributed master design.
Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters.
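To make that concrete, here's a minimal sketch, purely my own illustration rather than Google's actual design, of how a client might decide which of hundreds of masters owns a given file's metadata; the class and method names are invented:

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical illustration, not Google's design: route a file path
    // to the master that owns its metadata by hashing the path.
    public class MasterLocator {
        private final List<String> masters; // e.g., "master-17:9000"

        public MasterLocator(List<String> masters) {
            this.masters = masters;
        }

        // Map a file path to the master responsible for its metadata.
        public String masterFor(String path) {
            // Mask the sign bit so the bucket index is non-negative.
            int bucket = (path.hashCode() & 0x7fffffff) % masters.size();
            return masters.get(bucket);
        }

        public static void main(String[] args) {
            MasterLocator locator = new MasterLocator(Arrays.asList(
                "master-0:9000", "master-1:9000", "master-2:9000"));
            System.out.println(locator.masterFor("/logs/2009/08/12/part-00001"));
        }
    }

Hash-based placement like this is only the simplest option; a real system also has to handle rebalancing and master failover, which is presumably where much of the new design's complexity lives.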
Towards the end, Sean discusses how GFS is evolving beyond its batch-oriented design to meet the needs of user-facing, latency-sensitive applications, which often use BigTable to store structured data:
... engineers at Google have been working for much of the past two years on a new distributed master system designed to take full advantage of BigTable to attack some of those problems that have proved particularly difficult for GFS
I'm sure the Hadoop and HBase teams will find it interesting reading. I haven't had a chance to read the entire interview in detail because I'm leaving for a week-long vacation on Cape Cod. Don't expect many updates from the beach!

Tuesday, August 11

Hadoop Founder Doug Cutting Leaving Yahoo! for Cloudera

Doug Cutting, creator of the Hadoop project, is jumping ship at Yahoo! and joining Cloudera, a Hadoop-centric startup offering enterprise support and services.

Doug has a post on his blog describing the move. According to the NY Times interview, the decision is unrelated to the Microsoft takeover of Y! search. Doug reports in the interview,
"This has been in the works for awhile and is unrelated," Mr. Cutting said. "I am definitely not leaving in any sort of protest, and the thing I like least about this move is that it might be perceived that way."
Congratulations to Doug on the new position and to Cloudera for the big win. Having project leaders outside Yahoo! is important for the ecosystem. As Doug works on projects for other clients, it will mean that the future of Hadoop will be driven by the needs of the greater community rather than the internal needs of Yahoo!.

Facebook Makes Updates, Photos, Links, and Videos Searchable

Facebook is rolling out new search functionality to better compete with Twitter search.

Akhil, the Engineering Director, has a post describing the new types of content that Facebook is making searchable.
You now will be able to search the last 30 days of your News Feed for status updates, photos, links, videos and notes being shared by your friends and the Facebook Pages of which you're a fan.


Google Unveils New 'Caffeine' Search Infrastructure Update

Caffeine is a top-secret project to rewrite Google's indexing system, and it's finally being released. According to this interview with Matt Cutts, infrastructure-wise it compares with the BigDaddy update in 2006. There have been major changes under the hood to make indexing more flexible, faster, and more robust. According to the Google post:
For the last several months, a large team of Googlers has been working on a secret project: a next-generation architecture for Google's web search.
You can try an index served on the new architecture in the sandbox they set up. Notice anything different?

Matt Cutts has a post on his blog. The infrastructure team has been working hard,
...a few weeks ago, I joked that the half-life of code at Google is about six months. That means that you can write some code and when you circle back around in six months, about half of that code has been replaced with better abstractions or cleaner infrastructure...
Congratulations to the infrastructure team: I didn't notice a significant difference in the results, which is exactly what you want from an under-the-hood rewrite. I expect this will help Google significantly increase the size and freshness of their index.

You may remember Cuil. Despite getting knocked pretty hard, Cuil was not about next-generation ranking; it was about infrastructure. Read my post for details. It's not clear, but perhaps the Caffeine update tackles some of the issues that Anna Patterson, former Google infrastructure architect, recounted in an interview about Cuil:
If they [Google] wanted to triple the size of their index, they'd have to triple the size of every server and cluster. It's not easy or fast...increasing the index size will be a 'non-trivial' exercise.

Has Google tackled these architecture issues with 'Caffeine'? We may never know.

Monday, August 10

Hadoop Summit Video Roundup

Yahoo! has posted several new videos from the Hadoop summit held in June. Here's a roundup with links to the videos posted so far:
  • State of Hadoop
    Owen O'Malley, Eric Baldeschwieler, and Yahoo!'s Hadoop team talk about their work with Hadoop over the last year, including core capabilities and related sub-projects, deployment experiences, and future directions.

  • HBase Goes RealTime
    HBase is a storage system built on top of HDFS. The guiding philosophy of this release: to unjava-fy everything. Some of the major changes: a new key format, a new file format (HFile), a new query API, a new result API and optimized serialization, new scanner abstractions, and a new concurrent LRU block cache. (A minimal sketch of the new client API appears after this list.)

  • Hive
    In this talk, Namit Jain and Zheng Shao discuss how and why Facebook uses Hive. They present Hive's progress and roadmap and describe how the open source community can contribute to the evolution of Hive...in March 2008 the service was generating about 1TB per day; by mid-2009, data production had increased to 10TB per day.

  • Hadoop Futures Panel
    Yahoo!'s Sanjay Radia discusses backwards compatibility and the future of HDFS; Owen O'Malley covers MapReduce and security futures; Doug Cutting, the father of Hadoop, talks about Avro, a serialization system; Cloudera's Tom White discusses tools and usability; Facebook's Joydeep Sen Sarma talks about Hive; and Yahoo!'s Alan Gates looks at Pig, SQL, and metadata.

  • Scaling Hadoop for Multi-Core and Highly Threaded Systems
    Here they present the basic architecture of CMT (chip multi-threading) processors, designed by Sun for maximum throughput, and then describe the work the team did with Hadoop and virtualization technologies to help Hadoop scale on CMT systems.

  • Running Hadoop in the Cloud by Tom White
    He opens with a discussion of the Berkeley RAD Lab paper on cloud computing and walks us through a set of definitions to a discussion of the public cloud. He sees a realm of interesting possibilities: an apparently infinite resource; the elimination of user commitment; and the pay-as-you-go model, which enables elasticity. Tom describes the implementation of Hadoop in this landscape.

  • Amazon Elastic MapReduce
    Amazon Web Services (AWS) evangelist Jinesh Varia presents Amazon's Elastic MapReduce, a web service that simplifies large-scale data processing for a growing ecosystem of AWS users.

  • The Growing Hadoop Community
    Cloudera co-founder Christophe Bisciglia takes a detailed look at the growth and evolution of Hadoop technology and community over the past year.
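As promised in the HBase item above, here's a minimal sketch of what client code looks like against the reworked HBase 0.20 API, using the new Get/Put classes and Result type. The "pages" table and "content" column family are invented for illustration, and this assumes a running HBase instance reachable from the default configuration:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseQuickstart {
        public static void main(String[] args) throws Exception {
            // Assumes the "pages" table with a "content" column family
            // has already been created.
            HTable table = new HTable(new HBaseConfiguration(), "pages");

            // Write one cell: row key, family:qualifier, value.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read it back through the new Get/Result API.
            Get get = new Get(Bytes.toBytes("com.example/index.html"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("content"),
                                           Bytes.toBytes("html"));
            System.out.println(Bytes.toString(value));
        }
    }

Everything moves through byte arrays rather than typed Java objects, which fits the unjava-fy theme the talk describes.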