Yahoo! migrated a key piece of search infrastructure, the Webmap, to the Hadoop project. See the post on the Yahoo! distributed processing blog.
The really interesting part is Jeremy's interview with Arnab Bhattacharjee (manager of the Yahoo! Webmap Team) and Sameer Paranjpye (manager of Yahoo!'s Hadoop development team). What follows is my summary of the interview.
Arnab and Sameer are both experienced engineers who started at Inktomi (which Yahoo! acquired for its core search technology) and have been working in search for about eight years.
What is Webmap?
Webmap is basically a database that describes everything Yahoo! knows about the web. First, it represents a directed graph of the web, a snapshot of the web graph. It also holds metadata on every URL on the web. It contains hundreds of features for every document, which the machine-learning-based ranking algorithm uses to rank documents.
Examples of metadata in webmap:
Content of the URL
Features derived from clustering of content based on site, host, duplicate content, etc.
Inferences about uncrawled URLs, e.g. everything in this directory exhibits property X.
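To make the description concrete, here is a minimal sketch of a web graph that carries per-URL features and directory-level inferences. It is only an illustration of the idea, not Yahoo!'s design: the class names, fields, and feature names (`TinyWebmap`, `UrlRecord`, `language`, `topic`) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UrlRecord:
    """Hypothetical per-URL record: crawl status plus named features."""
    url: str
    crawled: bool = False
    features: dict = field(default_factory=dict)


class TinyWebmap:
    """Toy in-memory web graph; names and layout are illustrative only."""

    def __init__(self):
        self.records = {}                    # url -> UrlRecord
        self.out_links = defaultdict(set)    # url -> urls it links to (directed edges)
        self.dir_rules = {}                  # url prefix -> inferred features

    def add_page(self, url, links, **features):
        rec = self.records.setdefault(url, UrlRecord(url))
        rec.crawled = True
        rec.features.update(features)
        for target in links:
            self.out_links[url].add(target)
            # Link targets are known to the graph even if never crawled.
            self.records.setdefault(target, UrlRecord(target))

    def add_directory_rule(self, prefix, **features):
        self.dir_rules[prefix] = features

    def features_for(self, url):
        """Merge directory-level inferences with features stored on the URL."""
        merged = {}
        for prefix, inferred in self.dir_rules.items():
            if url.startswith(prefix):
                merged.update(inferred)
        rec = self.records.get(url)
        if rec is not None:
            merged.update(rec.features)      # direct observations win
        return merged


webmap = TinyWebmap()
webmap.add_page("http://a.example/docs/index.html",
                links=["http://a.example/docs/faq.html"],
                language="en")
webmap.add_directory_rule("http://a.example/docs/", topic="documentation")
print(webmap.features_for("http://a.example/docs/faq.html"))
```

The FAQ page was never crawled, yet it still picks up the `topic` feature from the directory rule, which is the kind of inference on uncrawled URLs the list describes.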
AltaVista's Connectivity Server
The first major web graph system was AltaVista's Link Connectivity Server, a precursor to Webmap.
Inktomi created the "Webmap" system. The first Webmap was on the order of one hundred to five hundred million URLs. It ran on 20 or so machines and used Perl scripts to manage the computation.
Rise of Dreadnaught
The Perl system was quickly replaced by Dreadnaught, which is essentially a lightweight version of map-reduce. It was first deployed on 40 nodes in 2001 and has since grown to 1000+ nodes, where it began to break down. Its major problem was that it had no load balancing or fault tolerance built in. Instead, storage and compute were bound together on a single node. Thus, if one node went down, the entire system went down. Also, the slowest node became a bottleneck on cluster throughput.
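For readers who have not seen map-reduce, here is a minimal single-process sketch of the pattern applied to a web graph: counting in-links per URL. It is not Dreadnaught's or Hadoop's API. The function names and the tiny `pages` input are made up for illustration, and a real system would spread the map, shuffle, and reduce steps across many machines.

```python
from collections import defaultdict


def map_links(source_url, out_links):
    """Map step: emit (target, 1) for every outgoing link of one page."""
    for target in out_links:
        yield target, 1


def reduce_counts(target_url, counts):
    """Reduce step: sum the partial counts for one target URL."""
    return target_url, sum(counts)


def run_map_reduce(pages):
    # Shuffle: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for source, links in pages.items():
        for key, value in map_links(source, links):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


# Hypothetical three-page graph: page -> pages it links to.
pages = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
in_links = run_map_reduce(pages)
print(in_links)
```

Because each map call only looks at one page and each reduce call only looks at one key, the work can be partitioned freely, which is what makes the pattern a good fit for graph-scale batch jobs.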
In 2005, the search team wanted its own grid platform. A key reason was that they wanted to enable the application developers and researchers to decouple application logic from the underlying systems (such as Dreadnaught). They decided to get involved with Hadoop and brought Doug Cutting on board. They created the Grid team and spent a lot of time working on Hadoop's scalability and interfacing it with the existing C++ systems.
The Grid Team created the Pipes API to interface the existing C++ system (25+ man-years of development) with the Hadoop "scaffolding". Thus, even though the jobs are managed by Hadoop, the core inner loop is still C++.
Hadoop-based Webmap
The current Hadoop Webmap system is 2000+ nodes and went online last month. The new system produces results about 1.5 times as fast as Dreadnaught. The new architecture allows them to split the job into smaller pieces that run on more nodes, which evens out the processing time across shards without having to worry about node failures.
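A small sketch can show why smaller pieces help with uneven shards. It compares a job where each node is pinned to one large shard against the same total work cut into small tasks that go to whichever node frees up first. The shard sizes, task size, and node count are invented for illustration, and the greedy scheduler here is a stand-in, not Hadoop's actual scheduler.

```python
import heapq


def makespan_static(shard_costs):
    """One shard pinned to each node: the job ends when the slowest shard ends."""
    return max(shard_costs)


def makespan_dynamic(task_costs, num_nodes):
    """Hand each task to whichever node frees up first (greedy list scheduling)."""
    finish_times = [0.0] * num_nodes
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Four uneven shards (arbitrary work units), one of them much larger.
shards = [10.0, 10.0, 10.0, 30.0]

# Same total work, cut into tasks of 2 units each.
small_tasks = [2.0] * int(sum(shards) / 2.0)

print(makespan_static(shards))             # 30.0
print(makespan_dynamic(small_tasks, 4))    # 16.0
```

With pinned shards the job takes as long as the largest shard. With small tasks the four nodes finish within one task length of each other, and a failed task only costs a small unit of rework.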
The current web graph has about 1 trillion edges. In the last process the input was 200 terabytes of data and the output was 300 terabytes. The data is replicated 3x for redundancy, resulting in about 1 petabyte of data. The new platform actually hosts multiple snapshots of the Webmap and has over 5 petabytes of raw storage capacity.
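The storage figures can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch under one reading of the numbers, namely that the "about 1 petabyte" refers to the 300 TB output stored three times. It uses decimal units and ignores input and intermediate data.

```python
TB_PER_PB = 1000          # decimal units, assumed for this estimate

output_tb = 300           # output of the last run, from the post
replication = 3           # copies kept of the data

replicated_tb = output_tb * replication
print(replicated_tb, "TB, about", round(replicated_tb / TB_PER_PB, 1), "PB")

raw_capacity_pb = 5       # quoted raw capacity of the platform
snapshots_that_fit = (raw_capacity_pb * TB_PER_PB) // replicated_tb
print("replicated snapshots that fit in raw capacity:", snapshots_that_fit)
```

Under these assumptions one replicated snapshot takes 900 TB, which rounds to the quoted petabyte, and the raw capacity has room for about five of them.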
The web doubles every year. The team would like to run a cluster of 20,000 nodes. They want to store 100 petabytes of data, with hundreds of millions of files in directories. Also, their grid model needs to be extended to handle long-running services such as crawling and querying. There is a lot of work to do on scheduling jobs to run on the web and balancing these with long-running jobs.
Hadoop is beginning to hit mainstream use in large production systems. Very cool! As I previously mentioned, Hadoop is also supported by Powerset, the semantic search startup. Powerset is supporting the HBase project, a Hadoop-based clone of Google's BigTable, to store its web metadata.