Thursday, August 23

Powerset Data Center Modeling

Steve Newcomb, founder and COO of Powerset, wrote a blog post about its data center model.

Steve provides the model as a Flash application: Powerset Indexing Center Datacenter Dashboard

The application models several important factors in Powerset's cost model:
  • Index Size - How many servers are required to crawl and store a known portion of the Web?
  • Moore's Law - Instead of modeling Moore's Law as a trend line, we broke it out into its two components: Server Speed and Server Cost
  • Lease vs. Buy - What drives a decision to lease servers versus paying cash?
  • Lease vs. EC2 - What drives a decision to lease servers versus virtual computing (e.g. EC2)?

Powerset's NLP analysis of documents during indexing is CPU intensive. In past presentations, Steve has given some rough estimates of their current indexing speed; I seem to recall it being on the order of 1 document/second.
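
To get a feel for why indexing speed matters so much to the Index Size factor, here is a minimal back-of-the-envelope sketch in Python. It is not taken from Powerset's dashboard. The corpus size, the 30-day window, and the reading of "1 document/second" as a per-server rate are all my own assumptions for illustration, and the function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def servers_needed(num_documents, docs_per_second_per_server, days_allowed):
    """Estimate how many servers are needed to index num_documents
    within days_allowed, given a per-server indexing rate."""
    # Documents one server can index over the whole window.
    docs_per_server = docs_per_second_per_server * SECONDS_PER_DAY * days_allowed
    # Round up: a fraction of a server still means one more machine.
    return math.ceil(num_documents / docs_per_server)


# Assumed inputs (not Powerset figures): 1 billion documents,
# 1 document/second per server, 30 days to finish the index.
print(servers_needed(1_000_000_000, 1.0, 30))  # prints 386
```

Under these assumed numbers, doubling the per-server indexing speed would roughly halve the server count, which is presumably why the dashboard treats server speed as its own input.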

It's common knowledge that Powerset has been using Hadoop on Amazon's EC2. They are likely using it to get a jump start on building data at scale before they have their own cluster in place.

The application is interesting to play with; give it a try.

