Steve provides the model in a Flash application: Powerset Indexing Center Datacenter Dashboard
Powerset's NLP analysis of documents during indexing is CPU intensive. In past presentations, Steve has given some rough estimates of their current indexing speed; I seem to recall it being on the order of 1 document/second.
The application models several important factors in Powerset's cost model:
- Index Size - How many servers are required to crawl and store a known portion of the Web?
- Moore's Law - Instead of modeling Moore's Law as a single trend line, we broke it out into its two components: Server Speed and Server Cost
- Lease vs. Buy - What drives a decision to lease servers versus paying cash?
- Lease vs. EC2 - What drives a decision to lease servers versus virtual computing (e.g. EC2)?
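To get a feel for how factors like these interact, here is a minimal back-of-the-envelope sketch in Python. It is not the model behind the dashboard. The function names are my own, the 1 document/second rate is assumed to be per server, and the corpus size, time window, and all prices are hypothetical placeholders chosen only to make the arithmetic concrete.

```python
import math


def servers_needed(num_docs, docs_per_sec_per_server, days):
    """Servers required to index num_docs within `days`,
    assuming each server sustains a fixed indexing rate."""
    seconds = days * 24 * 60 * 60
    docs_per_server = docs_per_sec_per_server * seconds
    return math.ceil(num_docs / docs_per_server)


def buy_cost(servers, price_per_server):
    """Up-front cash cost of buying the servers outright."""
    return servers * price_per_server


def lease_cost(servers, monthly_lease_per_server, months):
    """Total lease payments over the given number of months."""
    return servers * monthly_lease_per_server * months


def ec2_cost(instances, hourly_rate, days):
    """Cost of renting virtual instances by the hour for `days`."""
    return instances * hourly_rate * 24 * days


if __name__ == "__main__":
    # Hypothetical inputs: 1 billion documents, 1 doc/sec per server, 30 days.
    n = servers_needed(1_000_000_000, 1.0, 30)
    print("servers needed:", n)
    # Hypothetical prices, for comparison only.
    print("buy:  ", buy_cost(n, price_per_server=2000))
    print("lease:", lease_cost(n, monthly_lease_per_server=100, months=12))
    print("ec2:  ", ec2_cost(n, hourly_rate=0.10, days=30))
```

Changing the per-server rate or the per-server price in this sketch is a crude stand-in for the Server Speed and Server Cost components mentioned above: doubling the rate halves the server count, while the price only scales the totals.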
It's common knowledge that Powerset has been using Hadoop on Amazon's EC2. They are likely using it to get a jump start on building their data at scale before they have their own cluster in place.
The application is interesting to play with; give it a try.