Thursday, August 23

Powerset in the news

Continuing today's theme of catching up on Powerset news: about a month ago, MIT Technology Review ran an article on Powerset and other NLP-based search engines:
Building A Better Search Engine

The article also mentions IBM Avatar, which I have previously discussed.

Another reminder: if you missed the talk that Barney Pell (founder of Powerset) gave at UW in May, you can still catch the video.

Chad Walters and Patrick Tufts blogs

Today, I came across two interesting blogs that I wanted to share.

Chad Walters
Chad is the Search Architect at Powerset. His boss, Steve Newcomb, has an interview up on Steve's blog. Chad is a veteran of Yahoo Search, where he was the Lead Architect for Runtime Search under Sean Suchter (who was one of the leads on Inktomi).

Chad's blog has a great introductory article on query result and posting list caching in search engines (static versus dynamic caching). He hasn't blogged in a while, so let's hope he gets some more time!
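To make the static versus dynamic distinction concrete, here is a minimal sketch of a two-tier query result cache. It is my own illustration, not code from Chad's article: the class name `QueryResultCache`, the `compute` callback, and the choice of LRU for the dynamic tier are all assumptions. The static tier is filled once (say, from the most frequent queries in an old query log) and never evicted, while the dynamic tier adapts to recent traffic.

```python
from collections import OrderedDict


class QueryResultCache:
    """Hypothetical two-tier cache: a fixed static tier plus an LRU dynamic tier."""

    def __init__(self, static_entries, dynamic_capacity):
        # Static tier: precomputed results for known-popular queries; never evicted.
        self.static = dict(static_entries)
        # Dynamic tier: least-recently-used entries are evicted when full.
        self.dynamic = OrderedDict()
        self.capacity = dynamic_capacity

    def get(self, query, compute):
        """Return cached results for `query`, calling `compute(query)` on a miss."""
        if query in self.static:
            return self.static[query]
        if query in self.dynamic:
            # Mark as most recently used.
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        result = compute(query)
        self.dynamic[query] = result
        if len(self.dynamic) > self.capacity:
            # Evict the least recently used entry.
            self.dynamic.popitem(last=False)
        return result
```

The same split could in principle be applied to posting lists instead of query results; only the keys (terms rather than queries) and the cached values change.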

Patrick Tufts
Patrick is an 'AI guy' working on Freebase for Metaweb (see my previous discussion). According to his blog, he also invented one of the two product recommendation engines used at Amazon (cool!). Speaking of which, Freebase just announced an open alpha.

Powerset Data Center Modeling

Steve Newcomb, founder and COO of Powerset, wrote a blog post about its data center model.

Steve provides the model as a Flash application: Powerset Indexing Center Datacenter Dashboard

The application models several important factors in Powerset's cost model:
  • Index Size - How many servers are required to crawl and store a known portion of the Web?
  • Moore's Law - Instead of modeling Moore's Law as a trend line, we broke it out into its two components: Server Speed and Server Cost
  • Lease vs. Buy - What drives a decision to lease servers versus paying cash?
  • Lease vs. EC2 - What drives a decision to lease servers versus virtual computing (e.g. EC2)?
Powerset's NLP analysis of documents during indexing is CPU intensive. In past presentations, Steve has given some rough estimates of their current indexing speed; I seem to recall something on the order of 1 document/second.
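The Index Size question lends itself to a quick back-of-the-envelope calculation. The sketch below is my own and is not taken from Steve's dashboard: the function name `servers_needed` is made up, and it assumes the roughly 1 document/second figure is per server, which I am not certain of. The corpus size and time window in the usage line are purely illustrative.

```python
import math


def servers_needed(num_documents, docs_per_second_per_server, days_to_index):
    """Estimate how many servers are needed to index a corpus within a time window.

    Assumes every server indexes at the same constant rate, with no downtime
    and no coordination overhead (a deliberate simplification).
    """
    seconds_available = days_to_index * 24 * 60 * 60
    docs_per_server = docs_per_second_per_server * seconds_available
    return math.ceil(num_documents / docs_per_server)


# Illustrative only: one billion documents, 1 doc/sec per server, 30 days.
print(servers_needed(1_000_000_000, 1.0, 30))  # prints 386
```

At that rate, a single server gets through about 2.6 million documents in 30 days, which is why both indexing speed and server cost matter so much in the model.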

It's common knowledge that Powerset has been using Hadoop on Amazon's EC2. They are likely using this to get a jump start on building their data at scale before they have their own cluster in place.

The application is interesting to play with; give it a try.