There is a new paper up on Google's research website:
Interpreting the Data: Parallel Analysis with Sawzall (Draft)
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on.
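The filter-then-aggregate pattern the abstract describes maps naturally onto a map/reduce split. A minimal sketch in Python (not Sawzall; the record fields here are made up for illustration, not taken from the paper):

```python
from collections import Counter

# Hypothetical log records; the field names are illustrative only.
records = [
    {"origin": "US", "status": 200},
    {"origin": "DE", "status": 500},
    {"origin": "US", "status": 200},
    {"origin": "CN", "status": 404},
]

def map_phase(record):
    # Filter: keep only successful requests, emit a (key, count) pair.
    if record["status"] == 200:
        yield (record["origin"], 1)

def reduce_phase(pairs):
    # Aggregate: sum the emitted counts per key.
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [p for r in records for p in map_phase(r)]
print(reduce_phase(pairs))  # {'US': 2}
```

Because each record is filtered independently and the counts are combined with a simple sum, the map phase can run on thousands of machines in parallel, which is the point the abstract is making.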
Go to the site for the full paper and abstract. I'll read it later today. It's on the same list as GFS and MapReduce, so I hope it lives up to the same standard.
The coolest thing in the paper so far is the movie showing the distribution of requests to Google's servers over the course of a day.
This is very interesting. Can you track the technical progress of a society (sorry for waxing philosophical like John Battelle) by the volume and type of queries executed? It would be interesting to track this over the course of years to see the growth of technology in rapidly emerging countries like China and parts of South America.
So now we have the volume distribution, but can we mine trends at a global level? For example, commercial queries have eclipsed sex-related queries in North America. Will this trend repeat itself in Europe? Fascinating.
Thanks to Digg for the tip.