CSE 490H: Scalable Systems: Design, Implementation and Use of Large Scale Clusters
The topics covered include MapReduce, MapReduce algorithms, distributed file systems such as the Google File System, cluster monitoring, and power and availability issues. The course is taught by Ed Lazowska and Aaron Kimball. The class uses the widely used Hadoop MapReduce framework, created by Doug Cutting and Yahoo!, to give students hands-on experience.
The four class assignments help students become familiar with real-world tools and tasks:
- Set up and test Apache Hadoop, using it to count words in a corpus and build an inverted index.
- Run PageRank on Wikipedia to find the most highly cited articles.
- Assignments 3 and 4 build a rudimentary version of Google Maps. Assignment 3 creates map tiles of the US from geographic survey data.
- Assignment 4 uses Amazon S3 storage and an EC2 compute cluster to look up addresses on the maps created in assignment 3, connecting it all to a web front end.
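The word-count job in the first assignment is the canonical MapReduce example. Below is a minimal single-machine Python sketch of the programming model, not actual Hadoop code (real jobs would be written in Java against the Hadoop API): a mapper emits (word, 1) pairs, a shuffle phase groups values by key, and a reducer sums each group.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    # Mapper: emit a (word, 1) pair for every word in the document.
    for word in text.lower().split():
        yield word, 1

def reduce_word_count(word, counts):
    # Reducer: sum the partial counts for each word.
    yield word, sum(counts)

def run_mapreduce(documents, mapper, reducer):
    # Shuffle phase: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(kv for key in groups for kv in reducer(key, groups[key]))

docs = {"a.txt": "the cat sat", "b.txt": "the cat ran"}
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
# → {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```

The inverted index is the same skeleton with the mapper emitting (word, doc_id) and the reducer collecting the document lists.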
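The PageRank assignment also maps naturally onto this model: in each iteration, a map step splits every page's rank across its outlinks, and a reduce step sums the contributions arriving at each page. A single-machine Python sketch of that iteration follows (illustrative only; the tiny link graph and parameter values are my own, not from the assignment):

```python
def pagerank(links, iterations=20, damping=0.85):
    # links: page -> list of pages it links to.
    pages = set(links) | {p for targets in links.values() for p in targets}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # "Map" step: each page sends rank / out-degree along its outlinks.
        contrib = {p: 0.0 for p in pages}
        for page, targets in links.items():
            if targets:
                share = rank[page] / len(targets)
                for t in targets:
                    contrib[t] += share
        # "Reduce" step: combine incoming contributions with the damping factor.
        rank = {p: (1 - damping) / len(pages) + damping * contrib[p]
                for p in pages}
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
# "C" ends up with the highest rank: it is linked from both "A" and "B".
```

On Wikipedia-scale data, Hadoop distributes the map and reduce steps across the cluster and the iteration is run as a chain of jobs.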
The videos and slides of the lectures are also available to view and download. This is fantastic, because the class features really interesting speakers, such as Jeff Dean from Google and Werner Vogels from Amazon, talking about these tools and their future directions.
The class is a great quick-start on using Hadoop for cluster computation.
On a related note, you may also want to look at the lectures and materials for a mini-course on cluster computing given to Google interns.
Here at UMass we do large-scale indexing using a MapReduce-like framework called TupleFlow, which powers the Galago search engine; both were written by Trevor Strohman (now at Google).