The M45 cluster runs Hadoop for large-scale map-reduce data processing tasks. According to the report, the cluster has approximately 4,000 processor cores and 1.5 petabytes of disk storage.
On some of UMass's plans, the article writes:
“Yahoo!’s supercomputing cluster will enable us to do data-intensive research on a large set of scanned books drawn from the Internet Archive’s million-book collection. The latter includes 8.5 terabytes of text and half a petabyte of scanned images. Research on such large datasets would not be possible without the use of clusters like the one Yahoo! is offering us access to.”
Some of the team here in the IR lab have been working on the Million Book Collection as part of a grid computing text mining seminar offered this term.
Beyond the OCA collection, I also hope to run experiments on the new billion-document web corpus, ClueWeb09.