Thursday, April 9

Yahoo! Expands M45 Access

Not to be confused with the Swedish submachine gun, M45 is one of Yahoo!'s compute clusters. Today Yahoo! Research announced that it is expanding use of the cluster to three universities beyond CMU: UMass Amherst, UC Berkeley, and Cornell. Carnegie Mellon has been working on the cluster since last year.

The M45 cluster runs Hadoop for large-scale map-reduce data processing. According to the announcement, the cluster has approximately 4,000 processor cores and 1.5 petabytes of disk.

On UMass's plans, the article says:
“Yahoo!’s supercomputing cluster will enable us to do data-intensive research on a large set of scanned books drawn from the Internet Archive’s million-book collection. The latter includes 8.5 terabytes of text and half a petabyte of scanned images. Research on such large datasets would not be possible without the use of clusters like the one Yahoo! is offering us access to.”
Some of the team here in the IR lab have been working on the Million Book Collection as part of a seminar on grid computing and text mining offered this term.

Beyond the OCA collection, I also hope to run experiments on ClueWeb09, the new billion-document web corpus.

