Friday, April 10

NSF Graduate Research Fellowships Announced

The prestigious and competitive NSF GRFP awards were announced this morning.

Congratulations to Jackie and Ben on their awards! They are both very deserving.

For some of us, there's still next year.

Thursday, April 9

Hadoop on Windows Tutorial

Don't try it. Just don't. It's not worth the hassle. Instead, dual-boot or install VMware Server and get a copy of Ubuntu... and use the Cloudera distribution of Hadoop.

But if you want to do it the hard and annoying way:

Vlad Korolev, a PhD student at UMBC, has a very useful tutorial on setting up a Hadoop environment on Windows with Eclipse integration.

Personally, even after I got it working, I quickly got sick of having to use Cygwin and gave up on it.

Glasgow Releases New Blog '08 Test Collection

Iadh just announced on the Terrier Team blog that they have released the new collection of blog data that will be used for the TREC 2009 Blog Track (details on the wiki). If you plan on participating, he suggests filling out the paperwork ASAP. You can also see the stats. The high-level view is that it has 28,488,767 blog posts from 1,303,520 blog feeds, crawled from January 2008 to mid-February 2009.

The time period for this dataset overlaps significantly with the full-web ClueWeb09 crawl. This will make cross-dataset connections interesting.

Thanks to the team at Glasgow for the work making this happen.

Out of curiosity, I wonder if it's somehow possible to put together a parallel Twitter corpus...

Yahoo! Expands M45 Access

Not to be confused with the Swedish sub-machine gun, the M45 is one of Yahoo's compute clusters. Today Yahoo! Research announced that it is expanding use of the cluster to three additional universities beyond CMU. Carnegie Mellon has been working on the cluster since last year. The three new universities are UMass Amherst, UC Berkeley, and Cornell.

The M45 cluster runs Hadoop for large-scale map-reduce data processing tasks. According to the report, the cluster has approximately 4,000 processor-cores and 1.5 petabytes of disks.

On some of UMass's plans, the article says:
“Yahoo!’s supercomputing cluster will enable us to do data-intensive research on a large set of scanned books drawn from the Internet Archive’s million-book collection. The latter includes 8.5 terabytes of text and half a petabyte of scanned images. Research on such large datasets would not be possible without the use of clusters like the one Yahoo! is offering us access to.”
Some of the team here in the IR lab have been working on the Million Book Collection as part of a grid computing text mining seminar offered this term.

Beyond the OCA collection, I also hope to be able to run experiments on the new billion-document web corpus, ClueWeb09.

Wednesday, April 8

Map-reduce machine learning: Mahout 0.1 Release

Grant announced on his blog the release of Apache Mahout 0.1. Mahout is an effort to port several standard machine learning algorithms to the Hadoop map-reduce framework.

Grant gives an update on the algorithms integrated so far:
We have several clustering algorithm implementations: k-Means, fuzzy k-Means, Dirichlet, Mean-Shift, Canopy. We also have implementations of naive bayes and complementary naive bayes for classification and some integration with the Watchmaker evolutionary programming framework.
The webpage should be updated with the link to the release shortly.

You can also check out Grant's slides from ApacheCon and the Mahout Wiki.
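
To make the idea of porting a clustering algorithm to map-reduce a bit more concrete, here is a minimal sketch of a single k-means iteration split into a map step and a reduce step. It is plain Python running in one process, not Mahout's code and not Hadoop's API; the function names (nearest, map_step, reduce_step, kmeans_iteration) are made up for illustration. The map step assigns each point to its nearest centroid, and the reduce step averages the points that share a centroid.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_step(points, centroids):
    """Map: emit a (centroid index, point) pair for every input point."""
    for point in points:
        yield nearest(point, centroids), point


def reduce_step(pairs, centroids):
    """Reduce: group points by centroid index and average each group."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    # A centroid that received no points keeps its old position.
    new_centroids = list(centroids)
    for key, members in groups.items():
        new_centroids[key] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return new_centroids


def kmeans_iteration(points, centroids):
    """One full iteration: map every point, then reduce to updated centroids."""
    return reduce_step(map_step(points, centroids), centroids)


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
    centroids = [(1.0, 1.0), (9.0, 9.0)]
    print(kmeans_iteration(points, centroids))
```

On a real cluster the map calls would run in parallel over chunks of the data and the framework would handle the grouping by key; the sketch only shows how the work divides between the two phases.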