Thursday, March 25

Rejected SIGIR 2010 Paper? Consider Not Relevant

Yesterday the SIGIR paper notifications went out. I didn't submit a paper, but I know many other people who did. The results were mixed. I believe some quality papers were rejected.

Did your paper get rejected unfairly? Poor reviewing? You should check out Not Relevant, a new online journal for unfairly rejected SIGIR papers.

You can also join the discussion on Twitter.

Wednesday, March 24

AMPLab: Exploring Big Data With Algorithms, Machines, and People

Today Michael Franklin gave a talk on the future of big data. He framed the databases vs. NoSQL debate in World of Warcraft terms: the Alliance vs. the Horde, a struggle for the future of big data. More on that later.

Michael mentioned that he is excited about a new project that is spinning up called AMPLab. The effort explores the combination of distributed (machine learning) algorithms with crowdsourcing.

To explore the research area, they created a course, AMPLab, CS294. According to the first lecture, one of the main goals is to:
Enable lots of people to collaborate (knowingly or not) to collect, generate, clean, make sense of and utilize lots of data.
It will try to address one of the key problems, scalability:
  • Scale state-of-the-art ML to large datasets (building on efforts like Spark)
  • Enable data analytics frameworks to handle incomplete, heterogeneous, dirty data
  • Simplify distributed processing models
  • Use Active Learning to direct large numbers of people to improve data quality
Check out the course for details as the project is getting started.

Tuesday, March 23

Mahout 0.3 release

There has been a steady stream of releases. Another release last week was Mahout 0.3. Mahout is a library for machine learning algorithms on Hadoop.

The new release has a variety of improvements and new features, which you can read about in the release notes. Here are the highlights from the announcement:
  • New: math and collections modules based on the high performance Colt library
  • Faster Frequent Pattern Growth (FPGrowth) using FP-bonsai pruning
  • Parallel Dirichlet process clustering (model-based clustering algorithm)
  • Parallel co-occurrence based recommender
  • Parallel text document to vector conversion using LLR based ngram generation
  • Parallel Lanczos SVD (Singular Value Decomposition) solver
  • Shell scripts for easier running of algorithms, utilities and examples

Spring 2010 Courses To Watch

Here are some of the courses I've run across that I found interesting. I'll start with the ones I'm currently taking here at UMass:

CS645 - Advanced Databases
STAT 608 - Bayesian Statistics by Michael Lavine; see Chapters 5, 6, and 7 of his book.

There's also a very interesting DB seminar, Large Scale Data Analysis by Yanlei, which takes a DB perspective on MapReduce and other large-scale data analysis problems.

Now here are a few from elsewhere:

Jimmy Lin is teaching his course, Data-Intensive Information Processing Applications with MapReduce.

Eugene Agichtein is teaching a course on IR and Web Search at Emory.

(Added 3/25)
Michael Jordan at Berkeley is teaching a course on Bayesian Modeling and Inference.

Soumen Chakrabarti is teaching Organizing Web Information, which focuses on information extraction, such as extracting entities.

Let me know which good ones I'm missing!

Terrier 3.0 released

While I'm catching up, I wanted to highlight the recent release of Terrier 3.0 from the IR group at the University of Glasgow. You should read their blog post on the topic. You can also read its documentation. From Craig's announcement:
- support for indexing WARC collections (such as ClueWeb09)
- improved MapReduce mode indexing
- improved and more scalable index structures
- added field-based and proximity term dependence models, such as BM25F, PL2F and Markov Random Fields
- new Web-based retrieval interface
My belated congratulations to Iadh, Craig, and the rest of the team on the release.

Monday, March 22

Many WSDM 2010 videos online

Most of the videos from the presentations at WSDM 2010 were posted late last week.

The keynotes, including the one from Soumen Chakrabarti, are still locked. However, the slides from Chakrabarti's talk, Bridging the Structured-Unstructured Gap, are available from his website.