Thursday, April 26

Friday News: SIAM Data Mining Proceedings, LingPipe 3.0, and fun with Pig, Sawzall, and DryadLINQ

SIAM Data Mining 2007
The SIAM Data Mining Conference is happening this week in Minneapolis. Daniel Lemire has coverage on his blog. All of the proceedings are available online for download (I wish the ACM did this). Here are some highlights:

Best Paper Awards
Research: Less Is More: Compact Matrix Decomposition for Large Sparse Graphs
Authors: J. Sun, Y. Xie, H. Zhang and C. Faloutsos

Application: Harmonium Models for Semantic Video Representation and Classification
Authors: J. Yang, Y. Liu, E. Xing and A. Hauptmann

Another paper that looked interesting was:
Bandits for Taxonomies: A Model-based Approach by Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti and Vanja Josifovski (all of Yahoo Research). The problem here is to match contextual ads to web pages as efficiently as possible, even when clicks (feedback) are rare. One of the tricks described is taxonomy matching -- classifying web pages into a hierarchical taxonomy (such as the Yahoo Directory) and then classifying ads into the same taxonomy. They can then exploit relationships within the taxonomy to find other similar content. They put an interesting spin on it by framing the problem as a "multi-armed bandit problem." See the Wikipedia entry on the Multi-armed bandit problem for background on a very interesting gambling problem ;-).
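The paper's taxonomy-aware algorithm is more involved, but the basic bandit loop is easy to sketch in plain Java. This is a generic epsilon-greedy policy, not the authors' method; treating each ad as an "arm" and a click as a reward of 1 is my illustrative framing.

```java
import java.util.Random;

// Minimal epsilon-greedy multi-armed bandit sketch: each "arm" is an ad,
// a pull is an impression, and the reward is a click (1.0) or not (0.0).
public class EpsilonGreedyBandit {
    private final int[] pulls;      // impressions served per ad
    private final double[] rewards; // total clicks observed per ad
    private final double epsilon;   // exploration rate
    private final Random rng;

    public EpsilonGreedyBandit(int arms, double epsilon, long seed) {
        this.pulls = new int[arms];
        this.rewards = new double[arms];
        this.epsilon = epsilon;
        this.rng = new Random(seed);
    }

    // Choose an ad: explore a random one with probability epsilon,
    // otherwise exploit the ad with the best observed click rate.
    public int selectArm() {
        if (rng.nextDouble() < epsilon) {
            return rng.nextInt(pulls.length);
        }
        int best = 0;
        for (int a = 1; a < pulls.length; a++) {
            if (meanReward(a) > meanReward(best)) best = a;
        }
        return best;
    }

    // Record the feedback (click or no click) for one impression.
    public void update(int arm, double reward) {
        pulls[arm]++;
        rewards[arm] += reward;
    }

    public double meanReward(int arm) {
        return pulls[arm] == 0 ? 0.0 : rewards[arm] / pulls[arm];
    }
}
```

The paper's contribution is essentially a smarter version of `selectArm`: when clicks are rare, sharing feedback among sibling nodes in the taxonomy gives each ad a usable prior long before it has enough impressions of its own.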

LingPipe 3.0
Alias-i has released LingPipe 3.0. There are full details on the new version on the LingPipe blog. The new system moves to Java 1.5 and uses generics. There is a great story about the upgrade process: Spring Cleaning Generics for Lingpipe 3.0. Generics are awesome -- and I love the for-each loop. Also, the clustering package was rewritten from the ground up; there is a new clustering tutorial as well.
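For anyone who hasn't made the 1.5 jump yet, here is a quick illustration of what generics and the for-each loop buy you (this is generic Java, not LingPipe's API): the element type is checked at compile time, and the old raw-`Iterator`-plus-cast boilerplate disappears.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pre-1.5 code used raw collections with explicit Iterators and casts;
// with Java 1.5 generics the compiler checks element types, and the
// for-each loop iterates without any casting.
public class GenericsDemo {
    // Count token frequencies in a list of tokens.
    public static Map<String, Integer> countTokens(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String token : tokens) { // for-each: no Iterator, no cast
            Integer n = counts.get(token);
            counts.put(token, n == null ? 1 : n + 1);
        }
        return counts;
    }
}
```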

Distributed Processing Abstractions: Pig, Sawzall, and DryadLINQ
These are programming models designed to enable mere mortals to write programs that seamlessly scale for parallel processing on large computing clusters. In short, they are tools that enable efficient large-scale data manipulation over web pages, query logs, etc. These languages usually (with the exception of Dryad) run on a map-reduce framework (such as Hadoop, which Yahoo backs). All three of the major search engines are building languages to perform large scale distributed data processing:

The Pig Project from Yahoo (an open-source Java add-on to Hadoop).
The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra. Queries articulate data analysis tasks in terms of set-oriented transformations, e.g. apply a function to every record in a set, or group records according to some criterion and apply a function to each group.
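The "group records and apply a function to each group" idiom is easy to show in plain Java (this is a sketch of the model, not Pig Latin syntax; the query-log records here are made up). A Pig query would express the same thing declaratively and have it run across a cluster.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the set-oriented model a Pig query expresses: group
// records by a key field, then apply a function to each group
// (here the function is simply the group's size).
public class GroupAndApply {
    public static Map<String, Integer> countByKey(List<String[]> records,
                                                  int keyField) {
        // Group records by the chosen key field.
        Map<String, List<String[]>> groups =
            new HashMap<String, List<String[]>>();
        for (String[] record : records) {
            String key = record[keyField];
            List<String[]> group = groups.get(key);
            if (group == null) {
                group = new ArrayList<String[]>();
                groups.put(key, group);
            }
            group.add(record);
        }
        // Apply a function to each group: count its records.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map.Entry<String, List<String[]>> e : groups.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }
}
```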
DryadLINQ from Microsoft (Distributed Systems and Web Search and Data Mining teams).

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph... Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

Dryad is closed source, written in .NET and C#.

Sawzall from Google

Greg has coverage on them (Yahoo Pig and Google Sawzall) and goes into some depth on the similarities and differences between the languages.
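Since both Pig and Sawzall compile down to map-reduce, here is the canonical word-count example of that underlying model, done in-memory in plain Java (a sketch of the idea, not any framework's API): a map phase emits (word, 1) pairs, the framework groups pairs by key, and a reduce phase sums each group.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of the map-reduce model: map emits (word, 1) pairs,
// the framework shuffles them into per-key groups, and reduce sums each
// group. Real frameworks run these phases in parallel across a cluster.
public class WordCountMapReduce {
    public static Map<String, Integer> wordCount(List<String> documents) {
        // Map + shuffle: emit (word, 1) and group the pairs by key.
        Map<String, List<Integer>> shuffled =
            new HashMap<String, List<Integer>>();
        for (String doc : documents) {
            for (String word : doc.split("\\s+")) {
                if (word.length() == 0) continue;
                List<Integer> values = shuffled.get(word);
                if (values == null) {
                    values = new ArrayList<Integer>();
                    shuffled.put(word, values);
                }
                values.add(1); // map emits (word, 1)
            }
        }
        // Reduce: sum the values for each key.
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }
}
```

The appeal of Pig and Sawzall is that you never write this plumbing yourself: you state the grouping and aggregation, and the system generates the map, shuffle, and reduce phases for you.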
