Thursday, April 26

Friday News: SIAM Data Mining Proceedings, LingPipe 3.0, and fun with Pig, Sawzall, and DryadLinq

SIAM Data Mining 2007
The SIAM Data Mining Conference is happening this week in Minneapolis. Daniel Lemire has coverage on his blog. All of the proceedings are available online for download (I wish the ACM did this). Here are some highlights:

Best Paper Awards
Research: Less Is More: Compact Matrix Decomposition for Large Sparse Graphs
Authors: J. Sun, Y. Xie, H. Zhang and C. Faloutsos

Application: Harmonium Models for Semantic Video Representation and Classification
Authors: J. Yang, Y. Liu, E. Xing and A. Hauptmann

Another paper that looked interesting was:
Bandits for Taxonomies: A Model-based Approach by Sandeep Pandey, Deepak Agarwal, Deepayan Chakrabarti and Vanja Josifovski (all of Yahoo Research). The problem here is to match contextual ads to web pages as efficiently as possible, even when clicks (feedback) are rare. One of the tricks described is to use taxonomy matching -- classifying web pages into a hierarchical taxonomy (such as the Yahoo Directory) and then classifying ads into the same taxonomy. They can then exploit relationships within the taxonomy to find other similar content. They put an interesting spin on it by framing the problem as a "multi-armed bandit problem." See the Wikipedia entry on the Multi-armed bandit problem for background on a very interesting gambling problem ;-).
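For readers who haven't met bandits before, here is a minimal sketch of the basic idea in Python. To be clear, this is plain epsilon-greedy, a textbook bandit strategy, and not the model-based taxonomy approach from the paper. The function name, the click rates, and the parameters are all made up for illustration.

```python
import random

def epsilon_greedy(click_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy ad selection against hidden click rates.

    click_rates are hypothetical per-ad click probabilities that the
    algorithm never sees directly; it only observes clicks.
    Returns (shows, clicks): how often each ad was shown and clicked.
    """
    rng = random.Random(seed)
    n = len(click_rates)
    shows = [0] * n
    clicks = [0] * n
    for _ in range(rounds):
        if rng.random() < epsilon or 0 in shows:
            # Explore: pick a random ad (always done until every ad
            # has been shown at least once).
            arm = rng.randrange(n)
        else:
            # Exploit: pick the ad with the best observed click rate.
            arm = max(range(n), key=lambda i: clicks[i] / shows[i])
        shows[arm] += 1
        if rng.random() < click_rates[arm]:
            clicks[arm] += 1
    return shows, clicks

# Example: three ads with rare clicks, as in the contextual ad setting.
shows, clicks = epsilon_greedy([0.01, 0.05, 0.02])
```

With clicks this rare, the observed rates stay noisy for a long time, which is roughly why borrowing evidence from related nodes in a taxonomy is attractive.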

LingPipe 3.0
Alias-i has released LingPipe 3.0. There are full details on the new version on the LingPipe blog. The new system moves to Java 1.5 and uses generics. There is a great story about the upgrade process: Spring Cleaning Generics for Lingpipe 3.0. Generics are awesome -- and I love the for-each loop. Also, the clustering package was rewritten from the ground up; there is a new clustering tutorial as well.

Distributed Processing Abstractions: Pig, Sawzall, and DryadLinq
These are programming models designed to enable mere mortals to write programs that seamlessly scale for parallel processing on large computing clusters. In short, they are tools that enable efficient large-scale data manipulation over web pages, query logs, etc. These languages usually (with the exception of Dryad) run on a map-reduce framework (such as Yahoo's Hadoop). All three of the major search engines are building languages to perform large-scale distributed data processing:

The Pig Project from Yahoo (an open-source Java add-on to Hadoop).
The highest abstraction layer in Pig is a query language interface, whereby users express data analysis tasks as queries, in the style of SQL or Relational Algebra. Queries articulate data analysis tasks in terms of set-oriented transformations, e.g. apply a function to every record in a set, or group records according to some criterion and apply a function to each group.
DryadLinq from Microsoft (Distributed Systems and Web Search and Data Mining teams).

A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator which can synthesize any directed acyclic graph... Dryad handles job creation and management, resource management, job monitoring and visualization, fault tolerance, re-execution, scheduling, and accounting.

Dryad is closed source, written using .Net and C#.

Sawzall from Google

Greg has coverage on them (Yahoo Pig and Google Sawzall) and goes into some depth on some of the similarities and differences in the languages.
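To make the "set-oriented transformations" in the Pig description a bit more concrete, here is a rough single-machine sketch in Python. This is not Pig's query syntax, and the helper names (for_each, group_by) and the sample query-log records are invented for illustration. It only shows the two operations the quote mentions: apply a function to every record, then group records and apply a function to each group.

```python
from collections import defaultdict

# Hypothetical query-log records: (user, query) pairs.
records = [
    ("u1", "iPod"),
    ("u2", "xbox 360"),
    ("u1", "kershaw shun"),
    ("u3", "ipod"),
]

def for_each(records, fn):
    """Apply a function to every record in a set."""
    return [fn(r) for r in records]

def group_by(records, key):
    """Group records according to some criterion."""
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    return dict(groups)

# Normalize each query, group by query text, then count each group.
normalized = for_each(records, lambda r: (r[0], r[1].lower()))
by_query = group_by(normalized, key=lambda r: r[1])
counts = {query: len(rows) for query, rows in by_query.items()}
```

The appeal of these systems is that the same kind of pipeline gets spread across a cluster without the programmer writing the distribution logic by hand.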

Tuesday, April 24

Images on Windows Live Products, An Improvement?

Recently Google renamed Froogle and gave it an upgrade. Now it is Microsoft's turn, with upgrades to Live Product Search.

Microsoft has a post up on their blog, Live Product Search More Images, More Relevant. According to their latest information, 88.6% of the products now have images (a 9% improvement over the old system). Why don't 100% of the top results have images?
The reason is largely because many sites, including very reputable merchants like,, and, block image crawling bots or seriously throttle them. We will have to work with these sites to address these issues, but the latest improvements in the number of Product Search top results with images are already quite significant.
Webmasters are weird about crawlers. Some complain even when the search engine will drive traffic to their site... and then there are the really insane webmasters who complain when you make 100 hits to their site. Some appear to have nothing to do but pore over their web logs. Still, it is surprising that MSN is having these problems with major retailers.

I don't buy MSN's 88.6%, at least not from a user perspective. I tried some queries and I'm seeing much worse results. See XBox 360 (10 out of 18 have images), ipod (12 out of 18), and a hard one, kershaw shun (a brand of knife) (0 out of 18). This leads to my overall rating of approximately 41%. Compare this with Google: XBox 360 (7 out of 10), Ipod (10 out of 10), Kershaw Shun (10 out of 10). Google gets 90%. My unscientific 41% is a long way off from MS's claim of 89%. Now, I can't see what is in their entire database, but from my user experience something is fishy here.
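For anyone who wants to check my math, here is the arithmetic as a few lines of Python. The counts are the ones from my three test queries above; the function name is just something I made up, and the rate is pooled over all results rather than averaged per query (with equal result counts per query the two come out the same).

```python
# (results with images, results shown) for each test query.
msn = {"XBox 360": (10, 18), "ipod": (12, 18), "kershaw shun": (0, 18)}
google = {"XBox 360": (7, 10), "ipod": (10, 10), "kershaw shun": (10, 10)}

def image_rate(results):
    """Fraction of all shown results that had an image."""
    with_images = sum(hits for hits, _ in results.values())
    shown = sum(total for _, total in results.values())
    return with_images / shown

msn_pct = round(100 * image_rate(msn))        # 22 of 54 results
google_pct = round(100 * image_rate(google))  # 27 of 30 results
```

Three queries is a tiny sample, so treat these as a sanity check and not a measurement.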

In other product-search-related Microsoft news... MS is working on Cloud DB (coverage via Geeking with Greg), a similar product to Google's BigTable. The key problem here is: how do you handle sparsely populated columns efficiently? From what appears to be some kind of leaked discussion on Cloud DB:
MSN Shopping. The total set of attributes that products can have (e.g. “Pixel Resolution”) is very large, but any given product only has a few (a vacuum cleaner doesn’t have ‘Pixel Resolution’).
A good review of BigTable and eventually BigTable and S3 is something for another night...

Future of Search Event at UC Berkeley

There has been a lot of talk about the future of search recently; see, for example, Hakia's Quest for Better Search.

Matthew Hurst pointed out that the Future of Search 'research event' at UC Berkeley is coming up on May 4th. It is billed as an opportunity for students and academics to get together and talk with industry, to set agendas for relevant research.
This event will examine the path towards the next generation of Search. This requires new technology for its development, engineering design and visualization. As the technological expertise for each component becomes increasingly complex, there is a need to better integrate them into a global model. The ultimate goal is to understand how we can fully mechanize search engines with cognitive and natural language capabilities. This event will endeavor to construct an overview of what is to come, to elucidate and formulate the main open questions in this grand quest and to highlight promising research directions.
About half the day will be taken up with three panels: NLP, Communities and Search, and Multi-Media.

Matthew is participating in the communities and search panel. The speakers and panels all look really interesting -- I wish I could attend! If you are near Berkeley, don't miss it. Registration is free!