Wednesday, July 8

KDD Best Paper Award: Modeling Temporal Dynamics, a key to winning the Netflix Challenge

This week Collaborative Filtering with Temporal Dynamics won the best paper award at KDD 2009. The recent Communications of the ACM has an interview with a short summary. The author, Yehuda Koren of Yahoo! Research, was on the winning team of the Netflix Challenge, which recently broke the 10% improvement threshold.

The paper details how changes in people's tastes over time affect their ratings, and how to model those changes. From the CACM article:
...although recent data may reveal more about a user's current preferences than older data, simply underweighting older ratings loses too much valuable information for that approach to work. The trick to not tossing the baby with the bathwater is to retain everything that predicts the user's long-term behavior while filtering out temporary noise.
Fascinating. I look forward to reading this paper in detail. This has broad implications for leveraging user data in social media applications.

Citigroup Google and Bing Relevance study is bunk

SELand has a story on a recent study conducted by Citigroup analyst Mark Mahaney that compares the relevance of Google and Bing. AllThingsD also has coverage.
Over the past two weeks, we conducted 200 queries across the three major Search engines–Google, Yahoo! and Bing...After conducting the same query across all three Search sites, we picked a winner based on: 1) relevancy of the organic search results; and 2) robustness of the search experience, which included factors such as image and video inclusion, Search Assist, and Site Breakout...
According to the charts, Google returned the most relevant result 71 percent of the time, compared with Bing at 49 percent of the time and Yahoo 30 percent of the time.

Until I get more details on the study, which I wasn't able to find anywhere, I'm highly skeptical of the findings. It has serious flaws in its methodology.

Here are some reasons:
  1. Poor Metrics. Picking a winner by "most relevant result" is not a good (or standard) metric. And "robustness of the search experience" is not clearly defined.

    He should've used standard metrics that people understand: precision@1, precision@3, or something similar. Or he could've conducted a user study and measured how long it took participants to accomplish a standard task.

  2. "Relevance" is not defined. Is it binary, graded, or something else? See the Google rater guidelines (summary).

  3. Annotator agreement. How many people rated each query? Could they agree with one another? Did they look at the result pages or just the snippets? These are important questions.

  4. Query selection. Sampling only from popular queries is very biased. A more realistic random sample should be used, gathered from a company like Compete.
Mark and others doing tasks like this should consider Mechanical Turk (MT) and refer to Panos Ipeirotis for methods to handle noisy judgments. For example, if you wanted to evaluate relevance you could have raters judge pairs of documents to collect Pairwise Preference Judgments. For work using MT to collect judgments see Panos's post How good are you, Turker and a recent paper, Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers.
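
To make the metric and annotator-agreement points concrete, here is a minimal sketch in Python. It assumes binary relevance labels (1 = relevant, 0 = not) and exactly two raters; the function names are mine, and Cohen's kappa is only one of several agreement measures you could use. It is not how the Citigroup study scored anything.

```python
from collections import Counter


def precision_at_k(relevance, k):
    """Fraction of the top-k results judged relevant (binary labels)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Results missing from a short list count as non-relevant.
    return sum(relevance[:k]) / k


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting precision@k per engine, averaged over the query sample and accompanied by an agreement score, would say far more than a per-query "winner".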

Tuesday, July 7

Hadoop Data Serialization Battle: Avro, Protocol Buffers, Thrift

My current research project involves processing large text corpora using Hadoop. I'll have more to say about the details in a future post. Right now, I want to focus on reading/writing custom data types in Hadoop.

For example, say you want to create an inverted index with a "Posting" data type that has a term (text), a docId (long), and an array of occurrences (ints). One standard way is to write a class that implements Hadoop's Writable (and Comparable for keys) interfaces (see the Serialization section of Tom White's book). However, writing these classes is tedious, error-prone, and a maintenance hassle that has to be repeated for every non-trivial type you want to write. If I can, I want to avoid this. Let's talk about other options.
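
To give a feel for the hand-written approach, here is a rough Python analogue of such a class. It is only a sketch: real Hadoop Writables are Java classes, and the byte layout below (length-prefixed UTF-8 term, 8-byte docId, count-prefixed 4-byte occurrences) is my own assumption, not Hadoop's wire format. The method names write and read_fields simply mirror the shape of the Writable interface.

```python
import io
import struct
from dataclasses import dataclass, field
from typing import List


@dataclass
class Posting:
    term: str = ""
    doc_id: int = 0
    occurrences: List[int] = field(default_factory=list)

    def write(self, out):
        """Serialize every field by hand, in a fixed order."""
        encoded = self.term.encode("utf-8")
        out.write(struct.pack(">I", len(encoded)))  # term length in bytes
        out.write(encoded)
        out.write(struct.pack(">q", self.doc_id))   # 8-byte signed long
        out.write(struct.pack(">I", len(self.occurrences)))
        for position in self.occurrences:
            out.write(struct.pack(">i", position))  # 4-byte signed int

    def read_fields(self, inp):
        """Deserialize in exactly the same order as write()."""
        (term_len,) = struct.unpack(">I", inp.read(4))
        self.term = inp.read(term_len).decode("utf-8")
        (self.doc_id,) = struct.unpack(">q", inp.read(8))
        (count,) = struct.unpack(">I", inp.read(4))
        self.occurrences = [
            struct.unpack(">i", inp.read(4))[0] for _ in range(count)
        ]
```

Every field appears twice, once in each method, and the two orderings have to stay in sync by hand. That is the maintenance burden an IDL or schema is supposed to remove.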

The first two obvious choices are Apache Thrift, originally from Facebook (JIRA patch code), and Google's Protocol Buffers (JIRA patch). Both provide an Interface Definition Language (IDL) that is not language specific, which is nice. They plug into Hadoop via the Serializer package. You can also read Tom White's blog post from last year on the topic.

Of course, there is Streaming for non-Java jobs, which serializes everything as a string via stdin and stdout. It introduces a 20-100% overhead (according to Yahoo!) in job execution performance. It's great for rapid prototyping, but maybe not as much for terabyte-scale data processing!
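
For reference, a Streaming mapper is just a program that reads lines from stdin and writes tab-separated key/value lines to stdout. Below is a minimal sketch of a term-counting mapper; the function names and the lowercase/whitespace tokenization are my own choices for illustration.

```python
import sys


def map_line(line):
    """Yield (term, 1) pairs for one line of input text."""
    for term in line.strip().lower().split():
        yield term, 1


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read raw text lines and emit 'term<TAB>count' lines."""
    for line in stdin:
        for term, count in map_line(line):
            # Everything, including the integer count, is flattened to text here
            # and has to be parsed back into a number on the reducer side.
            stdout.write(f"{term}\t{count}\n")

# Used as an actual mapper script, the file would end with a call to run().
```

The string round trip on every record is where much of the overhead comes from, and it gets worse for structured values like the Posting above.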

Then, of course, there is the new kid on the block: Avro. Avro is a serialization and RPC framework created by Doug Cutting (Hadoop creator) after a hard look at PB and Thrift. It defines a data schema which is stored beside the data (not combined with it). The schemas are defined in JSON, so they are easy to parse in many languages. One big difference is that Avro doesn't require generated code stubs, although you have the option to generate them for statically typed languages.
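
As an illustration, this is roughly what an Avro record schema for the Posting type described earlier might look like. The schema is my own sketch rather than something from the Avro distribution, and the snippet uses only Python's standard json module (not the Avro library) to show that reading a schema takes nothing more than a JSON parser.

```python
import json

# Hypothetical schema for the Posting example: a record with three fields.
POSTING_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Posting",
  "fields": [
    {"name": "term", "type": "string"},
    {"name": "docId", "type": "long"},
    {"name": "occurrences", "type": {"type": "array", "items": "int"}}
  ]
}
"""

schema = json.loads(POSTING_SCHEMA_JSON)

# Map each field name to its declared type, straight from the parsed JSON.
field_types = {f["name"]: f["type"] for f in schema["fields"]}
```

Compare this with the hand-written class: the field list is declared once, and any language with a JSON parser can discover it at runtime.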

As it happens, Doug just asked people to try out the 1.0 release candidate. It looks promising; there were several recent posts (one, two) by Johan Oskarsson from Last.fm.

You should check out the data serialization benchmark. The benchmark results show that Avro is quite competitive already (except the timeCreate, which is odd).

In the meantime, comment and let me know your thoughts and experiences.

Tuesday, June 30

An Introduction to Bing's Query Categorization System

Brady Forrest has a two-part interview with lead Bing PM Sanaz Ahari. Part I introduces the query categorization system. In Part II, Sanaz goes into more detail on how the systems work. Both are high-level and not very technical. However, they offer an interesting insight into practical applications of query classification and clustering.

One interesting area that I want to highlight is that they are taking a broad category and starting to model the classes of entities that apply to the domain:
And we already have abilities to classify queries into domains and understand, okay, this query is a music query or this query is health query, et cetera, et cetera. And so the other problems that fall out of that is, okay, when people do do health queries, what are the categories that fall out of that? Like how do we know that people are going to care about diseases and symptoms, et cetera, et cetera. And then the next problem after that is how do we know that we have a comprehensive understanding of all diseases?
It sounds like they're doing some of it by hand, or at least in a semi-supervised manner. They don't go into details, but they mention the obvious suspects: Wikipedia, query logs, and document extraction.

One interesting note is that currently "20 percent of our queries have a categorized experience." It sounds like there is still a long way to go.

Thursday, June 25

Marti Hearst's Search User Interfaces Available

Marti Hearst just completed her new book Search User Interfaces. Additionally, you can read it online for free! Marti is a professor at UC Berkeley, and a leading expert on user interface design for search. To go with the book she's started a blog, SearchUpTicious.

I've been waiting for this book for a long time. I remember talking with her about it back at SIGIR 2006. More to come; in the meantime, go read it for yourself. Daniel also has an early review.

via Matthew Hurst.

Eclipse Galileo is here

The new version of Eclipse, Galileo (3.5), is here. Don't miss this big upgrade.

Wednesday, June 24

Linux 64-bit Desktops Not Worth Your Time

Recently, I've gotten several new computers. They are nice 64-bit computers with 6 GB+ of RAM. I have Linux on both (Ubuntu 9.04 and CentOS 5.3) and I'm regretting it.

Here are some 64-bit Linux desktop grievances:
  1. No cutting-edge browser. The following browsers would be acceptable: Firefox 3.5, Google Chrome, or Safari 4. No luck. The best you can do is Firefox 3.0, but at this point it's aging, unstable, and slow. (Update: there are unsupported builds for 64-bit Linux if you add the right repositories.)

  2. Driver issues. I spent several hours fighting with X and the NVidia drivers to get multiple-monitor support to work correctly. In fact, I'm still fighting with the drivers on my new computer; see this thread.

  3. Limited 64-bit Adobe support. This means no Acrobat Reader. There is limited Flash support with the version 10 Alpha.
Of course the usual suspects haven't improved:
  1. Movie and Game support. Netflix and my games depend on Windows-only software: DirectX. The people at Crossover/Wine are working on it, but it's still a ways away. I don't expect anything Microsoft to work on Linux properly within my lifetime.

  2. MS Office. Yes, again you can run Crossover, etc., but there are problems with stability and fonts. Granted, this isn't a 64-bit specific problem; it's inherent in all Linux desktop platforms. Don't get me started on OpenOffice. It's a good tool, but it has serious compatibility and usability problems.
Not to mention all the little annoyances that come with a Linux desktop install: the GPL Java and Flash versions shipped out of the box that you immediately have to rip out and replace because they screw things up.

Why not Windows?
I connect to Linux/Unix environments all day and night for work and other projects. Some of the tools we use only run (well) on Linux. For example, don't even waste your time trying to run Hadoop on Windows. Trust me, it's a time sink.

The reality is that I have to compromise and run both. I live between two worlds at any given time.

One of my solutions is to dual-boot with Windows 7, which works well. However, even a 64-bit Windows platform has its own compatibility problems. For example, my venerable Cisco VPN client doesn't support it; instead, I had to go with a third-party client, NCP. My other solution is virtualization. My computers are now fast enough that I can run Linux inside VMware (although not the other way around; there are still Direct3D and DirectX issues).

My punchline: install Windows and run Linux in a VM (or dual-boot as a backup plan).

A throwback to the old days of 64-bit
I remember back in 2002 at IBM we had problems with 64-bit support on the Itaniums ("Itanics") we were testing. However, with 64-bit being mainstream for several years now, I didn't expect as many issues as I ran into. I still love 64-bit for servers, but as a desktop it's annoying. I'm most disappointed with Firefox for their lack of support in 3.5. C'mon, really?

End Rant.