Wednesday, July 8

KDD Best Paper Award: Modeling Temporal Dynamics, a key to winning the Netflix Challenge

This week Collaborative Filtering with Temporal Dynamics won the best paper award at KDD 2009. The most recent Communications of the ACM has an interview along with a short summary. The author, Yehuda Koren of Yahoo! Research, was on the winning team of the Netflix Challenge, which recently broke the 10% improvement threshold.

The paper details how people's tastes change over time and how modeling that change improves rating predictions. From the CACM article:
...although recent data may reveal more about a user's current preferences than older data, simply underweighting older ratings loses too much valuable information for that approach to work. The trick to not tossing the baby with the bathwater is to retain everything that predicts the user's long-term behavior while filtering out temporary noise.
Fascinating. I look forward to reading this paper in detail. This has broad implications for leveraging user data in social media applications.
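
To make the idea concrete, here is a toy sketch of a time-aware rating baseline. This is my own illustration of the general approach, not the paper's exact model: keep a stable long-term user bias and add a small drift term, instead of simply down-weighting old ratings. The drift function and the 0.4 exponent are placeholder choices.

    // Toy illustration (mine, not the paper's exact model): a rating baseline
    // that keeps a stable long-term user bias and adds a time-drift term,
    // rather than simply down-weighting old ratings.
    double timeAwareBaseline(double globalMean, double userBias, double itemBias,
                             double userDriftRate, double daysFromUserMeanDate) {
        // Signed, sub-linear deviation from the user's mean rating date;
        // the 0.4 exponent is just a placeholder constant.
        double dev = Math.signum(daysFromUserMeanDate)
                   * Math.pow(Math.abs(daysFromUserMeanDate), 0.4);
        return globalMean + itemBias + userBias + userDriftRate * dev;
    }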

Citigroup's Google vs. Bing relevance study is bunk

SELand has a story on a recent study conducted by Citigroup analyst Mark Mahaney that compares the relevance of Google and Bing. AllThingsD also has coverage.
Over the past two weeks, we conducted 200 queries across the three major Search engines–Google, Yahoo! and Bing...After conducting the same query across all three Search sites, we picked a winner based on: 1) relevancy of the organic search results; and 2) robustness of the search experience, which included factors such as image and video inclusion, Search Assist, and Site Breakout...
According to the charts, Google returned the most relevant result 71 percent of the time, compared with Bing at 49 percent of the time and Yahoo 30 percent of the time.

I couldn't find the full study anywhere, and until I get more details I'm highly skeptical of the findings: the methodology, as reported, has serious flaws.

Here are some reasons:
  1. Poor Metrics. Picking a single "most relevant result" winner is not a good (or standard) metric, and "robustness of the search experience" is not clearly defined.

    He should've used standard metrics that people understand, such as precision@1 or precision@3 (see the sketch after this list), or conducted a user study that measures how long users take to accomplish a standard task.

  2. "Relevance" is not defined. Binary, graded, etc... See the Google rater guidelines (summary).

  3. Annotator agreement. How many people rated each query? Did they agree with one another? Did they look at the full result pages or just the snippets? These are important questions.

  4. Query selection. Sampling only popular queries introduces heavy bias; a more realistic, random sample should be drawn, for example from a company like Compete.
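
For what it's worth, here is a minimal sketch of precision@k, the kind of metric I mean in point 1. The ranked result IDs and the relevance set are hypothetical inputs a study would have to collect from raters.

    import java.util.List;
    import java.util.Set;

    // Minimal sketch of precision@k: the fraction of the top-k results that
    // raters judged relevant (assumes k > 0). A real study also needs graded
    // judgments, multiple raters, and agreement statistics.
    static double precisionAtK(List<String> rankedResultIds, Set<String> relevantIds, int k) {
        int hits = 0;
        int n = Math.min(k, rankedResultIds.size());
        for (int i = 0; i < n; i++) {
            if (relevantIds.contains(rankedResultIds.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }
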
Mark and others doing tasks like this should consider Mechanical Turk (MT) and refer to Panos Ipeirotis for methods to handle noisy judgments. For example, if you wanted to evaluate relevance you could have raters judge pairs of documents to collect Pairwise Preference Judgments. For work using MT to collect judgments see Panos's post How good are you, Turker and a recent paper, Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers.
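
And to give a taste of handling redundant, noisy Turker judgments, here is the simplest possible aggregation, a plain majority vote per query/result pair. The Get Another Label work goes much further by estimating each labeler's quality, but this sketch shows the basic shape.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Simplest aggregation of redundant, noisy judgments: a majority vote over
    // the labels several workers gave the same item. Returns null for an empty
    // list; real setups also estimate and weight each worker's quality.
    static String majorityLabel(List<String> workerLabels) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String best = null;
        int bestCount = 0;
        for (String label : workerLabels) {
            int c = counts.containsKey(label) ? counts.get(label) + 1 : 1;
            counts.put(label, c);
            if (c > bestCount) {
                bestCount = c;
                best = label;
            }
        }
        return best;
    }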

Tuesday, July 7

Hadoop Data Serialization Battle: Avro, Protocol Buffers, Thrift

My current research project involves processing large text corpora using Hadoop. I'll have more to say about the details in a future post. Right now, I want to focus on reading/writing custom data types in Hadoop.

For example, say you want to create an inverted index with a "Posting" data type that has a term (text), a docId (long), and an array of occurrences (ints). One standard way is to write a class that implements Hadoop's Writable interface (and Comparable for keys); see the Serialization section of Tom White's book. However, writing these classes is tedious, error-prone, and a maintenance hassle, and it has to be done for every non-trivial type you want to read or write. If I can, I want to avoid this. Let's talk about other options.
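
But first, here's roughly what the hand-written route looks like for that Posting type, so you can see the boilerplate I'd like to avoid (a sketch, not production code): every field gets serialized and deserialized by hand, in the same order, and a key type would additionally need compareTo and hashCode.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Sketch of a hand-written Posting value type for Hadoop.
    public class Posting implements Writable {
        private Text term = new Text();
        private long docId;
        private int[] occurrences = new int[0];

        public void write(DataOutput out) throws IOException {
            term.write(out);
            out.writeLong(docId);
            out.writeInt(occurrences.length);
            for (int pos : occurrences) {
                out.writeInt(pos);
            }
        }

        public void readFields(DataInput in) throws IOException {
            term.readFields(in);
            docId = in.readLong();
            int n = in.readInt();
            occurrences = new int[n];
            for (int i = 0; i < n; i++) {
                occurrences[i] = in.readInt();
            }
        }
    }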

The first two obvious choices are Apache Thrift, originally from Facebook (JIRA patch code), and Google's Protocol Buffers (JIRA patch). Both provide a language-neutral Interface Definition Language (IDL), which is nice, and both plug into Hadoop via the Serializer package. You can also read Tom White's blog post from last year on the topic.
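
To give a flavor of the IDL route, here's what the same Posting record might look like as a Protocol Buffers definition (my own sketch, in proto2 syntax; the Thrift version would look very similar):

    // Hypothetical .proto definition of the Posting record.
    message Posting {
      required string term = 1;
      required int64 doc_id = 2;
      repeated int32 occurrences = 3;
    }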

Of course, there is also Streaming for non-Java jobs, which serializes everything as strings over stdin and stdout. According to Yahoo!, it introduces a 20-100% overhead in job execution time. It's great for rapid prototyping, but maybe not as much for terabyte-scale data processing!

Then, of course, there is the new kid on the block: Avro. Avro is a serialization and RPC framework created by Doug Cutting (Hadoop's creator) after a hard look at PB and Thrift. It defines a data schema that is stored alongside the data (not combined with it). Schemas are defined in JSON, so they are easy to parse in many languages. One big difference is that Avro doesn't require generated code stubs, although you have the option to generate them for statically typed languages.
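
For example, the Posting type from above might be described by an Avro schema along these lines (a sketch; check the spec for the exact syntax):

    {
      "type": "record",
      "name": "Posting",
      "fields": [
        {"name": "term",        "type": "string"},
        {"name": "docId",       "type": "long"},
        {"name": "occurrences", "type": {"type": "array", "items": "int"}}
      ]
    }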

As it happens, Doug just asked people to try out the 1.0 release candidate. It looks promising; there have been several recent posts (one, two) by Johan Oskarsson of Last.fm.

You should check out the data serialization benchmark. The results show that Avro is already quite competitive (except for the timeCreate number, which is odd).

In the meantime, comment and let me know your thoughts and experiences.