Wednesday, July 8

Citigroup's Google and Bing relevance study is bunk

SELand has a story on a recent study by Citigroup analyst Mark Mahaney comparing the relevance of Google and Bing. AllThingsD also has coverage.
Over the past two weeks, we conducted 200 queries across the three major Search engines–Google, Yahoo! and Bing...After conducting the same query across all three Search sites, we picked a winner based on: 1) relevancy of the organic search results; and 2) robustness of the search experience, which included factors such as image and video inclusion, Search Assist, and Site Breakout...
According to the charts, Google returned the most relevant result 71 percent of the time, compared with Bing at 49 percent of the time and Yahoo 30 percent of the time.

Until I can get more details on the study, which I wasn't able to find anywhere, I'm highly skeptical of the findings: the methodology, as described, has serious flaws.

Here are some reasons:
  1. Poor metrics. "Most relevant result" is not a good (or standard) metric, and "robustness of the search experience" is not clearly defined.

    He should've used standard metrics that people understand, such as precision@1 or precision@3. Or he could've conducted a user study and measured how long it took users to accomplish a standard task.

  2. "Relevance" is not defined. Is it binary, graded, or something else? See the Google rater guidelines (summary).

  3. Annotator agreement. How many people rated each query? Did they agree with one another? Did they look at the full result pages or just the snippets? These are important questions.

  4. Query selection. Sampling only popular queries introduces strong bias. A more realistic, random sample should be drawn, e.g. from query logs gathered by a company like Compete.
Mark and others doing tasks like this should consider Mechanical Turk (MT) and refer to Panos Ipeirotis for methods of handling noisy judgments. For example, to evaluate relevance you could have raters judge pairs of documents and collect pairwise preference judgments. For work using MT to collect judgments, see Panos's post "How good are you, Turker?" and a recent paper, "Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers."
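To make point 1 concrete, precision@k is just the fraction of the top-k results judged relevant. Here's a minimal sketch; the judgments below are made up for illustration:

```python
def precision_at_k(relevances, k):
    """Fraction of the top-k results judged relevant.

    relevances: list of 0/1 judgments for a ranked result list,
    with position 0 being the top-ranked result.
    """
    return sum(relevances[:k]) / k

# Hypothetical binary judgments for one query's top 5 results (1 = relevant).
judgments = [1, 0, 1, 1, 0]
print(precision_at_k(judgments, 1))  # 1.0
print(precision_at_k(judgments, 3))  # 0.6666666666666666
```

Averaging this over a well-sampled query set gives a number other researchers can actually compare against, unlike a hand-picked "winner" per query.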
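On point 3, the standard way to check whether raters agree beyond chance is an agreement statistic such as Cohen's kappa. A quick sketch for two raters (the labels are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    labels_a, labels_b: the two raters' labels for the same items, in order.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each rater labeled at random with their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical relevance labels from two raters on six query-result pairs.
rater_a = ["rel", "rel", "non", "rel", "non", "non"]
rater_b = ["rel", "non", "non", "rel", "non", "rel"]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.33
```

A kappa near 0 means the raters agree no more than chance would predict, which is exactly the failure mode an unreported study design can hide.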
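And in the spirit of the "Get Another Label?" paper: with cheap MT workers you collect several noisy judgments per item and aggregate them. The simplest aggregation is majority vote; a toy sketch with invented votes:

```python
from collections import Counter

def majority_label(judgments):
    """Resolve multiple noisy labels for one item by simple majority vote."""
    counts = Counter(judgments)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical: five Turkers judged which engine's result was better
# for one query in a pairwise preference comparison.
votes = ["A", "A", "B", "A", "B"]
print(majority_label(votes))  # A
```

More sophisticated schemes weight each worker by an estimate of their accuracy, which is what the paper above and Panos's work explore.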

1 comment:

  1. Dolores Labs did just what you're suggesting with Mechanical Turk, but their results look like noise (as I commented in their comments section). [Ack, this interface isn't letting me paste, just look on; it's currently the most recent post.]

    I like Bing a lot and wrote up a blog entry about it myself:

    The NY Times just ran a comparison today by David Pogue, who also liked Bing a lot.

    I've been using Bing for a month now, and the only place I've had to go back to Google is for computer-science-related topics, which Google seems to index much, much better than Bing.