Thursday, March 3

Google's War on Content Farms: Project Big Panda

In late February Google launched a significant update to its ranking algorithm to address "shallow content" pages. Externally the change has been called the "Farmer" update; internally it is known as "Panda".

Amit Singhal and Matt Cutts posted about the change on the Google blog in "Finding more high quality sites in search". It reduced the rankings of "low quality sites" that aggregated content from other websites and didn't add a significant amount of value for users. According to the post, the update affected 11.8% of queries. They also launched the Chrome Blocklist Extension to let people block websites from their Google results. O'Reilly Radar published an article with a very good overview of the discussion.

What is behind the change? The most informative article is a recent Wired interview by Steven Levy, "The ‘Panda’ That Hates Farms". In it, Levy talks with Matt Cutts and Amit Singhal, who managed the update.

How did they do it? In short, they built a document quality classifier trained on lots of rater data. Here are some of the questions from the article that they asked raters:
  • Would you be comfortable giving this site your credit card?
  • Would you be comfortable giving medicine prescribed by this site to your kids?
  • Do you consider this site to be authoritative?
  • Would it be okay if this was in a magazine?
  • Does this site have excessive ads?
These questions seem to probe the authoritativeness and trustworthiness of the content on a page. The results were also confirmed by an 84% overlap between sites downgraded in the change and those that people blocked using the Chrome extension, even though the blocking data is not used as a feature in the update.
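To make the idea of a "quality classifier trained on rater data" concrete, here is a minimal sketch. It is not Google's method: the two page signals (ad_density, duplicate_fraction), the toy numbers, and the function names are all hypothetical, invented for illustration. The only assumption taken from the interview is the general shape: human raters label sites as high or low quality, and a classifier learns to predict those labels from measurable signals. Plain logistic regression stands in for whatever model was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def quality_score(weights, bias, signals):
    """Estimated probability that a page is high quality, given its signals."""
    return sigmoid(bias + sum(w * s for w, s in zip(weights, signals)))

def train(examples, labels, epochs=500, lr=0.5):
    """Fit logistic regression with stochastic gradient descent."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for signals, label in zip(examples, labels):
            error = quality_score(weights, bias, signals) - label
            weights = [w - lr * error * s for w, s in zip(weights, signals)]
            bias -= lr * error
    return weights, bias

# Hypothetical signals per page: [ad_density, duplicate_fraction], each in [0, 1].
# Toy values made up for this sketch, not real measurements.
pages = [
    [0.10, 0.00], [0.20, 0.10], [0.05, 0.05],  # raters judged these high quality
    [0.80, 0.90], [0.70, 0.60], [0.90, 0.80],  # raters judged these low quality
]
# Labels derived from rater judgments: 1 = high quality, 0 = low quality.
rater_labels = [1, 1, 1, 0, 0, 0]

weights, bias = train(pages, rater_labels)
print(quality_score(weights, bias, [0.10, 0.05]))  # a page resembling the good examples
print(quality_score(weights, bias, [0.85, 0.85]))  # a page resembling the shallow examples
```

The point of the sketch is the division of labor: raters supply the judgments, and the classifier generalizes them to pages no rater has seen.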

How did Google become overrun with almost-spam content? Amit sheds a bit of light on the question in one of his answers:
"So we did Caffeine in late 2009. Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow."
The interview then gets bogged down in bigger issues around editorial process and transparency, which are important but not as technically interesting.

2 comments:

  1. Interesting assessment approach. Much more appealing than just training assessors to make a binary decision of whether a site is good or bad. There's so many facets to relevance, spaminess, and quality in general that binary (or even graded) relevance just doesn't give us enough information.

  2. Nice project & excellent thinking for this work…….:)
