Tuesday, September 21

ECML PKDD 2010 Data Challenge: Measuring Web Data Quality

Yesterday the ECML PKDD Discovery Challenge results were presented. See the website for the papers of the winning participants. The winning team used a bagged C4.5 decision tree trained on the provided features.
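
The bagging-plus-decision-tree approach can be sketched with scikit-learn. This is a minimal illustration, not the winning team's actual pipeline: scikit-learn's DecisionTreeClassifier is a CART variant standing in for C4.5, and the data here is a synthetic placeholder rather than the challenge features.

```python
# Sketch of a bagged decision-tree classifier in the spirit of the winning
# entry. Assumptions: CART (scikit-learn) stands in for C4.5, and the
# features/labels are synthetic, not the actual challenge data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# placeholder feature matrix and labels
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

model = BaggingClassifier(
    DecisionTreeClassifier(),  # C4.5-like base learner
    n_estimators=50,           # trees trained on bootstrap resamples
    random_state=0,
)
model.fit(X, y)
print(model.score(X, y))  # training accuracy of the ensemble
```

Bagging trains each tree on a bootstrap resample and averages their votes, which reduces the variance that makes single decision trees unstable.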

A high-level overview from the website describes the challenge:
In this year's Discovery Challenge we target at more and different aspects. We want to develop site-level classification for the genre of the web sites (editorial, news, commercial, educational, "deep web", or Web spam and more) as well as their readability, authoritativeness, trustworthiness and neutrality.
The challenge dataset consists of 23M pages from 99K hosts in the .eu domain. Read the assessment guidelines.

The competition involves three tasks; see the full description of the tasks. Here is a summary:

1. Classification task (English)
  • Web Spam
  • News/Editorial
  • Commercial
  • Educational/Research
  • Discussion
  • Personal/Leisure
  • Neutrality: from 3 (normal) to 1 (problematic)
  • Bias: 1 flags significant problems
  • Trustiness: from 3 (normal) to 1 (problematic)

2. Quality task (English)
Quality is measured as an aggregate function of genre, trust, factuality, and bias; spam has the lowest quality (0).

3. Multilingual quality task (German and French)
Same as task 2, but for non-English.
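
The aggregate in task 2 could take many forms; here is one minimal sketch. The equal weighting, the rescaling to [0, 1], and the clamping of spam to 0 are assumptions for illustration — the challenge defined its own aggregate, which is not reproduced here.

```python
# Hypothetical aggregate quality score built from per-aspect labels on the
# challenge's 1 (problematic) to 3 (normal) scale. Weights and rescaling
# are illustrative assumptions, not the official challenge formula.
def quality(genre_score, trust, factuality, bias, is_spam):
    """Combine aspect labels (each on the 1-3 scale) into a [0, 1] score."""
    if is_spam:
        return 0.0  # spam is defined to have the lowest quality
    aspects = [genre_score, trust, factuality, bias]
    # average the aspect labels, then rescale 1..3 onto 0..1
    return (sum(aspects) / len(aspects) - 1) / 2

print(quality(3, 3, 3, 3, is_spam=False))  # best case -> 1.0
print(quality(1, 1, 1, 1, is_spam=False))  # worst non-spam -> 0.0
print(quality(3, 3, 3, 3, is_spam=True))   # spam -> 0.0
```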

The interesting aspect of the challenge is that it moves away from binary spam/not-spam labels toward assessing more complex aspects of information quality.


  1. Help someone with a PKDD clue deficiency. Why do they measure a classification task with NDCG?

  2. Anonymous 5:34 AM EDT

    Thanks for sharing. How do you get the web data? Did you get it with web scraping tools like import.io, Octoparse (www.octoparse.com), or Visual Web Ripper? Or did you just write the Python yourself?