Wednesday, September 10

TREC 2009 Blog Track Thoughts

Iadh asked for ideas and comments for the 2009 TREC blog track.

First, I'm looking forward to the new blog corpus. The 2006 blog corpus is small and covers only eleven weeks. Hopefully, the new 2008 corpus will be much larger and span a longer time frame, one that includes the upcoming US presidential election and all of the controversy surrounding it.

I read What Should Blog Search Look Like by Hearst, Hurst, and Dumais. The paper goes over three key tasks:
1. Find out what are people thinking or feeling about X over time.
2. Find good blogs/authors to read.
3. Find useful information that was published in blogs sometime in the past.
The paper focuses heavily on the search features needed to support these tasks. Its main criticism of the current blog distillation task (roughly task 2 above) is that it focuses only on relevance and does not incorporate information about the quality of the content or the authority of the blog discovered.

I also read On the TREC Blog Track, which summarizes the last two years of the blog track. It describes an extension to the existing opinion finding task that I think could be really interesting:
For example, for a given product, one might wish to have a list of its positive and negative features, supported by a set of opinionated sentences extracted from blogs (Popescu & Etzioni 2005). Such a task complements work in the TREC Question Answering track.
An interesting extension to this would be to try and summarize the positive and negative opinions on individual features.

I focused on the section Lessons Learnt and future tasks. The paper outlines three new possible tasks:
  • Feed/Information Filtering - Inform me of new feeds or new blog posts about X.
  • Story Detection - Identify all posts related to story X. A possible variant is to ask the participating systems to provide the top important stories or events for a given date or a given range of dates.
  • Information Leaders - Identify all information leaders about topic X.
The first two sound very interesting. They sound similar to some of the tasks from the older Topic Detection and Tracking (TDT) community, which worked with news data.

Personally, I really like the first two because I spend a lot of my time reading the blogs of leading technologists and researchers to stay on top of interesting topics in information retrieval and related areas. The current alert systems (e.g. Google Alerts) are inadequate; they don't find all of the new information and often return many duplicates. A subtask here could be linking and deduping different versions of the same story.
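
To make the deduping idea a bit more concrete, here is a minimal sketch of one way it might work: group posts whose word shingles overlap heavily (Jaccard similarity). This is only my own illustration, not something proposed by the track. The function names (shingles, jaccard, dedupe), the shingle size, and the 0.5 threshold are all arbitrary assumptions, and a real system would need something more scalable, such as MinHash.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short posts become a single shingle (or nothing if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def dedupe(posts, threshold=0.5, k=3):
    """Greedily group posts: each post joins the first group whose
    representative (the group's first post) is similar enough.
    The threshold is a hypothetical value, not a tuned one."""
    groups = []  # list of (representative shingle set, member posts)
    for post in posts:
        s = shingles(post, k)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(post)
                break
        else:
            groups.append((s, [post]))
    return [members for _, members in groups]


if __name__ == "__main__":
    posts = [
        "The new TREC blog corpus will cover the US presidential election.",
        "The new TREC blog corpus will cover the US presidential election!",
        "Google Alerts often returns many duplicate results.",
    ]
    for group in dedupe(posts):
        print(len(group), "version(s):", group[0])
```

An alert system could then show one post per group and link the remaining versions to it, rather than sending each one as a separate alert.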

For the second task, it's interesting to find all posts about a story. However, it isn't very realistic to find ALL posts. A primary reason is that not all posts are useful or worth reading. For example, a post may simply be a link to the story without any other content; this is relevant but not very useful. Again, I would like to incorporate some sense of quality: find the highest quality posts on story X.

To me, the third task is slightly less interesting. However, it would be interesting to try to link the conversations together and track the discussion across blogs (including both comments and posts). The end goal might be to discover novel subgroups that branch off from the original story.

Across all of these, one theme that stands out is the need to find not just relevant information, but posts or blogs that contain quality or authoritative content.
