Friday, September 12

New Information Retrieval Group Blog: Probably Irrelevant

Jon introduced a new group blog for readers interested in Information Retrieval R&D: Probably Irrelevant. I like the name ;-)

Fernando Diaz, a recent CIIR alumnus, has the first post: Blogs, queries, corpora. He's continuing the discussion that Iadh started on tasks for the TREC 2009 blog track (see my earlier post in response). Fernando focuses on the origins of the current TREC tasks and on deriving future tasks from the behavior of real-world users of blog search engines. Fernando writes,
One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers?... One approach would be to inspect query logs to blog search engines for different retrieval scenarios and then improve performance for those scenarios.
He poses a very good question. I don't recall seeing any published research analyzing user behavior with blog search log data. Ultimately, the problem comes back to a fundamental issue: academia struggles to create relevant and realistic test scenarios without access to log data from real-world systems. Still, we can at least try to improve on what we have today.

I would like to see TREC topics begin to model the interactive nature of search. A starting point is acknowledging that users enter multiple queries in order to find information. Today, TREC topics consist of a single query, which is unrealistic and overly simplistic. I advocate developing multi-query topics from query refinement chains. Evaluation would be performed on each query in the chain, and the results combined across the chain. Thoughts?
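To make the idea concrete, here is a minimal sketch of how such a chain-based evaluation might work. Everything here is illustrative: the per-query measure (average precision), the combination rule (a simple mean over the chain), and all document IDs are my own assumptions, not a proposed TREC specification.

```python
# Hypothetical sketch: evaluating a multi-query TREC topic as a query chain.
# Each query in the chain is scored individually (here with average precision),
# and the chain score is the mean over queries. All names and data are made up.

def average_precision(ranked_docs, relevant):
    """Average precision of a ranked list against a set of relevant doc ids."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def chain_score(chain_results, relevant):
    """Combine per-query scores for one topic's refinement chain by averaging."""
    scores = [average_precision(ranked, relevant) for ranked in chain_results]
    return sum(scores) / len(scores)

# One topic: a user issues three successively refined queries.
relevant = {"d1", "d2", "d3"}
chain_results = [
    ["d9", "d1", "d8"],  # vague first query
    ["d1", "d2", "d7"],  # refined
    ["d1", "d2", "d3"],  # final refinement
]
print(round(chain_score(chain_results, relevant), 3))  # prints 0.611
```

A mean over the chain is only one choice; weighting later (more refined) queries more heavily, or scoring the chain by its best query, would model different assumptions about user effort.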

Wednesday, September 10

TREC 2009 Blog Track Thoughts

Iadh asked for ideas and comments for the 2009 TREC blog track.

First, I'm looking forward to the new blog corpus. The 2006 blog corpus is small and covers only eleven weeks. Hopefully, the new 2008 corpus will be much larger and span a longer time frame, including the upcoming US presidential election and all of the controversy surrounding it.

I read What Should Blog Search Look Like by Hearst, Hurst, and Dumais. The paper covers three key tasks:
1. Find out what are people thinking or feeling about X over time.
2. Find good blogs/authors to read.
3. Find useful information that was published in blogs sometime in the past.
The paper focuses heavily on the search features needed to support these tasks. Its main criticism of the current blog distillation task (roughly task 2 above) is that it focuses only on relevance and does not incorporate the quality of the content or the authority of the blogs discovered.

I also read On the TREC Blog Track, which summarizes the last two years of the blog track. It talks about an extension to the existing opinion finding task that I think could be really interesting:
For example, for a given product, one might wish to have a list of its positive and negative features, supported by a set of opinionated sentences extracted from blogs (Popescu & Etzioni 2005). Such a task complements work in the TREC Question Answering track.
An interesting extension to this would be to try and summarize the positive and negative opinions on individual features.
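As a toy illustration of that extension: suppose the hard NLP steps (feature extraction and polarity classification) have already run, so each opinionated sentence arrives tagged with a product feature and a polarity. The summarization step then just groups sentences per feature. All function names and data below are made up for the sketch.

```python
# Toy sketch: group opinionated sentences into per-feature positive/negative
# buckets. Assumes upstream systems already tagged each sentence with a
# (feature, polarity) pair; all names and example data are illustrative.
from collections import defaultdict

def summarize_opinions(tagged_sentences):
    """Map each feature to {'positive': [...], 'negative': [...]} lists."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for feature, polarity, sentence in tagged_sentences:
        summary[feature][polarity].append(sentence)
    return dict(summary)

tagged = [
    ("battery", "positive", "The battery easily lasts two days."),
    ("battery", "negative", "Battery drains fast when gaming."),
    ("screen",  "positive", "The screen is bright and sharp."),
]
for feature, opinions in summarize_opinions(tagged).items():
    print(feature, len(opinions["positive"]), len(opinions["negative"]))
```

The interesting research questions live upstream of this grouping step: extracting the features, classifying polarity, and choosing which supporting sentences best summarize each side.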

I focused on the section Lessons Learnt and future tasks. The paper outlines three new possible tasks:
  • Feed/Information Filtering - Inform me of new feeds or new blog posts about X.
  • Story Detection - Identify all posts related to story X. A possible variant is to ask the participating systems to provide the top important stories or events for a given date or a given range of dates.
  • Information Leaders - Identify all information leaders about topic X.
The first two sound very interesting. They are similar to some of the tasks from the older Topic Detection and Tracking (TDT) community, which worked with news data.

Personally, I really like the first two because I spend a lot of my time reading blogs by leading technologists and researchers to stay on top of interesting topics in information retrieval and related areas. The current alert systems (e.g. Google Alerts) are inadequate; they don't find all of the new information and often return many duplicates. A subtask here could be linking and deduping different versions of the same story.
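A simple baseline for that dedup subtask is near-duplicate detection with word shingles and Jaccard similarity. The sketch below is purely illustrative: the tokenization, shingle size, and similarity threshold are assumptions for the example, not settings from any real alert system.

```python
# Illustrative baseline for the dedup subtask: flag near-duplicate posts using
# word-shingle Jaccard similarity. All parameters and data are assumptions.

def shingles(text, k=2):
    """Set of k-word shingles from whitespace-tokenized, lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def near_duplicates(posts, threshold=0.6):
    """Return index pairs of posts whose shingle overlap meets the threshold."""
    sigs = [shingles(p) for p in posts]
    return [(i, j)
            for i in range(len(posts))
            for j in range(i + 1, len(posts))
            if jaccard(sigs[i], sigs[j]) >= threshold]

posts = [
    "apple announces new phone at press event today",
    "apple announces new phone at a press event today",
    "local team wins championship game in overtime",
]
print(near_duplicates(posts))  # prints [(0, 1)]
```

At web scale the pairwise comparison would be replaced by sketching techniques such as MinHash, but the underlying similarity notion is the same.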

For the second task, it's interesting to find all posts about a story. However, it isn't very realistic to find ALL posts, primarily because not all posts are useful or worth reading. For example, a post may simply be a link to the story without any other content; such a post is relevant but not very useful. Again, I would like to incorporate some sense of quality: find the highest-quality posts on story X.

To me, the third task is slightly less interesting. However, it would be worthwhile to try to link the conversations together and track the discussion across blogs (including both comments and posts). The end goal might be to discover novel subtopics branching off the original story.

Across all of these tasks, one theme stands out: the need to find not just relevant information, but posts and blogs that contain quality or authoritative content.

TREC Blog Search: 2008 and beyond

Iadh over on Terrier Team has an update on the TREC 2008 blog track and is asking for thoughts and comments for the proposed 2009 edition. Please go comment or e-mail him.

To start, he gives a brief history of the track over the three years it has run.
Our main findings and conclusions from the first two years of the Blog track at TREC are summarised in the ICWSM 2008 paper, entitled On the Trec Blog Track. The Blog track 2006 and 2007 overview papers provide further detailed analysis and results.

He also points to a position paper by Marti Hearst et al., What Should Blog Search Look Like?, that will be presented at CIKM 2008.

I will read both papers and give it some thought. You will hear from me soon ;-).

Monday, September 8

Solving Search: A Game of Guess the Magic Words

Maybe sometimes when you are searching for information you feel like Popeye trying to open the magic cave door to rescue Olive Oyl from the forty thieves:
I wonder what words he used when he opened this door? Open sissy, open cecil, no that can't be it! whoop, it's giving way, it's giving way...
- Popeye (watch the full episode via the IA, or jump to the second half on YouTube)
Unfortunately, Popeye's super strength won't help you with your search tasks. Or perhaps you feel like the bumbling fool Hasan attempting to prevent Daffy from stealing the master's treasure.

Using today's search engines is like playing a game of Guess the Magic Words. Guess the right words, and Open Sesame! Guess poorly, and you bang your head against the cave door for minutes or hours. How good are you?

The problem for search engines today is this: if you type a long query that gives the search engine more information about your information need, you are likely to get worse results than if you had entered only a few brief keywords.

Barney Pell of Powerset describes the current language of search engines as keywordese. Enter too few of these keywords and your query is likely too vague; enter too many and relevant documents are mistakenly filtered out. And too often we simply don't know the 'magic words' to find the desired information.

Until search engines can utilize the information in long queries without being overwhelmed by the 'noise', search will remain broken.

Marissa Mayer said in a recent LA Times Article on Google's 10 year anniversary:
I think there will be a continued focus on innovation, particularly in search. Search is an unsolved problem. We have a good 90 to 95% of the solution, but there is a lot to go in the remaining 10%.
Search isn't 90% solved. The number is hard to pin down because search is constantly evolving. Regardless, beating the Guess the Magic Words level in the game of search is still a long way off.