Friday, September 12

New Information Retrieval Group Blog: Probably Irrelevant

Jon introduced a new group blog for readers interested in Information Retrieval R&D: Probably Irrelevant. I like the name ;-)

Fernando Diaz, a recent CIIR alumnus, has the first post: Blogs, queries, corpora. He's continuing the discussion that Iadh started on tasks for the TREC 2009 blog track (see my earlier post in response). Fernando focuses on the origins of the current TREC tasks and on deriving future tasks from the behavior of real-world users of blog search engines. Fernando writes:
One question I hope will be resolved in the comments is where these query types came from. Are they derived from actual blog searchers?... One approach would be to inspect query logs to blog search engines for different retrieval scenarios and then improve performance for those scenarios.
He poses a very good question. I don't recall seeing any published research that analyzes user behavior with blog search log data. Ultimately, the problem comes back to a fundamental issue: academia struggles to create relevant and realistic test scenarios without access to log data from real-world systems. Still, we can at least try to improve what we have today.

I would like to see TREC topics begin to model the interactive nature of search. A starting point is acknowledging that users enter multiple queries in order to find information. Today, a TREC topic is only a single query, which is unrealistic and overly simplistic. As a first step, I advocate multi-query topics built from query refinement chains. Evaluation would be performed on each query in the chain, and the results for the whole chain would then be combined. Thoughts?
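
To make the idea a bit more concrete, here is a minimal sketch of how a multi-query topic might be scored. Everything in it is an assumption for illustration: the function names, the example queries and document ids, the use of average precision as the per-query measure, and especially the plain mean used to combine the chain. How the per-query results should really be combined is exactly the open question.

```python
from statistics import mean


def average_precision(ranking, relevant):
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def evaluate_chain(chain_rankings, relevant):
    """Score every query in a refinement chain, then combine the scores.

    The combination is a simple arithmetic mean, which is an assumption;
    weighting later refinements differently would be equally plausible.
    """
    per_query = [average_precision(r, relevant) for r in chain_rankings]
    return per_query, mean(per_query)


# Hypothetical topic: one information need, three successive query refinements.
chain = ["blog search", "blog search evaluation", "trec blog track tasks"]
rankings = [
    ["d9", "d1", "d7"],  # results for the first query
    ["d1", "d9", "d2"],  # results after the first refinement
    ["d1", "d2", "d3"],  # results after the second refinement
]
relevant = {"d1", "d2"}  # judgments shared across the whole chain

per_query, chain_score = evaluate_chain(rankings, relevant)
for query, score in zip(chain, per_query):
    print(f"{query!r}: AP = {score:.3f}")
print(f"chain score (mean AP) = {chain_score:.3f}")
```

Sharing one set of relevance judgments across the chain keeps the sketch simple; a real task definition might instead judge each refinement against its own, narrower intent.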


  1. It's a good question, but it's a misleading one. Actually, the tasks are motivated by a commercial blog-search query log [4]. All opinion retrieval queries were extracted from another commercial blog search engine query log [1,2,3].

    Further reading:
  2. Thanks for the coverage, Jeff.

  3. Craig, thanks for the follow-up. The Study of Blog Search is interesting. I would like to see a more recent study with logs from a more mainstream engine (Technorati or Google Blog Search).

    It sounds like this area (task creation, origins of the blog track, etc.) would make a good post for the Terrier Team blog.

    I knew the queries for the opinion task were derived from real query logs. I guess I wonder whether the real users' intent was opinion finding, or whether they were looking for news. Did you try to separate or identify these?

    What about the distillation task? It's probably there too if I read more ;-).