Friday, October 23

Conference Coverage: RecSys09 and HCIR09

I'm not attending either conference, but I'm trying to follow what's going on.

The 2009 ACM Conference on Recommender Systems is happening this weekend in New York. Follow the conference on Twitter at #recsys09. I'm particularly looking for coverage of the Netflix Prize panel: "What did we learn from the Netflix Prize? Perspectives from some of the leading contestants."

The HCIR Workshop is also taking place this weekend in Washington, DC. Daniel is one of the chairs. You can follow coverage on Twitter at #hcir09, and the proceedings for the workshop are available. Henry is attending and taking part in a panel, so hopefully I'll be able to share some of his highlights.

Tuesday, October 20

Why I Don't Want Your Search Log Data

The IR field is largely driven by empirical experiments to validate theory. Today, one of the biggest perceived problems is that academia does not have access to the query and click log data collected by large web search engines. While this data is critical for improving a search service and useful for other interesting experiments, I believe broad access to it would ultimately distract researchers with the wrong problems.

Data is limiting. Once you have it, you immediately start analyzing it and developing methods to improve relevance: for example, identifying navigational queries using click entropy, or applying supervised machine learning to rank documents and weight features. These are important and practical things to do if you run a service, but they aren't the fundamental problems that require research.
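
To make the click-entropy idea concrete, here's a minimal Python sketch. The log entries and the one-bit threshold are invented for illustration; a real system would tune the cutoff against labeled queries. The intuition: if nearly all clicks for a query land on one URL, the click distribution has low entropy and the query is likely navigational.

```python
import math
from collections import Counter

def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the click distribution for one query.

    Low entropy means clicks concentrate on a few results, which is
    characteristic of navigational queries.
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical (query, clicked_url) pairs from a search log.
log = [
    ("facebook", "facebook.com"),
    ("facebook", "facebook.com"),
    ("facebook", "facebook.com"),
    ("facebook", "en.wikipedia.org/wiki/Facebook"),
    ("jaguar", "jaguar.com"),
    ("jaguar", "en.wikipedia.org/wiki/Jaguar"),
    ("jaguar", "jaguarusa.com"),
    ("jaguar", "nationalgeographic.com/animals/jaguar"),
]

# Group clicks by query, then label each query by its entropy.
by_query = {}
for query, url in log:
    by_query.setdefault(query, []).append(url)

for query, urls in sorted(by_query.items()):
    h = click_entropy(urls)
    label = "navigational" if h < 1.0 else "informational"  # 1-bit cutoff is an assumption
    print(f"{query}: {h:.2f} bits -> {label}")
```

Running this labels "facebook" (0.81 bits) navigational and "jaguar" (2.00 bits) informational, which matches the intuition: useful engineering, but not a deep research question.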

The IR community has its own data: TREC. TREC deserves credit for driving significant improvements in IR technology in the early and mid 90s. However, it too can be limiting. For many in academia, success and failure are measured by TREC retrieval performance. Too often, a researcher expends superhuman effort to eke out incremental improvements on well-studied corpora, improvements that won't make a significant long-term contribution to the field. What's missing are the big leaps: disruptive innovation.

Academia should be building solutions for tomorrow's data, not yesterday's.

What will queries and documents look like in 5 or even 10 years, and how can we improve retrieval for them? It's not an easy question to answer, but you can watch Bruce Croft's CIKM keynote for some ideas. Without going into too much detail, also consider trends like cross-language retrieval, structured data, and search from mobile phones.

One proven pattern is that breakthroughs often come from synthesizing a model from a radically different domain. One intriguing recent direction is Keith van Rijsbergen's work on The Geometry of Information Retrieval, which applies models from quantum mechanics to document retrieval. Similarly, is there potential for models of information derived from molecular genetics and other fields? If you're a molecular geneticist and are interested in collaborating, e-mail me!

I still believe in empirical research. However, I'm also well aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To borrow an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

That said, if you want me to study your query logs... I'd be happy to do it. After all, I need those publications to graduate.

Am I wrong? I'm interested to hear your thoughts; tell me in the comments.