Data is limiting. Once you have it, you immediately start analyzing it and developing methods to improve relevance, for example by identifying navigational queries using click entropy. You can also apply supervised machine learning to rank documents and weight features. These are important and practical things to do if you run a service, but they aren't the fundamental problems that require research.
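To make the click entropy idea concrete, here is a minimal sketch. It assumes a query log reduced to (query, clicked URL) pairs and computes, per query, the entropy of the click distribution over URLs. A query whose clicks concentrate on a single URL has entropy near zero, which suggests it is navigational. The function name, the toy log, and the 1.0-bit threshold are all my own illustrative assumptions, not values from any real system.

```python
import math
from collections import Counter, defaultdict


def click_entropy(clicks):
    """Return {query: entropy in bits} for an iterable of (query, url) pairs."""
    per_query = defaultdict(Counter)
    for query, url in clicks:
        per_query[query][url] += 1

    entropies = {}
    for query, url_counts in per_query.items():
        total = sum(url_counts.values())
        # H(q) = sum over URLs of p(u|q) * log2(1 / p(u|q))
        entropies[query] = sum(
            (count / total) * math.log2(total / count)
            for count in url_counts.values()
        )
    return entropies


# Hypothetical cutoff: below this many bits, treat the query as navigational.
NAV_THRESHOLD_BITS = 1.0

# Toy log, invented for illustration only.
log = [
    ("facebook", "facebook.com"),
    ("facebook", "facebook.com"),
    ("facebook", "facebook.com"),
    ("facebook", "facebook.com"),
    ("python tutorial", "docs.python.org"),
    ("python tutorial", "realpython.com"),
    ("python tutorial", "w3schools.com"),
    ("python tutorial", "learnpython.org"),
]

entropies = click_entropy(log)
navigational = [q for q, h in entropies.items() if h < NAV_THRESHOLD_BITS]
```

In this toy log every click for "facebook" lands on one URL, so its entropy is zero, while the clicks for "python tutorial" spread evenly over four URLs, giving two bits. A real log would need smoothing and a minimum click count per query before a threshold like this means anything.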
The IR community has its own data: TREC. TREC deserves credit for driving significant improvements in IR technology in the early and mid-90s. However, it too can be limiting. For many in academia, success and failure are measured by TREC retrieval performance. Too often, a researcher makes a superhuman effort to squeeze incremental improvements out of well-studied corpora, improvements that won't make a significant long-term contribution to the field. What's missing are the big leaps: disruptive innovation.
Academia should be building solutions for tomorrow's data, not yesterday's.
What will the queries and documents look like in 5 or even 10 years, and how can we improve retrieval for them? It's not an easy question to answer, but you can watch Bruce Croft's CIKM keynote for some ideas. Without going into too much detail, also consider trends like cross-language retrieval, structured data, and search from mobile phones.
One proven pattern is that breakthroughs often come from adapting a model from a radically different domain. One intriguing recent direction is Keith van Rijsbergen's work in The Geometry of Information Retrieval, which applies models from quantum mechanics to describe document retrieval. Similarly, is there potential for models of information derived from molecular genetics and other fields? If you're a molecular geneticist and are interested in collaborating, e-mail me!
I still believe in empirical research. However, I'm also well aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.
That said, if you want me to study your query logs... I'd be happy to do it. After all, I need those publications to graduate.
Am I wrong? I'm interested to hear your thoughts, so tell me in the comments.