Tuesday, October 20

Why I Don't Want Your Search Log Data

The IR field is largely driven by empirical experiments to validate theory. Today, one of the biggest perceived problems is that academia does not have access to the query and click log data collected by large web search engines. While this data is critical for improving a search service and useful for other interesting experiments, ultimately I believe it would lead to researchers being distracted by the wrong problems.

Data is limiting. Once you have it, you immediately start analyzing it and developing methods to improve relevance: for example, identifying navigational queries using click entropy (a rough sketch follows below), or applying supervised machine learning to rank documents and weight features. These are important and practical things to do if you run a service, but they aren't the fundamental problems that require research.
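
To make the click-entropy idea concrete, here's a minimal Python sketch. The toy click log, the 1.0 entropy threshold, and the navigational/informational labels are my own illustrative assumptions, not taken from any real search engine's data.

    from collections import defaultdict
    from math import log2

    # Hypothetical click log of (query, clicked_url) pairs.
    click_log = [
        ("facebook", "facebook.com"),
        ("facebook", "facebook.com"),
        ("facebook", "facebook.com/login"),
        ("jaguar", "en.wikipedia.org/wiki/Jaguar"),
        ("jaguar", "jaguar.com"),
        ("jaguar", "jaguarusa.com"),
    ]

    def click_entropy(log):
        """Entropy of the click distribution for each query."""
        counts = defaultdict(lambda: defaultdict(int))
        for query, url in log:
            counts[query][url] += 1
        entropies = {}
        for query, urls in counts.items():
            total = sum(urls.values())
            entropies[query] = -sum(
                (c / total) * log2(c / total) for c in urls.values()
            )
        return entropies

    # Low click entropy (clicks concentrated on one URL) suggests a
    # navigational query; higher entropy suggests informational intent.
    for query, h in click_entropy(click_log).items():
        label = "navigational?" if h < 1.0 else "informational?"
        print(f"{query}: entropy={h:.2f} -> {label}")

On the toy log this flags "facebook" as likely navigational and "jaguar" as informational, which is exactly the kind of useful-but-incremental analysis I have in mind.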

The IR community has its own data: TREC. TREC deserves credit for driving significant improvements in IR technology in the early and mid 90s. However, it too can be limiting. For many in academia, success and failure are measured by TREC retrieval performance. Too often, a researcher expends superhuman effort on incremental improvements to well-studied corpora that won't make a significant long-term contribution to the field. What's missing are the big leaps: disruptive innovation.

Academia should be building solutions for tomorrow's data, not yesterday's.

What will queries and documents look like in 5 or even 10 years, and how can we improve retrieval for them? It's not an easy question to answer, but you can watch Bruce Croft's CIKM keynote for some ideas. Without going into too much detail, also consider trends like cross-language retrieval, structured data, and search from mobile phones.

One proven pattern is that breakthroughs often come from synthesizing a model from a radically different domain. One intriguing recent direction is Keith van Rijsbergen's work on The Geometry of Information Retrieval, which applies models from quantum mechanics to describe document retrieval. Similarly, is there potential for models of information derived from molecular genetics and other fields? If you're a molecular geneticist and are interested in collaborating, e-mail me!

I still believe in empirical research. However, I'm also well-aware that over-reliance on limited data can lead to overfitting and incremental changes instead of ground-breaking research. To use an analogy from Wall Street, we become too focused on quarterly paper deadlines and lose sight of the fundamental science.

That said, if you want me to study your query logs... I'd be happy to do it. After all, I need those publications to graduate.

Am I wrong? I'm interested to hear your thoughts, tell me in the comments.

10 comments:

  1. When you say that "success and failure are measured by TREC retrieval performance", you have it somewhat backwards. In IR, success and failure are things you should be able to measure. If you can't measure it, you don't know if it works or not. If the data you have is not appropriate, you should gather better data. TREC is above all a framework for helping a community build data.

  2. Ian, thanks for the comment. We agree on the need to measure success with retrieval effectiveness.

    TREC collections are often legacy constraints: we use existing collections because they are convenient rather than because they are the most appropriate or compelling. I've fallen into this trap myself, using blog opinion data when the task we were really attempting was new and there was no appropriate collection. I think in the end it weakened the research, and we got distracted by the nuances of blog retrieval.

    I also meant success or failure in the broader academic sense. A reliable formula for publishing a paper is to extend an existing method and show that your new method gives a small but "significant" gain in retrieval effectiveness on well-studied TREC collections. We do this instead of tackling the hard scientific problems with truly innovative methods.

    I'm encouraged that ClueWeb09 pushes the boundaries of scale. However, the corpus is a vanilla web crawl, and it's almost two orders of magnitude smaller than the full web used by commercial web engines. I'd like to see a focus on creating collections that are more unique and forward-looking: blending heterogeneous structured data with text, real-time documents, and decision-making tasks that go beyond ad-hoc retrieval.

  3. Jeff, do you believe in sampling theory?

  4. Nothing stops you from creating those collections, or even proposing how TREC might create those collections. There is even lots of new research on how to make collections more easily and cheaply.

    No one should incorrectly reuse an old collection to solve a new problem that it doesn't fit, then get a micro improvement in MAP which isn't comparable to any other work. We shouldn't publish that kind of research.

  5. Iadh, what do you mean when you say sampling theory? I'm familiar with statistical sampling... but I don't think that's what you mean.

    Ian - You're right. It's something I'd be interested in discussing more.

    I agree the work shouldn't be published. I'm simply saying that the path of least resistance is to use a known collection with a familiar task, which restricts your thinking about problems to more traditional (boring) paths.

  6. I totally agree with you, Jeff. However, I also realize that the whole academic way of publishing would collapse if we didn't have those datasets. Incremental research is the way to go, at least if you want to graduate within the next couple of years. With known datasets we don't have to start from scratch every time, which saves a lot of time in the engineering process. But we should still keep these great ideas in the back of our minds and work on them as 'fun projects' as we get time...

  7. One of the issues facing academic researchers in IR is the paucity of data on standard search tasks, compared to the enormous volumes of data available to operational search engines. Similarly, operational systems are able to perform large-volume live testing, which is essentially unavailable to academic researchers. So, whether or not useful innovative research can be done with existing data sets such as query logs, it is certainly true that academic researchers are going to lag far behind what commercial labs can achieve. I therefore agree with you that we would be better off concentrating on new ideas rather than new uses of old data.

  8. It is the same thing. Do you believe in statistical sampling? Your comment on ClueWeb09 suggests that you are only interested in samples that are the same size as the population you are studying, which is questionable to say the least. Unless you have evidence that the ClueWeb09 collection is not statistically representative?

  9. Michael B.

    I think equating query logs with incremental research is a stretch. There is plenty of work on query logs that is incremental, of course, but there is also plenty of "disruptive" research that query logs helped create.

    Other than that, I agree with the basic premise of your article.

  10. Iadh, my point is that academia won't do web search better than web search engines. In that respect, if we're using web data we should be performing novel tasks or focusing on efficiency.

    Michael, I'm not saying that query log research is incremental. However, academia is obsessed with it and it doesn't solve many of the fundamental problems we should be focusing on.
