Tuesday, May 27

World changing research opportunities in Information Retrieval and beyond

Last night, I quoted Larry Page's description of Tesla and research with vision for changing the world. Daniel called me out and asked me to elaborate, so here's what I came up with. I see the following as broad areas holding practical challenges for information retrieval and Human Language Technologies that require basic research. This is a big-picture view; these areas need to be fleshed out in more detail to turn into real and interesting research.

The Rise of Information Globalization
One of the biggest challenges of the 21st century will be the developing world joining the information economy. Asia now has more than twice as many Internet users as America and only a 15% penetration rate. There are a billion people in Africa and the penetration there is only 5.3% (World Internet Stats). A big challenge is getting these users online, as well as providing for their more basic non-technical needs [a must read is The End of Poverty by Jeffrey Sachs]. The way these users come online will mostly be through mobile phones and similar devices, and this will dramatically shape their online experience. See Vint Cerf's presentation, Tracking the Internet into the 21st century.

We need a reliable way to make our information accessible to those in developing countries, and our survival in the new economy will depend on our ability to access and provide information in foreign languages. This is hard. Beyond that, if you want a BHAG: create a real-time Universal Translator that covers the major world languages by 2025. Even something close would be world-changing.

One simple example of what's possible today in cross-language retrieval is Google's Translated Search. Search in your language, translate the query to a foreign language, and then return the translated results in your language. It's clearly a gen one product; hopefully we will have much better tools soon. Cross-Language IR should be an interesting area of research in the years to come.

Accessing the past
Although we are now creating vast amounts of digital information, some of the most valuable is still analog and unstructured. The book scanning projects of Google, the Open Content Alliance, and others are a start, but there is a very long way to go. Imagine the entire Bodleian library at your fingertips, from any research institution on the planet. Now imagine the ability to search and process it with text processing algorithms. The potential for research here is astounding.

Using this information for text mining and retrieval at scale will also require robust OCR and handwriting recognition software capable of recognizing ancient handwriting, or we need to find other creative solutions. One example of such a creative solution is the retrieval of George Washington's handwritten letters at UMass. How can search and other Human Language Technologies lead to exciting next generation digital libraries, such as the Perseus Digital Library?

The rise of [semi] structured data
While most of the web is still unstructured information, like this blog post, that's changing. The future is structured, or at least partially structured, information. Current information extraction techniques, such as Fetch Technologies' Agent Platform or UWashington's TextRunner, can turn unstructured information into structured information, or even perform information fusion to create 'knowledge'.

The challenge here for IR systems is the seamless combination of keyword search with structured search. Faceted search systems, like those provided by Solr and Endeca, are steps in the right direction. Next-gen systems might automatically convert keyword queries into queries containing structured as well as unstructured components. See some of my previous posts on this topic (A database of everything, and IBM Avatar).
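
To make the idea of splitting a keyword query into structured and unstructured parts concrete, here is a minimal sketch. The facet vocabulary, the product records, and the function names are all hypothetical; this is not how Solr or Endeca work internally, just one naive way to do the conversion.

```python
# Hypothetical facet vocabulary; a real system would derive it from its index.
FACET_VALUES = {
    "brand": {"canon", "nikon"},
    "color": {"black", "silver"},
}

# Made-up product records with a free-text title plus structured fields.
PRODUCTS = [
    {"title": "canon digital camera", "brand": "canon", "color": "black"},
    {"title": "nikon digital camera", "brand": "nikon", "color": "black"},
    {"title": "canon camera bag", "brand": "canon", "color": "silver"},
]


def parse_query(query):
    """Split a keyword query into facet filters and leftover free-text terms."""
    filters, keywords = {}, []
    for term in query.lower().split():
        for facet, values in FACET_VALUES.items():
            if term in values:
                filters[facet] = term
                break
        else:
            # The term matched no known facet value, so keep it as a keyword.
            keywords.append(term)
    return filters, keywords


def search(query):
    """Apply the facet filters exactly, then require every keyword in the title."""
    filters, keywords = parse_query(query)
    results = []
    for item in PRODUCTS:
        if any(item.get(facet) != value for facet, value in filters.items()):
            continue
        title_terms = set(item["title"].split())
        if all(keyword in title_terms for keyword in keywords):
            results.append(item["title"])
    return results


filters, keywords = parse_query("black canon camera")
matches = search("black canon camera")
```

The obvious weakness is ambiguity: a word like "black" could be a color or part of a product name, so a serious system would need to weigh competing interpretations rather than commit to the first facet that matches.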

Two basic pieces of structure that today's search systems could do more with are temporality and geography (locality). Most of the information on the web today is young, but as it ages, preserving and searching this information will be critical. Many of my searches have a local and/or temporal nature; however, the tools for running these searches are very primitive.
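
As one illustration of using temporal structure in ranking, the sketch below blends a topical relevance score with an exponential recency decay. The half-life, the mixing weight, and the document fields are assumptions chosen for the example, not values taken from any real system.

```python
import math


def blended_score(relevance, age_days, half_life_days=30.0, recency_weight=0.5):
    """Mix a topical relevance score in [0, 1] with an exponential recency decay.

    The recency term is 1.0 for a brand-new document and halves every
    half_life_days. Both parameters are illustrative assumptions.
    """
    recency = math.pow(0.5, age_days / half_life_days)
    return (1.0 - recency_weight) * relevance + recency_weight * recency


def rank(docs, **kwargs):
    """Sort documents by blended score, best first."""
    return sorted(
        docs,
        key=lambda doc: blended_score(doc["relevance"], doc["age_days"], **kwargs),
        reverse=True,
    )


docs = [
    {"id": "old-but-relevant", "relevance": 0.9, "age_days": 365},
    {"id": "fresh", "relevance": 0.6, "age_days": 1},
]

time_sensitive = [doc["id"] for doc in rank(docs, recency_weight=0.5)]
time_insensitive = [doc["id"] for doc in rank(docs, recency_weight=0.0)]
```

With the weight at 0.5 the fresh document wins; with the weight at zero the older, more relevant one does. Choosing that weight per query, that is, deciding whether a query is time sensitive at all, is the part this sketch does not address.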

Adaptive 'Autonomic' search systems
Today, building a world-class search engine is a very hard challenge. One of the most difficult parts is ranking. I have a vision of a search system that can actively use interaction information (click data, browsing behavior, explicit user feedback, etc.) and improve ranking automatically with little or no developer involvement. Building such a system for the web would be especially hard because of the adversarial nature of search, but it might be easier in more controlled environments.

Machine learning algorithms are already being used to tune the ranking of the top search engines; see the papers at the Learning to Rank for IR workshop in 2007 and the upcoming 2008 workshop. The techniques described could be the beginning of systems capable of adapting in real-time. I know, it's a long way off, but that's why it's research.
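
For a feel of what "improving ranking from interaction data" could mean at its very simplest, here is a sketch that nudges a static score with a smoothed click-through rate. The class name, the smoothing priors, and the additive combination are my own assumptions for illustration; it is far cruder than the learning-to-rank methods in those workshop papers, and it ignores position bias and click spam entirely.

```python
from collections import defaultdict


class ClickAdaptiveRanker:
    """Re-rank documents for one query by adding a smoothed click-through rate.

    Hypothetical illustration only: base_scores maps doc id to a static score,
    and the priors act as pseudo-counts so unseen documents start at 10% CTR.
    """

    def __init__(self, base_scores, click_weight=1.0, prior_clicks=1.0, prior_views=10.0):
        self.base_scores = base_scores
        self.click_weight = click_weight
        self.prior_clicks = prior_clicks
        self.prior_views = prior_views
        self.clicks = defaultdict(float)
        self.views = defaultdict(float)

    def _ctr(self, doc_id):
        """Smoothed click-through rate for a document."""
        return (self.clicks[doc_id] + self.prior_clicks) / (self.views[doc_id] + self.prior_views)

    def rank(self):
        """Return doc ids ordered by static score plus weighted CTR, best first."""
        return sorted(
            self.base_scores,
            key=lambda doc_id: self.base_scores[doc_id] + self.click_weight * self._ctr(doc_id),
            reverse=True,
        )

    def record(self, shown, clicked):
        """Log one result page: which documents were shown and which were clicked."""
        for doc_id in shown:
            self.views[doc_id] += 1
        for doc_id in clicked:
            self.clicks[doc_id] += 1


ranker = ClickAdaptiveRanker({"a": 0.5, "b": 0.45})
before = ranker.rank()
for _ in range(20):
    ranker.record(shown=["a", "b"], clicked=["b"])
after = ranker.rank()
```

Here document "a" starts on top, and after twenty sessions in which users consistently click "b", the order flips without anyone touching the ranking function. On the open web, the same mechanism is an invitation to click fraud, which is exactly the adversarial problem mentioned above.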

How do our current algorithms scale? One of the underlying challenges to translation, extraction, and the other improvements I've outlined above is scale, both in storage and processing capability. To really make progress you need large quantities of noisy data. One thing that I admire about Larry's work is that he strained the existing systems and created Backrub to scale to the entire web. Today, accomplishing a similar feat means building systems and algorithms that can operate on a large distributed cluster using Hadoop, Amazon S3, and similar distributed storage and processing systems. This is one of my biggest issues with a lot of academic research: it hasn't been tested at anything resembling real-world scale.

Future Fodder
I haven't talked about personalization, recommendations, topic alerts, social search, real-world user behavior and their search tasks, or the potential of the Semantic Web. More on those topics and IR at a later time, when it's not so late.

If you want other people's opinions about challenges in IR, you can read my previous post: Information Retrieval research challenges for 2008 and beyond.


  1. OK, now I can't complain about your lack of specificity. :)

    I'll admit that I don't get very excited about Cross-Language IR, but perhaps that's because I don't know what I'm missing. But, more generally, I suspect CLIR has the most near-term value for searchers who want access to English-language documents but are not fluent in English. Cultural imperialism does have its perks.

    Re accessing the past: the most promising development I've seen in this direction is reCAPTCHA. I wonder if anyone has used reCAPTCHA as training for machine learning approaches.

    Temporal/spatial search is interesting, but I'm not sure what the research problems are, beyond information extraction to correctly associate a document with a time and location.

    My reaction on adversarial search: I wish we could stop the arms race. I've ranted about this before.

    Finally, scale seems more like a consideration than a research area in and of itself--unless you decide to focus on the systems side of IR.

  2. Jeff -- great post. Good luck narrowing your focus :)

    Re. temporal/spatial problems -- this information is critical to news, blogs & local search, but we don't see many people giving space & time attention in academia. Research questions include: how can we tell a query is time or location sensitive? What are effective ways to trade off topical relevance & temporal recency when ranking news articles? How does modeling temporal bursts of topics in blogs affect retrieval?

    Part of academia's lack of focus on this is the difficulty in evaluating without a web scale system. For tractable evaluations, we treat our collections as static and our queries as independent of any temporal or local information (except for TDT). I've been involved in an effort to create a test collection for topic tracking over time, and coming up with reasonable information needs involving unfolding news stories is an extremely difficult task.

    Re. semi-structured data -- this goes well beyond just extraction of structured information from text. I'm looking at structure at a document and collection level. Many of our collections are *already* structured and, with the exception of hyperlinks, we don't know how to use a lot of this to help retrieval-- metadata like authorship & creation time, document tagging & classification, site & sub-site structure, thread structure in emails and online forums, and the list goes on.

  3. Daniel -

    I too have a hard time getting excited about cross-lingual IR. I'm more interested in the translation angle. Could IR tools be applied to improve translation?

    As far as scale, I think it's really important. As a researcher/developer wanting to change the world - yes I care about building real systems. It also has research implications. For example, Peter Norvig has given presentations on how, as collection size increases, many simple algorithms can outperform fancier techniques. In short, I want to make sure the research I develop actually matters to people building real systems and not waste my time on fancy algorithms that don't make a significant difference.

    Jon -
    Great insights, thanks for the comments. You hit the nail on the head with some of the temporal aspects.

    I'm very interested in your work developing a test collection for topic tracking. If there's anything I can do, I'd love to help.

    Structured data - You're right, we don't leverage the structure we already have to its fullest advantage.

    However, we also need to take advantage of the deeper structure. One of the interesting challenges is that for this structure to be useful you need to do domain-specific data normalization and data integration. Doing this in one domain is time-consuming; doing it across many domains at web-scale is really hard. One reason this has failed in the past is that extraction error + integration error + retrieval error leads to worse results than retrieval alone. Making progress requires digging in and getting your hands dirty across all levels, something that most people can't or won't do.

  4. "I'm very interested in your work developing a test collection for topic tracking. If there's anything I can do, I'd love to help."

    Sorry Jeff, that project is dead. You can read about it in Yiming Yang's SIGIR paper from last year. I only had a bit part to play in the corpus creation.