Last night, I quoted Larry Page's description of Tesla and research with a vision for changing the world. Daniel called me out and asked me to elaborate; here's what I came up with. I see the following as broad areas holding practical challenges for information retrieval and Human Language Technologies that require basic research. This is a big-picture view; these areas need to be fleshed out in more detail to turn into real and interesting research.
The Rise of Information Globalization
One of the biggest challenges of the 21st century will be the developing world joining the information economy. Asia now has more than twice as many Internet users as America and only a 15% penetration rate. There are a billion people in Africa and the penetration there is only 5.3% (World Internet Stats). A big challenge is getting these users online, as well as providing for their more basic non-technical needs [a must read is The End of Poverty by Jeffrey Sachs]. The way these users come online will mostly be through mobile phones and similar devices, and this will dramatically shape their online experience. See Vint Cerf's presentation, Tracking the Internet into the 21st century.
We need a reliable way to make our information accessible to those in developing countries, and our survival in the new economy will depend on our ability to access and provide information in foreign languages. This is hard. Beyond that, if you want a BHAG: create a real-time Universal Translator that covers the major world languages by 2025. Even something close would be world-changing.
One simple example of what's possible today in cross-language retrieval is Google's Translated Search: you search in your language, the query is translated into a foreign language, and the results are translated back into your language. It's clearly a gen-one product; hopefully we will have much better tools soon. Cross-Language IR should be an interesting area of research in the years to come.
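The three steps of translated search (translate the query, search the foreign-language collection, translate the results back) can be sketched roughly as below. This is only a toy illustration, not how Google's product works: the word-by-word dictionary translator, the term-overlap search, and all of the function names and lexicons are assumptions made up for the example.

```python
from typing import Callable, Dict, List

Translator = Callable[[str], str]
SearchFn = Callable[[str], List[str]]


def make_dictionary_translator(lexicon: Dict[str, str]) -> Translator:
    """Toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(lexicon.get(word, word) for word in text.lower().split())
    return translate


def make_keyword_search(documents: List[str]) -> SearchFn:
    """Toy search: rank documents by how many query terms they contain."""
    def search(query: str) -> List[str]:
        terms = set(query.lower().split())
        scored = [(len(terms & set(doc.lower().split())), doc) for doc in documents]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in ranked if score > 0]
    return search


def translated_search(query: str, to_foreign: Translator,
                      to_native: Translator, search: SearchFn) -> List[str]:
    """Translate the query, search in the foreign language, translate results back."""
    foreign_query = to_foreign(query)
    foreign_results = search(foreign_query)
    return [to_native(doc) for doc in foreign_results]


# Hypothetical English/Spanish lexicons and a two-document Spanish collection.
en_to_es = make_dictionary_translator({"cheap": "barato"})
es_to_en = make_dictionary_translator(
    {"barato": "cheap", "en": "in", "museo": "museum", "de": "of", "arte": "art"}
)
spanish_search = make_keyword_search(["hotel barato en madrid", "museo de arte"])

results = translated_search("cheap hotel madrid", en_to_es, es_to_en, spanish_search)
print(results)
```

Even this toy version shows where the hard research problems sit: the word-by-word translation keeps the foreign word order, and any term missing from the lexicon silently fails to match.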
Accessing the past
Although we are now creating vast amounts of digital information, some of the most valuable is still analog and unstructured. The book scanning projects of Google, the Open Content Alliance, and others are a start, but there is a very long way to go. Imagine the entire Bodleian library at your fingertips, from any research institution on the planet. Now imagine the ability to search and process it with text processing algorithms. The potential for research here is astounding.
Using this information for text mining and retrieval at scale will also require robust OCR and handwriting recognition software capable of recognizing ancient handwriting, or we need to find other creative solutions. One example of such a creative solution is the retrieval of George Washington's handwritten letters at UMass. How can search and other Human Language Technologies lead to exciting next generation digital libraries, such as the Perseus Digital Library?
The rise of [semi] structured data
While most of the web is still unstructured information, like this blog post, that's changing. The future is structured, or at least partially structured, information. Current information extraction techniques, such as Fetch Technologies' Agent Platform or UWashington's TextRunner, can turn unstructured information into structured information, or even fuse information from multiple sources to create 'knowledge'.
The challenge here for IR systems is the seamless combination of keyword search with structured search. Faceted search, like that provided by Solr and Endeca, is a step in the right direction. Next-gen systems might automatically convert keyword queries into queries containing structured as well as unstructured components. See some of my previous posts on this topic (A database of everything, and IBM Avatar).
Two basic pieces of structure that today's search systems could do more with are temporality and geography (locality). Most of the information on the web today is young, but as it ages, preserving and searching this information will become critical. Many of my searches have a local and/or temporal nature; however, the tools for running these searches are very primitive.
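To make the idea of converting a keyword query into structured and unstructured components a little more concrete, here is a minimal sketch. It assumes a tiny hand-built facet vocabulary and a simple year pattern; `KNOWN_FACETS`, `ParsedQuery`, and `parse_query` are hypothetical names invented for this illustration, and a real system would presumably learn its facets from the index or from extracted data rather than hard-code them.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Hypothetical facet vocabulary; a real system would derive this from its index.
KNOWN_FACETS = {
    "city": {"boston", "seattle", "amherst"},
    "category": {"restaurant", "hotel", "museum"},
}

# Four-digit years in the 1900s or 2000s are treated as a temporal filter.
YEAR_PATTERN = re.compile(r"^(19|20)\d{2}$")


@dataclass
class ParsedQuery:
    filters: Dict[str, Union[str, int]] = field(default_factory=dict)  # structured part
    keywords: List[str] = field(default_factory=list)                  # unstructured part


def parse_query(query: str) -> ParsedQuery:
    """Split a keyword query into facet filters and leftover free-text terms."""
    parsed = ParsedQuery()
    for token in query.lower().split():
        if YEAR_PATTERN.match(token):
            parsed.filters["year"] = int(token)
            continue
        for facet, values in KNOWN_FACETS.items():
            if token in values:
                parsed.filters[facet] = token
                break
        else:
            # No facet matched, so the token stays a plain keyword.
            parsed.keywords.append(token)
    return parsed


parsed = parse_query("cheap seafood restaurant boston 2008")
print(parsed.filters)   # structured component, usable as facet filters
print(parsed.keywords)  # unstructured component, sent to keyword search
```

The structured part could then be passed to a faceted engine as filters while the remaining keywords go to ordinary full-text ranking. Note that the sketch already picks out the two kinds of structure that matter most to me, a location and a date.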
Adaptive 'Autonomic' search systems
Today, building a world-class search engine is a very hard challenge. One of the most difficult parts is ranking. I have a vision of a search system that can actively use interaction information (click data, browsing behavior, explicit user feedback, etc.) and improve ranking automatically with little or no developer involvement. Building such a system for the web would be especially hard because of the adversarial nature of search, but it might be easier in more controlled environments.
Machine learning algorithms are already being used to tune the ranking of the top search engines; see the papers at the Learning to Rank for IR workshop in 2007 and the upcoming 2008 workshop. The techniques described could be the beginning of systems capable of adapting in real time. I know, it's a long way off, but that's why it's research.
How do our current algorithms scale?
One of the underlying challenges to translation, extraction, and the other improvements I've outlined above is scale, both in storage and processing capability. To really make progress you need large quantities of noisy data. One thing that I admire about Larry's work is that he strained the existing systems and created BackRub to scale to the entire web. Today, accomplishing a similar feat means building systems and algorithms that can operate on a large distributed cluster using Hadoop, Amazon S3, and similar distributed storage and processing systems. This is one of my biggest issues with a lot of academic research: it hasn't been tested at anything resembling real-world scale.
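As a very rough sketch of what "improving ranking from interaction information" could mean in the simplest case, the snippet below blends a base relevance score with a smoothed click-through rate per query and document. This is not a learning-to-rank algorithm and not any search engine's method; the class name, the blending weight, and the smoothing priors are all assumptions chosen for illustration, and the sketch ignores position bias and click spam entirely.

```python
from collections import defaultdict
from typing import List, Tuple


class ClickFeedbackRanker:
    """Re-rank results by mixing a base score with a smoothed click-through rate."""

    def __init__(self, click_weight: float = 0.5,
                 prior_clicks: float = 1.0, prior_impressions: float = 10.0):
        # All three parameters are illustrative assumptions, not tuned values.
        self.click_weight = click_weight
        self.prior_clicks = prior_clicks
        self.prior_impressions = prior_impressions
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def record(self, query: str, doc_id: str, clicked: bool) -> None:
        """Log one impression of doc_id for query, and whether it was clicked."""
        key = (query, doc_id)
        self.impressions[key] += 1
        if clicked:
            self.clicks[key] += 1

    def smoothed_ctr(self, query: str, doc_id: str) -> float:
        """Click-through rate with additive smoothing so new documents are not zeroed."""
        key = (query, doc_id)
        clicks = self.clicks.get(key, 0) + self.prior_clicks
        shown = self.impressions.get(key, 0) + self.prior_impressions
        return clicks / shown

    def rerank(self, query: str, scored_docs: List[Tuple[str, float]]) -> List[str]:
        """Order (doc_id, base_score) pairs by the blended score, best first."""
        def combined(item: Tuple[str, float]) -> float:
            doc_id, base_score = item
            ctr = self.smoothed_ctr(query, doc_id)
            return (1 - self.click_weight) * base_score + self.click_weight * ctr
        return [doc_id for doc_id, _ in sorted(scored_docs, key=combined, reverse=True)]


ranker = ClickFeedbackRanker()
candidates = [("doc_a", 0.60), ("doc_b", 0.55)]
before = ranker.rerank("digital libraries", candidates)

# Simulated feedback: users keep skipping doc_a and clicking doc_b.
for _ in range(20):
    ranker.record("digital libraries", "doc_a", clicked=False)
    ranker.record("digital libraries", "doc_b", clicked=True)

after = ranker.rerank("digital libraries", candidates)
print(before, after)
```

With no feedback the base scores decide the order; once the simulated users consistently prefer the second document, it moves to the top without anyone touching the ranking function. It also shows why the adversarial web setting is hard: anyone who can fake clicks can move results.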
I haven't talked about personalization, recommendations, topic alerts, social search, real-world user behavior and their search tasks, or the potential of the Semantic Web. More on those topics and IR at a later time, when it's not so late.
If you want other people's opinions about challenges in IR, you can read my previous post: Information Retrieval research challenges for 2008 and beyond.