Thursday, May 29

June 2008 SIGIR Forum Highlights

The semi-annual SIGIR Forum issue for June 2008 is now online.

It has extensive coverage of ECIR, so if you missed the conference you can catch up on much of the relevant information.

(Somewhat) Grand Challenges for Information Retrieval, Nick Belkin's ECIR keynote address, which I've covered in the past.

AIR 2006: First International Workshop on Adaptive Information Retrieval -- an older workshop on adaptive retrieval (based on users). Really interesting topic; see the workshop website.

Beyond Bags of Words: Effectively Modeling Dependence and Features in Information Retrieval, Donald Metzler's dissertation abstract. And of course, if you find that interesting you can always read the entire thing: full thesis.

Wednesday, May 28

The most important research question in blog search

Tonight I watched the talk on the TREC Blog Track that I mentioned yesterday.
Social media isn't about search, it's about content creation and interaction. What is the role that search has to play here?
In the conclusion Ian describes possible future directions of the blog track:
  • Following a story as different bloggers discuss it; tracking a story
  • Follow discussion of a single infamous blog post
  • Adaptive filtering of blogs
  • Automatic tagging of blog posts
To me, the first two directions sound like Topic Detection and Tracking (TDT) at different levels of granularity. Have blogs been used as a corpus for TDT tasks? I'm not too familiar with this area of research.

Another really interesting part is the Q&A. Specifically, Ian's response to the last question:
Search is a tool, not a task... Within the evaluation paradigm, I don't care how you find the ranked list of stuff... The problem of identifying the user task, what is the user trying to really do that we are abstracting and operationalizing into something you can measure in a lab setting. It's a critical question. It's something we have a standard of operationalizing; we have a standard way of making this an experiment in IR. This is how we have done search evaluation for a long time. So, we tend to try and cast problems in this way. But, one of the research questions, the most important research question is: how do you think about what people are actually doing and then how do you make this into something we can measure? This is what I am really interested in.
This sounds a lot like the discussion that Nick surfaced at ECIR, see previous discussion here and by Daniel.

One step towards measurement is to correlate search interaction log data with behavior observed in field studies. If you want to get started in this area I highly recommend "What people think about when searching" [pdf presentation, mp3 podcast] by Dan Russell from Google's search quality group, given at Marti Hearst's SIMS i141 class. It's one of the best presentations on HCIR I've heard in a long-time.

Tuesday, May 27

ICWSM: International Conference on Weblogs and Social Media 2008 videos are online

The ICWSM 2008 videos are online, via Matthew Hurst.

I couldn't attend the conference so I am keen on catching up. In my queue:

Any other must-watch videos from those who attended? Which are your favorites and why?

Another one which I finally found online is Machine Reading at Web Scale by Oren Etzioni from WSDM 2008.


World changing research opportunities in Information Retrieval and beyond

Last night, I quoted Larry Page's description of Tesla and research with vision for changing the world. Daniel called me out and asked me to elaborate; here's what I came up with. I see the following as broad areas holding practical challenges for information retrieval and Human Language Technologies that require basic research. This is a big-picture view; these ideas need to be fleshed out in more detail to turn into real and interesting research.

The Rise of Information Globalization
One of the biggest challenges of the 21st century will be the developing world joining the information economy. Asia now has more than twice as many Internet users as America, yet only a 15% penetration rate. There are a billion people in Africa, and penetration there is only 5.3% (World Internet Stats). A big challenge is getting these users online, as well as providing for their more basic non-technical needs [a must-read is The End of Poverty by Jeffrey Sachs]. These users will come online mostly through mobile phones and similar devices, and this will dramatically shape their online experience. See Vint Cerf's presentation, Tracking the Internet into the 21st century.

We need a reliable way to make our information accessible to those in developing countries, and our survival in the new economy will depend on our ability to access and provide information in foreign languages. This is hard. Beyond that, if you want a BHAG: create a real-time Universal Translator that covers the major world languages by 2025. Even something close would be world-changing.

One simple example of the implications of what's possible today in cross-language retrieval is Google's Translated Search. Search in your language, translate the query to a foreign language, and then return the translated results in your language. It's clearly a first-generation product; hopefully we will have much better tools soon. Cross-Language IR should be an interesting area of research in the years to come.
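To make the query-translation idea concrete, here is a toy sketch of the pipeline. Everything in it is invented for illustration: the bilingual dictionary, the "index" of Spanish documents, and the word-by-word translation (a stand-in for real statistical MT).

```python
# Toy sketch of cross-language retrieval via query translation.
# Dictionary, documents, and the translation method are all invented.

EN_TO_ES = {"world": "mundo", "cup": "copa", "news": "noticias"}
ES_TO_EN = {v: k for k, v in EN_TO_ES.items()}

SPANISH_DOCS = [
    "noticias de la copa del mundo",
    "recetas de cocina",
]

def translate(text, dictionary):
    """Word-by-word dictionary translation (a stand-in for real MT)."""
    return " ".join(dictionary.get(w, w) for w in text.split())

def cross_language_search(query_en):
    query_es = translate(query_en, EN_TO_ES)               # 1. translate the query
    hits = [d for d in SPANISH_DOCS                        # 2. search the foreign index
            if all(t in d.split() for t in query_es.split())]
    return [translate(d, ES_TO_EN) for d in hits]          # 3. translate results back

results = cross_language_search("world cup news")
```

Even this toy shows where the hard problems live: translation ambiguity at step 1 and readable result translation at step 3.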

Accessing the past
Although we are now creating vast amounts of digital information, some of the most valuable is still analog and unstructured. The book scanning projects of Google, the Open Content Alliance, and others are a start, but there is a very long way to go. Imagine the entire Bodleian library at your fingertips, from any research institution on the planet. Now imagine the ability to search and process it with text processing algorithms. The potential for research here is astounding.

Using this information for text mining and retrieval at scale will also require robust OCR and handwriting recognition software capable of recognizing ancient handwriting, or we need to find other creative solutions. One example of such a creative solution is the retrieval of George Washington's handwritten letters at UMass. How can search and other Human Language Technologies lead to exciting next generation digital libraries, such as the Perseus Digital Library?

The rise of [semi] structured data
While most of the web is still unstructured information, like this blog post, that's changing. The future is structured, or at least partially structured, information. Current information extraction techniques, such as Fetch Technologies' Agent Platform or UWashington's TextRunner, can turn unstructured information into structured information, or even fuse information to create 'knowledge'.

The challenge here for IR systems is the seamless combination of keyword search with structured search. Faceted search, like that provided by Solr and Endeca, is a step in the right direction. Next-gen systems might automatically convert keyword queries into queries containing structured as well as unstructured components. See some of my previous posts on this topic (A database of everything, and IBM Avatar).
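As a toy sketch of what converting a keyword query into mixed structured/unstructured components might look like: the regex, the product "database", and the prices below are all invented for illustration.

```python
import re

# Invented example: split a free-text query into keyword terms plus a
# structured price filter, then apply both against a toy product list.

PRODUCTS = [
    {"name": "canon powershot camera", "price": 250},
    {"name": "canon dslr camera", "price": 900},
]

def parse_query(q):
    """Pull a 'under $N' constraint out of the query; the rest are keywords."""
    m = re.search(r"under \$?(\d+)", q)
    max_price = int(m.group(1)) if m else None
    keywords = re.sub(r"under \$?\d+", "", q).split()
    return keywords, max_price

def search(q):
    keywords, max_price = parse_query(q)
    return [p["name"] for p in PRODUCTS
            if all(k in p["name"] for k in keywords)                  # unstructured match
            and (max_price is None or p["price"] <= max_price)]       # structured filter

hits = search("canon camera under $300")
```

A real system would need much richer query understanding than one regex, but the shape of the problem, keywords plus recovered structure, is the same.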

Two basic pieces of structure that today's search systems could do more with are temporality and geography (locality). Most of the information on the web today is young, but as it ages preserving and searching this information is critical. Many of my searches have a local and/or temporal nature, however, the tools for running these searches are very primitive.
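On the temporal side, one simple thing a system could do is apply a recency prior to its ranking. Here's a minimal sketch: multiply a text-match score by an exponential decay over document age. The scores, dates, and half-life are all invented for illustration.

```python
import math
from datetime import date

# Invented example: boost fresh documents with an exponential recency decay.

def recency_score(text_score, doc_date, today, half_life_days=30.0):
    """Halve the contribution of a document every half_life_days of age."""
    age_days = (today - doc_date).days
    return text_score * math.exp(-math.log(2) * age_days / half_life_days)

today = date(2008, 5, 29)
docs = [("old post", 1.0, date(2008, 2, 1)),     # strong match, but stale
        ("fresh post", 0.8, date(2008, 5, 27))]  # weaker match, two days old

ranked = sorted(docs, key=lambda d: recency_score(d[1], d[2], today),
                reverse=True)
```

Here the fresh post outranks the stronger-matching stale one; whether that's right obviously depends on the query, which is exactly why temporality deserves more than a fixed decay.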

Adaptive 'Autonomic' search systems
Today, building a world-class search engine is a very hard challenge. One of the most difficult parts is ranking. I have a vision of a search system that can actively use interaction information: click data, browsing behavior, explicit user feedback, etc... and improve ranking automatically with little or no developer involvement. Building such a system for the web would be especially hard because of the adversarial nature of search, but it might be easier in more controlled environments.
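To give a flavor of how click data could drive ranking, here is a minimal perceptron-style sketch: when a user clicks a result ranked below one they skipped, nudge the scoring weights toward the clicked document's features. The feature vectors and learning rate are invented; real learning-to-rank methods are far more sophisticated.

```python
# Invented example: online weight update from a click preference
# (clicked result should outscore the skipped, higher-ranked result).

def update(weights, clicked_feats, skipped_feats, lr=0.1):
    """Move weights toward the clicked doc's features, away from the skipped."""
    return [w + lr * (c - s)
            for w, c, s in zip(weights, clicked_feats, skipped_feats)]

def score(weights, feats):
    return sum(w * f for w, f in zip(weights, feats))

weights = [0.0, 0.0]
clicked = [1.0, 0.2]   # features of the result the user clicked
skipped = [0.3, 0.9]   # features of the higher-ranked result they skipped

for _ in range(20):    # repeated feedback gradually flips the ordering
    weights = update(weights, clicked, skipped)
```

Even this toy shows the appeal: no developer touched the weights, the interaction data did. The adversarial-web caveat above is real, though; click spam would feed this loop just as happily.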

Machine learning algorithms are already being used to tune the ranking of the top search engines, see the papers at the Learning to Rank for IR workshop in 2007 and the upcoming 2008 workshop. The techniques described could be the beginning of systems capable of adapting in real-time. I know, it's a long way off, but that's why it's research.

How do our current algorithms scale? One of the underlying challenges for translation, extraction, and the other improvements I've outlined above is scale, in both storage and processing capability. To really make progress you need large quantities of noisy data. One thing that I admire about Larry's work is that he strained the existing systems and created Backrub to scale to the entire web. Today, accomplishing a similar feat means building systems and algorithms that can operate on a large distributed cluster using Hadoop, Amazon S3, and similar distributed processing systems. This is one of my biggest issues with a lot of academic research: it hasn't been tested at anything resembling real-world scale.
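For readers who haven't worked with Hadoop-style systems, the core pattern is simple enough to simulate in-process. This sketch runs the classic word-count example: mappers emit (term, 1) pairs per document, reducers sum by key; on a real cluster each phase runs in parallel across machines. The documents are invented.

```python
from collections import defaultdict

# In-process simulation of the MapReduce pattern (word count).

def map_phase(doc):
    """Mapper: emit a (term, 1) pair for every token in the document."""
    return [(term, 1) for term in doc.split()]

def reduce_phase(pairs):
    """Reducer: sum the counts for each term across all documents."""
    counts = defaultdict(int)
    for term, n in pairs:
        counts[term] += n
    return dict(counts)

docs = ["to be or not to be", "to search is to find"]

# On a cluster, the map calls run in parallel and the framework groups
# pairs by key before the reduce; here we just flatten and reduce once.
pairs = [p for doc in docs for p in map_phase(doc)]
term_counts = reduce_phase(pairs)
```

The reason this pattern scales is that neither phase needs global state: mappers see one document, reducers see one key's pairs, so both shard cleanly across a cluster.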

Future Fodder
I haven't talked about personalization, recommendations, topic alerts, social search, real-world user behavior and their search tasks, or the potential of the Semantic Web. More on those topics and IR at a later time, when it's not so late.

If you want other people's opinions about challenges in IR, you can read my previous post: Information Retrieval research challenges for 2008 and beyond.

An important lesson on research from Tesla

I was re-reading "The Search" by John Battelle over the weekend. I ran across some key insights inspired by the lesson of Nikola Tesla. Larry Page talks about the way Tesla inspired him:
"He had all these problems commercializing his work. It's a very sad story. I realized Tesla was the greatest inventor, but he didn't accomplish as much as he should have. I realized I wanted to invent things, but I also wanted to change the world. I wanted to get them out there, get them into people's hands so they can use them, because that's what really matters."
Speaking about Backrub, one of the first backlink graphs of the web, Page continues,
"My goals were to work on something that would be academically real and interesting. But there is no reason if you are doing academic work to work on things that are impractical. I wanted both, and I didn't think there was much of a trade-off to be made. I figure if I ended up building something that was going to potentially benefit a lot of people...then I would be open to commercializing it, so that I wouldn't be like Tesla."
This really struck me because I perceive much of the research in academia to be unrealistic, with little real-world potential and often a lack of desire from researchers to turn their work into anything really applicable. While I doubt I will ever build something as significant as Google, I share Page's desire to do something both academically interesting and with the potential to benefit a lot of people.

I'm now looking for interesting research topics in this vein, if you have ideas, let's talk!

Monday, May 26

Steve Green's Minion talk at Harvard and first experiences with the source

Over the holiday weekend, I watched the talk on Minion that Steve gave at Harvard. I expected more technical detail about Minion, but instead it was more an overview of IR and related applications: past, present, and future. Here are a few things I took away from the talk:

Steve talked about TREC and the fact that the test collections are static, whereas real-world collections evolve. A potentially interesting avenue for research is dynamic test collections. It would be interesting to model how often documents change and are created, and the impact this has on precision and recall. Another interesting problem is that very recent documents don't have the same links or link text that older documents have. How should this be modeled for relevance?
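A tiny simulation makes the static-collection problem concrete: if relevant documents are created after the relevance judgments are frozen, measured effectiveness against the old qrels diverges from the truth. All document IDs and sets below are invented.

```python
# Invented example: why frozen relevance judgments decay as a collection
# evolves. Unjudged relevant documents are scored as non-relevant.

judged_relevant = {"d1", "d2", "d3"}      # relevant docs at judging time
new_relevant = {"d4", "d5"}               # relevant docs created afterwards
retrieved = {"d1", "d2", "d4", "d5"}      # a system's output later on

# Against the frozen qrels, d4 and d5 count against the system...
measured_precision = len(retrieved & judged_relevant) / len(retrieved)

# ...but against the evolved truth, the result set is perfect.
true_relevant = judged_relevant | new_relevant
true_precision = len(retrieved & true_relevant) / len(retrieved)
```

Here measured precision is 0.5 while true precision is 1.0: the static judgments make a good system look mediocre, which is exactly the kind of effect a dynamic test collection would need to model.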

He also talked about personalization in search and the use of Minion for content-based recommendation systems.

Personalization in search is a hot topic right now. How long are queries useful? Which interests have local temporality (e.g., researching a trip) vs. longer-term preference (say, software engineering or cooking)?

Steve spent quite a bit of time talking about collaborative filtering and recommendation engines. One memorable quote was, "recommendation is the new search." He talked about using Minion for the Aura recommendation engine. Paul Lamere used Minion to perform content-based similarity using the tags for a music collection. Their system using Minion was the best in their test.

One of the questions at the end was about Minion vs. Lucene. Steve has written about this on his blog, but I found his brief answers informative:
  • Support for data types beyond String that enable parameter query operations on fields, for example date and numeric values
  • Minion has an English morphology engine to generate different word forms for query expansion out of the box.
  • Minion has a run-time configuration system configured with XML files, while Lucene is configured in code.
A good video, overall.

After listening to the video, I downloaded the Minion source code and started poking around. I encountered a few minor hiccups. I couldn't find developer documentation, so I just went for it. First, I use Eclipse and the development team appears to use NetBeans, so I think I am hitting some platform issues. The Ant build script failed because JUnit was missing from the classpath. The normal Eclipse build fails because the JavaCC-generated parser classes are not present; they are built by the Ant script. I managed to get the Ant script to work and build a jar file so that the rest of the project compiled.

One thing that would be really useful is a good set of examples. How does the XML run-time configuration system work? Does Minion support document boosting similar to Lucene?

When I get a bit more time I'll give it more of a shot on some data I have lying around.