Monday, May 26

Steve Green's Minion talk at Harvard and first experiences with the source

Over the holiday weekend, I watched the talk on Minion that Steve gave at Harvard. I expected more technical detail about Minion, but instead it was more of an overview of IR and related applications: past, present and future. Here are a few things I took away from the talk:

Steve talked about TREC and the fact that its test collections are static, whereas real-world collections evolve. A potentially interesting avenue for research is dynamic test collections. It would be interesting to model how often documents are created and changed, and the impact this has on precision and recall. Another interesting problem is that very recent documents don't have the same links or link text that older documents have. How should this be modeled for relevance?
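
To make the precision and recall point concrete, here is a minimal Python sketch of my own, not something from the talk. It scores the same result list against relevance judgments taken at two points in time. The document IDs and judgments are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a list of retrieved document IDs."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical snapshot: the index returns four documents, three are relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant_at_t0 = {"d1", "d2", "d3"}

# Later, two new relevant documents appear, but the index has not caught up.
relevant_at_t1 = relevant_at_t0 | {"d5", "d6"}

p0, r0 = precision_recall(retrieved, relevant_at_t0)
p1, r1 = precision_recall(retrieved, relevant_at_t1)
print(f"t0: precision={p0:.2f} recall={r0:.2f}")
print(f"t1: precision={p1:.2f} recall={r1:.2f}")
```

Precision is the same at both times, but recall drops once the collection contains relevant documents that the result list misses. A static test collection never shows this effect.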

He also talked about personalization in search and the use of Minion for content-based recommendation systems.

Personalization in search is a hot topic right now. How long are queries useful? Which interests are short-lived (e.g. researching a trip) and which reflect a longer-term preference (say, software engineering and cooking)?

Steve spent quite a bit of time talking about collaborative filtering and recommendation engines. One memorable quote was, "recommendation is the new search." He talked about using Minion for the Aura recommendation engine. Paul Lamere used Minion to perform content-based similarity using the tags from Last.fm for a music collection. Their system using Minion was the best in their test.
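
For a flavor of what tag-based content similarity can look like, here is a small Python sketch using cosine similarity over tag counts. This is a generic technique that I am assuming for illustration; it is not Minion's API and not necessarily what Aura does. The artist names and tag counts are made up.

```python
import math
from collections import Counter


def cosine_similarity(tags_a, tags_b):
    """Cosine similarity between two bags of tags (tag -> count)."""
    dot = sum(count * tags_b.get(tag, 0) for tag, count in tags_a.items())
    norm_a = math.sqrt(sum(c * c for c in tags_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tags_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tag counts, in the spirit of social tags on a music site.
artists = {
    "artist_a": Counter({"rock": 3, "indie": 2}),
    "artist_b": Counter({"rock": 2, "indie": 3}),
    "artist_c": Counter({"jazz": 4, "piano": 1}),
}


def most_similar(name):
    """Return the other artist whose tags are closest to those of `name`."""
    others = (other for other in artists if other != name)
    return max(others, key=lambda o: cosine_similarity(artists[name], artists[o]))


print(most_similar("artist_a"))
```

Artists that share heavily weighted tags score close to 1, and artists with no tags in common score 0. A real system would probably weight tags (for example with TF-IDF) instead of using raw counts.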

One of the questions at the end was about Minion vs. Lucene. Steve has written about this on his blog, but I found his brief answers informative:
  • Minion supports data types beyond String, for example date and numeric values, which enable parametric query operations on fields.
  • Minion has an English morphology engine that generates different word forms for query expansion out of the box.
  • Minion has a run-time configuration system driven by XML files, whereas Lucene is configured in code.
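
To illustrate the first point, here is a short Python sketch of why typed fields matter for range queries. It is plain Python over invented documents, not Minion's or Lucene's API. It compares a range filter on numeric and date values with the same filter applied to the values as strings.

```python
from datetime import date

# Hypothetical documents with a numeric field and a date field.
docs = [
    {"title": "a", "pages": 9, "published": date(2008, 5, 1)},
    {"title": "b", "pages": 10, "published": date(2007, 11, 15)},
    {"title": "c", "pages": 120, "published": date(2008, 1, 20)},
]


def range_filter(docs, field, low, high):
    """Return titles of documents whose typed field lies in [low, high]."""
    return [d["title"] for d in docs if low <= d[field] <= high]


# Typed comparisons behave the way a user would expect.
typed_pages = range_filter(docs, "pages", 10, 99)
typed_dates = range_filter(docs, "published", date(2008, 1, 1), date(2008, 12, 31))

# The same page range compared as strings is lexicographic, so it goes wrong.
string_pages = [d["title"] for d in docs if "10" <= str(d["pages"]) <= "99"]

print(typed_pages, typed_dates, string_pages)
```

With string comparison, "9" and "120" both fall between "10" and "99", so all three documents match. An engine that knows the field is numeric avoids workarounds such as zero-padding values at index time.
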
A good video, overall.

After watching the video, I downloaded the Minion source code and started poking around. I couldn't find developer documentation, so I just went for it, and I encountered a few minor hiccups. First, I use Eclipse and the development team appears to use NetBeans, so I think I ran into some IDE-specific issues. The Ant build script failed because JUnit was missing from the classpath. The normal Eclipse build failed because the JavaCC-generated parser classes were not present; they are built by the Ant script. I managed to get the Ant script to work and build a jar file so that the rest of the project compiled.

One thing that would be really useful is a good set of examples. How does the XML run-time configuration system work? Does Minion support document boosting similar to Lucene?

When I get a bit more time I'll give it more of a shot on some data I have lying around.

4 comments:

  1. Jeff -- I'm also curious about Minion and had a similar experience with the source not compiling. Would you mind posting what you did to get it to compile?

  2. er... nevermind. I tracked down the faulty junit jar file path, in minion/nbproject/project.properties

  3. Hey, guys. Sorry for the hiccups. We are indeed a Netbeans shop (no surprise there I guess!) We're all on the java.net mailing lists, so if you post questions or problems there we can help you out.

    I'm hoping to blog a bit more about the configuration system in the next couple of weeks (there's a lot to talk about.)

    If you want a flavor of what it's like, have a look at the configuration files in com.sun.labs.minion.*Config.xml

  4. I successfully compiled Minion on an Ubuntu machine with the following two packages installed: ant and ant-optional.

    @stephen green: a simple step-by-step example would be a great extension to the current documentation.
