Wednesday, March 28

Open Source Search Engines, Retrieval Tools and Libraries

Updated 8/19/2009: The tools on this page are still pretty current. You may also wish to see Vik Singh's comparison of open source search engines, though I have issues with his results: I know from significant experience that Lucene does not have good relevancy out of the box, and since it appears to come out on top in that area, I'm very skeptical of the methodology.

After the popularity of my previous post on open source text mining tools, I thought I would follow up with a short list of the most important open source (free) information retrieval libraries in use today, grouped roughly by their use in industry vs. academia.

Industrial
Lucene - The de facto commercial standard for search. Lucene is the most widely used information retrieval library in industry. It is written in Java (although there are now ports in C, C#, Perl, Python, Ruby, and others). Lucene was originally written by Doug Cutting and is an Apache project. Lucene's biggest strength is its very active developer community. Its biggest weaknesses are ranking and scalability. Its ranking model uses a variation of the TF-IDF vector space model and enforces boolean constraints. It is a bit outdated and requires significant tweaking to produce relevant results, but it can be done; see my article Lucene Relevance beats state-of-the-art for details. Lucene also has performance and scalability issues on web-scale document collections: it starts to hit its limit at approximately 5-10 million web documents per commodity web server; see the Hurricane Katrina discussion on the Lucene mailing list.

It's important to realize that Lucene is an IR library, not a standalone search engine; for that you need Nutch. Because of this, the components a full search engine needs (crawlers, document converters, linguistic analysis tools, and similar plug-ins) do not come ready out of the box.
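To make the library-versus-engine distinction concrete, here is a minimal sketch of indexing and searching with Lucene's core API (2.x-era class names; constructor signatures vary across Lucene versions, so treat this as illustrative rather than exact):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.RAMDirectory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();

            // Index one document; in a real deployment a crawler and
            // document converters (which Lucene does not provide) feed this loop.
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("content", "open source information retrieval",
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Search: scoring is Lucene's TF-IDF vector space variant,
            // with boolean constraints enforced by the parsed query.
            IndexSearcher searcher = new IndexSearcher(dir);
            Query query = new QueryParser("content", new StandardAnalyzer())
                              .parse("information retrieval");
            TopDocs hits = searcher.search(query, 10);
            for (ScoreDoc sd : hits.scoreDocs) {
                System.out.println(searcher.doc(sd.doc).get("content") + "  " + sd.score);
            }
        }
    }

Everything else (fetching, parsing, deployment) is up to you, which is exactly why Nutch and Solr exist.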

Lucene has widespread industry adoption. It is used by Technorati, Monster.com's resume search [announcement on Lucene List], Amazon's Search Inside This Book, and many more. Lucene is the core library of the Nutch open source search engine, which powers Krugle. Lucene also powers Solr, a faceted search system donated by CNET; Solr powers CNET's product search. In short, Lucene is a mature and robust IR platform. It is a great choice if you have a small-to-medium-sized data set that needs indexing. It uses the Apache License.

Terrier - TERabyte RetrIEvER, from the Information Retrieval Group at the University of Glasgow. Terrier is both an academic and a commercial engine. It was originally designed as a research platform for relevance ranking methods. Specifically, it is a probabilistic engine built around the Divergence From Randomness (DFR) model, although it also supports many of the common relevance implementations, including the standard TF-IDF and BM25 models. There is a paper from OSIR 2006 that describes it in more detail. The latest release, Terrier 2.2, supports distributed indexing via Hadoop. Don't miss the Terrier Team's blog. It is released under the Mozilla license.
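Since BM25 comes up for several engines on this list (Terrier, MG4J, and tuned Lucene deployments), it's worth writing down. One standard formulation scores a document D against a query Q as:

    \mathrm{score}(D,Q) = \sum_{t \in Q} \log\frac{N - df_t + 0.5}{df_t + 0.5} \cdot \frac{tf_{t,D}\,(k_1 + 1)}{tf_{t,D} + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where N is the collection size, df_t is the document frequency of term t, tf_{t,D} is its frequency in D, and avgdl is the average document length; k_1 (typically 1.2-2.0) and b (typically 0.75) are free parameters.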

Hounder - Technically, this could also be grouped with Lucene. Hounder is a complete out-of-the-box search engine from Flaptor. It's written in Java and includes a distributed focused crawler (with a built-in classifier), an indexer, and a search system. It's most similar to Solr and Nutch; see their comparison. Hounder powers WordPress.com's search capability. Flaptor also claims to run a 300 million document collection on approximately 30 nodes. They released their cluster management system as Clusterfest.

Xapian - An engine written in C++ with a probabilistic ranking system. It descends from Muscat (later Open Muscat), originally developed at Cambridge University by Dr. Martin Porter (of Porter Stemmer fame); Xapian is the distant offspring of that engine. See its history page for more on its turbulent past. It has commercial support available through two consulting firms that contribute to the project. I'm not too familiar with this engine, but it apparently has several successful deployments in the enterprise search space.

Research platforms
Galago - A Java-based search engine from Trevor Strohman, who recently graduated from UMass Amherst and now develops infrastructure at Google. Trevor wrote Galago as part of his thesis. Here is his description:
It includes a distributed computation framework called TupleFlow which is an extension of MapReduce. In addition, it can build three different kinds of indexes, two of which are used in my dissertation, and a third kind which supports a subset of the Indri query language.
From what I understand, Galago is still early in its development. However, it is being used as the platform for the new IR textbook Search Engines: Information Retrieval in Practice, due out in early 2009.

Indri & Lemur - A joint project between UMass's CIIR and CMU. Indri is the search engine in the Lemur language modeling toolkit. It was developed as a platform for experimentation with ranking algorithms, specifically Language Modeling and Inference Networks. The two primary developers are Trevor Strohman and Paul Olgilvie. They gave a tutorial at SIGIR 2006 this past summer, the slides are online. They also have their TREC 2006 paper online, Indri at TREC 2006: Lessons Learned from Three Terabyte Tracks. It has a BSD-inspired license.

Minion - A new open source Java search engine written by Steve Green and Jeff Alexander from Sun Labs. Minion powers the search capability of Sun's portal server. The description from their recent JavaOne talk:
Minion is a capable full text search engine that provides integrated boolean, relational and proximity querying. Because Minion was developed as a research engine, it is designed to be highly configurable at runtime so that the user can decide which features and capabilities he needs for a particular job.
The closest competitor is Lucene. Steve has a whole series of articles comparing Minion and Lucene.

MG4J - Managing Gigabytes for Java, developed by Sebastiano Vigna and Paolo Boldi at the University of Milan, Italy. From their description: MG4J is a framework for building indices of large document collections based on the classical inverted-index approach. The kind of index constructed is highly configurable (e.g., you can choose your preferred coding method), and some new research has gone into providing efficient skips and minimal-interval semantics. It supports flexible scoring schemes, including BM25, and a variety of posting-list representations to balance performance and flexibility. It is distributed under the GNU Lesser General Public License (LGPL).

Wumpus - A project from the University of Waterloo, led by Charles Clarke and Stefan Büttcher. From their description:
One particular scenario that we are studying is file system search (aka "desktop search"), in which the underlying text collection is very dynamic and the number of expected index update operations is much greater than the number of search queries submitted by the users of the system.
However, Wumpus also seems to perform reasonably well on web documents in the TREC Terabyte track competitions. For a good overview of some of their lessons from TREC see: Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval (TREC 2005).

Zettair - From those Aussies at RMIT down under (including Justin Zobel and Alistair Moffat). Its main emphasis is research on the performance and scalability of search systems. It is written in C. Moffat and others use it as a platform for their work on novel index compression schemes and impact-sorted indexes. It scales to very large document collections. It is released under a BSD-style license.
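To give a flavor of this line of work, here is a sketch of variable-byte compression of docID gaps, one of the classic posting-list schemes this group has studied. It is a generic illustration in Java (for consistency with the other examples in this post); Zettair itself is written in C, and this is not its actual code:

    import java.io.ByteArrayOutputStream;

    public class VByte {
        // Encode one non-negative gap using 7 data bits per byte; the high
        // bit marks the final byte of each value (one common convention).
        static void encode(int gap, ByteArrayOutputStream out) {
            while (gap >= 128) {
                out.write(gap & 0x7F);   // low 7 bits; clear high bit = "more bytes follow"
                gap >>>= 7;
            }
            out.write(gap | 0x80);       // final byte: high bit set
        }

        // Postings are stored as gaps between successive docIDs; gaps are
        // small numbers and therefore compress well.
        static byte[] compress(int[] docIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int id : docIds) {
                encode(id - prev, out);
                prev = id;
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            // docIDs 5, 9, 130 -> gaps 5, 4, 121 -> one byte each
            byte[] packed = compress(new int[] {5, 9, 130});
            System.out.println(packed.length + " bytes");
        }
    }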

Search library comparisons
There is a good comparison of the performance of Zettair, Wumpus, and Indri in the TREC 2006 Terabyte Track paper.

Also, Christian Middleton of UPF and Ricardo Baeza-Yates of UPF/Yahoo! somewhat recently published A Comparison of Open Source Search Engines. It's a good start, but more details on their experimental methodology (i.e., system configurations) would be helpful. Grant Ingersoll, a Lucene committer, replied in a follow-up blog post and started an interesting discussion.

Top industrial choices
If you need a simple solution for small to medium scale deployments I highly recommend Lucene, although it requires getting your hands dirty to make it work well. There are also a number of companies that provide Lucene support. Minion would be another choice here to consider, but it is less mature and lacks commercial support.

For larger scale retrieval I would recommend a Lucene derivative or possibly Terrier. However, in my experience, if you are trying to build a Google/Yahoo/Microsoft-scale search engine, none of these will do the job in a cost-effective manner.
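To make the scaling picture concrete: web-scale engines partition the collection across machines, have each shard return its local top results, and merge them at a broker. Here is a minimal, illustrative sketch of that merge step (class and method names are mine, not from any engine above); note that scores are only comparable across shards if collection statistics such as IDF are shared or corrected, a point also raised in the comments below:

    import java.util.*;

    public class MergeTopK {
        static class Hit {
            final String docId; final float score;
            Hit(String docId, float score) { this.docId = docId; this.score = score; }
        }

        // Document-partitioned search: each shard returns its own top-k;
        // the broker combines them into a global top-k by score.
        static List<Hit> merge(List<List<Hit>> shardResults, int k) {
            List<Hit> all = new ArrayList<Hit>();
            for (List<Hit> shard : shardResults) all.addAll(shard);
            Collections.sort(all, new Comparator<Hit>() {
                public int compare(Hit a, Hit b) { return Float.compare(b.score, a.score); }
            });
            return all.subList(0, Math.min(k, all.size()));
        }

        public static void main(String[] args) {
            List<Hit> shard1 = Arrays.asList(new Hit("d1", 2.4f), new Hit("d7", 1.1f));
            List<Hit> shard2 = Arrays.asList(new Hit("d9", 3.0f));
            for (Hit h : merge(Arrays.asList(shard1, shard2), 2))
                System.out.println(h.docId + " " + h.score);
        }
    }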

Top academic choices
My top choices here are Indri/Lemur, Terrier, Galago, and Zettair. Disclosure: I attend UMass, which developed Indri and Galago.

Updated 4-19-2008: Added Xapian to the industrial list and Galago and MG4J to the list of research engines. Updated 5-18-2008: Added Minion and Hounder. Updated 1-17-2009: Updated descriptions based on recent releases.

9 comments:

  1. Oh, I didn't know Amazon's search inside the book uses Lucene. Where did you see they use Lucene for that?
    Simpy uses Lucene, too, of course.

    ReplyDelete
  2. Hi Otis,

    I talked to some of the Amazon developers at SIGIR 2006. They were giving out free t-shirts at their booth :-).

    ReplyDelete
  3. Hi Jeff,

    Thanks for posting this list. Two quick comments...

    1. I'd also mention that Lucene is used by Solr, a cool enterprise search server (used by CNET, Krugle and others).

    2. The maximum usable size of a single Lucene index has many free variables. For a typical Nutch-generated index, I think an upper bound of 10-20M documents (assuming standard hardware) is about right. Document size, number of fields, query complexity, and amount of RAM are among the factors that can push this number up or down.

    Lucene does support merging results from multiple indexes, and adjusting for IDF skew in the process. The main problem here (IMO) with effectively using this support is that the operational support (e.g. code/scripts for managing federated searchers) doesn't really exist.

    ReplyDelete
  4. Do you know of existing projects that are dedicated to crawling only certain document types: PDF, DOC, and PPT?

    I'm interested in building a web site that would allow the community of researchers, lawyers, etc. to search these document types directly.

    I know Google has a filetype: filter, but I'd like to do more around the document: a la Digg, bookmarking them, discussing them, etc.

    ReplyDelete
  5. Anonymous, 3:57 PM EDT

    TopX would be another engine from the research world that you could add...

    It is now available at:

    http://topx.sourceforge.net/

    ReplyDelete
  6. Anonymous, 11:38 AM EDT

    Jeff,

    Thanks for putting this site together - very informative.

    I wanted to get your thoughts on one of my requirements:

    I am trying to build a search application that would require the following features of a search engine:

    1. ability to handle federated searches (collating search results, ranking, etc)

    2. The federated searches have to support Web-services to access data from multiple sources and in some cases be able to go directly against a database.

    I am looking for a pluggable java-based open source solution. Would you have any thoughts on this?


    Thanks...

    ReplyDelete
  7. On the MG4J subject you might mention Mímir -- http://gate.ac.uk/family/mimir.html -- which layers annotation structure and semantic search (via an RDF repository) on top of MG4J.

    A nice list -- thanks!

    ReplyDelete
  8. Thanks for compiling and publishing all these resources, very handy indeed.

    ReplyDelete
  9. Where is Sphinx? You don't even include it.

    ReplyDelete