Wednesday, March 28

Current Open Source Search Engine Libraries

After the popularity of my previous post on open source text mining tools, today I thought I would follow up with my list of open source information retrieval (search) libraries.

Here is my short list of the most important open source [free] information retrieval libraries in use today that are under active development as of this writing. It is a mixture of engines used for both research and industrial applications.

Industrial

Hounder - Technically, this could also be grouped with Lucene. Hounder is a complete out-of-the-box search engine by Flaptor. It's written in Java and includes a distributed focused crawler (with a built-in classifier), an indexer, and a search system. It's most similar to Solr and Nutch; see their comparison. It appears to use Lucene as its underlying search library. Hounder powers Wordpress.com's search capability. Flaptor also claims they have a 300 million document collection running on approximately 30 nodes. They released their cluster management system as Clusterfest.

Lucene - The de facto commercial standard for search. Lucene is the most widely used information retrieval library in industry. It is written in Java (although there are now ports in C, C#, Perl, Python, Ruby, and others). Lucene was originally written by Doug Cutting (now at Yahoo working on Hadoop) and is an Apache project. It uses a variation of the TF-IDF vector space model and enforces boolean constraints. Lucene's biggest strength is its very active developer community. Its biggest weakness is scalability and performance on very large data sets. Lucene starts to hit its limit at approximately 5-10 million web documents per commodity web server; see the Hurricane Katrina discussion on the Lucene mailing list (I am open to correction here if someone has new data). Another weakness is that document converters, linguistic analysis tools, and similar plug-ins are not ready out of the box. Lucene is an IR library, not a standalone search engine; for that you need Nutch. An alternative is IBM OmniFind Yahoo Edition, which is built on top of Lucene and uses OmniFind's analysis and document conversion tools. OmniFind Yahoo Edition's advertised limit is half a million documents.
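To make "TF-IDF ranking with boolean constraints" concrete, here is a minimal Python sketch. It is not Lucene's API or its actual scoring formula; the function names, the whitespace tokenizer, and the sqrt(tf) * idf weighting are assumptions chosen for illustration. Required terms act as the boolean filter, and TF-IDF only ranks the documents that pass it.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_stats(docs):
    """Return per-document term counts and per-term document frequencies."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    return term_counts, doc_freq


def search(docs, required, optional=()):
    """Rank docs that contain every required term by a simple TF-IDF score."""
    term_counts, doc_freq = build_stats(docs)
    n = len(docs)
    results = []
    for doc_id, counts in enumerate(term_counts):
        # Boolean constraint: skip documents missing any required term.
        if not all(t in counts for t in required):
            continue
        score = 0.0
        for t in list(required) + list(optional):
            tf = counts.get(t, 0)
            if tf:
                # Illustrative weighting, not Lucene's exact formula.
                idf = math.log(n / doc_freq[t]) + 1.0
                score += math.sqrt(tf) * idf
        results.append((doc_id, score))
    # Highest score first.
    results.sort(key=lambda r: r[1], reverse=True)
    return results


docs = ["lucene search library", "search engine search", "java library"]
ranked = search(docs, required=["search"], optional=["lucene"])
```

In this toy collection the third document is filtered out because it lacks the required term, and the first document ranks above the second because it also matches the rarer optional term.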

Lucene has widespread industry adoption. It is used by Technorati, Monster.com's resume search [announcement on Lucene List], Amazon's Search Inside This Book, and many more. Lucene is the core library of the Nutch open source search engine which powers Krugle. Lucene also powers Solr, a faceted search system donated by CNet. Solr powers CNet's product search. In short, Lucene is a mature and robust IR platform. It is a great choice if you have a small to medium sized data set that needs indexing. It uses the Apache License.

Terrier - TERabyte RetrIEvER from the University of Glasgow. Terrier is a relative newcomer to the commercial space. It was originally designed as a research platform for relevance ranking methods. Specifically, it is a probabilistic engine that uses the "Divergence from Randomness" (DFR) model, although it now has a wide variety of ranking implementations, including the standard TF-IDF and BM25 models. There is a paper from OSIR 2006 that describes it in more detail. It is released under the Mozilla license.

Xapian - An engine written in C++ with a probabilistic ranking system. It began as Open Muscat, an engine developed at Cambridge University by Dr. Martin Porter (of Porter Stemmer fame); Xapian is the distant offspring of that engine. See its history page for more on its turbulent past. It has commercial support available through two consulting firms who contribute to the project. I'm not too familiar with this engine, but it apparently has several successful deployments, especially in the enterprise space.

Research platforms

Galago - A new Java based search engine from Trevor Strohman, who recently graduated from UMass Amherst and is now at Google. Trevor wrote Galago as part of his thesis. Here is his description:
It includes a distributed computation framework called TupleFlow which is an extension of MapReduce. In addition, it can build three different kinds of indexes, two of which are used in my dissertation, and a third kind which supports a subset of the Indri query language.
From what I understand, Galago is still early in its development. However, it is being used as the platform for the new IR textbook: Search Engines: Information Retrieval In Practice due out in early 2009.

Indri - A joint project between UMass's CIIR and CMU. It is a platform for experimentation with new ranking algorithms, specifically Language Modeling and Inference Networks. The two primary developers are Trevor Strohman and Paul Ogilvie. They gave a tutorial at SIGIR 2006 this past summer; the slides are online. They also have their TREC 2006 paper online, Indri at TREC 2006: Lessons Learned from Three Terabyte Tracks. It has a BSD-inspired license.

MG4J - Managing Gigabytes for Java, developed by Sebastiano Vigna and Paolo Boldi from the University of Milano in Italy. From their description: MG4J is a framework for building indices of large document collection based on the classical inverted-index approach. The kind of index constructed is very configurable (e.g., you can choose your preferred coding method), and moreover some new research has gone into providing efficient skips and minimal-interval semantics. It supports flexible scoring schemes, including BM25, and a variety of posting list representations to balance performance and flexibility. It is distributed under the GNU Lesser General Public License (LGPL).
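For readers unfamiliar with what a "coding method" means for an inverted index, here is a small Python sketch. It is not MG4J code and does not reflect its API; the function names are hypothetical, and variable-byte coding of document-ID gaps is just one common choice among many possible codes. The idea is that each term maps to a sorted list of document IDs, which is stored as the differences between consecutive IDs so that most numbers are small and compress well.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the ascending list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def vbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:
            # Terminating byte: add its 7 payload bits and emit the number.
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


def compress_postings(doc_ids):
    """Store a non-empty ascending posting list as variable-byte coded gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild document IDs by summing the decoded gaps."""
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids


index = build_inverted_index(["open source search", "search engine", "open index"])
packed = compress_postings([3, 130, 131, 1000])
restored = decompress_postings(packed)
```

Swapping `vbyte_encode` for a different code (gamma, delta, Golomb, and so on) changes the space and speed trade-off without touching the rest of the index, which is the kind of configurability the description refers to.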

Minion - A new open source Java search engine written by Steve Green and Jeff Alexander from Sun Labs. Minion powers the search capability of Sun's portal server. The description from their recent JavaOne talk:
Minion is a capable full text search engine that provides integrated boolean, relational and proximity querying. Because Minion was developed as a research engine, it is designed to be highly configurable at runtime so that the user can decide which features and capabilities he needs for a particular job.
The closest competitor is Lucene. Steve has a whole series of articles comparing Minion and Lucene.

Wumpus - A project from the University of Waterloo, namely Charles Clarke and Stefan Buttcher. From their description:
One particular scenario that we are studying is file system search (aka "desktop search"), in which the underlying text collection is very dynamic and the number of expected index update operations is much greater than the number of search queries submitted by the users of the system.
However, Wumpus also seems to perform reasonably well on web documents in the TREC Terabyte track competitions. For a good overview of some of their lessons from TREC see: Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval (TREC 2005).

Zettair - From those Aussies at RMIT down under (including Justin Zobel and Alistair Moffat). Its main emphasis is research on the performance and scalability of search systems. It is written in C. Moffat and others use it as a platform for their work on novel index compression schemes and impact-sorted indexes. It scales to very large document collections. It is released under a BSD-style license.

There is a good comparison of the performance of Zettair, Wumpus, and Indri in the TREC 2006 Terabyte Track paper.

Updated 4-19-2008: Added Xapian to the industrial list and Galago and MG4J to the list of research engines.

Updated 5-18-2008: Added Minion and Hounder.

Also, Christian Middleton of UPF and Ricardo Baeza-Yates from UPF/Yahoo! somewhat recently published A Comparison of Open Source Search Engines. It's a good start, but more details on their experimental methodology (i.e., system configurations) would be helpful. Grant Ingersoll, a Lucene committer, replied in a follow-up blog post and started an interesting discussion.

6 comments:

Otis Gospodnetic said...

Oh, I didn't know Amazon's search inside the book uses Lucene. Where did you see they use Lucene for that?
Simpy uses Lucene, too, of course.

jeff.dalton said...

Hi Otis,

I talked to some of the Amazon developers at SIGIR 2006. They were giving out free t-shirts at their booth :-).

Ken Krugler said...

Hi Jeff,

Thanks for posting this list. Two quick comments...

1. I'd also mention that Lucene is used by Solr, a cool enterprise search server (used by CNET, Krugle and others).

2. The maximum usable size of a single Lucene index has many free variables. For a typical Nutch-generated index, I think the upper bounds of 10-20M documents (assuming standard hardware) is about right. Document size, number of fields, complexity of the query, and amount of RAM are among the factors that can make this number go up or down.

Lucene does support merging results from multiple indexes, and adjusting for IDF skew in the process. The main problem here (IMO) with effectively using this support is that the operational support (e.g. code/scripts for managing federated searchers) doesn't really exist.

laurent said...

Do you know of existing projects that are dedicated to crawl only certain document type: PDF, DOC and PPT?

I'm interested in building a web site that would allow the community of researchers, lawyers, etc. directly search these document types.

I know Google has a filetype: filter, but I'd like to do more around the document: a la Digg, bookmarking them, discussing them, etc.

Anonymous said...

TopX would be another engine from the research world that you could add...

It is now available at:

http://topx.sourceforge.net/

Anonymous said...

Jeff,

Thanks for putting this site together - very informative.

I wanted to get your thoughts on one of my requirements:

I am trying to build a search application that would require the following features of a search engine:

1. ability to handle federated searches (collating search results, ranking, etc)

2. The federated searches have to support Web-services to access data from multiple sources and in some cases be able to go directly against a database.

I am looking for a pluggable java-based open source solution. Would you have any thoughts on this?


Thanks...