Thursday, March 29

Not your average trailer park trash: Mobile Data Centers, Part II

Last year I wrote about Google's data-center-in-a-trailer project; if you missed it, it is a great read (a spoof on Cold War maneuvering). As it turns out, Google canceled the project (at least that's what we're being told...), but other providers have created similar offerings, namely Sun's Project Blackbox and now Rackable's more nondescript Concentro.

A brief description of Concentro:
Designed to augment or replace traditional brick-and-mortar data centers of any size, Concentro features compute density levels of up to 9600 cores in a 40' x 8' container, or up to 3.5 petabytes of storage for maximum density per square foot.
Take a look at the pictures from both companies; they look like something out of Knight Rider or Tron.

Google should not be complacent; after all, James Hamilton and the rest of the enemy are fighting back. His website has his latest presentations and papers documenting Microsoft's war effort in this arena.

There is more coverage at Geeking with Greg and from Nick Carr (Showdown in the trailer park).

Conference coverage: ICWSM and ETech

Now that spring is upon us, conference season is beginning. This week there were two interesting conferences:
The International Conference on Weblogs and Social Media (ICWSM)

There is more information on the ICWSM blog and also a few brief mentions by Matthew Hurst.

Also this week there was O'Reilly's Emerging Technologies 2007 (ETech)

O'Reilly Radar has a great roundup. The ETech Blog has coverage, including some interesting podcasts for download.

Wednesday, March 28

Open Source Search Engines, Retrieval Tools and Libraries

Updated 8/19/2009: The tools on this page are still pretty current. You may also wish to see Vik Singh's comparison of open source search engines. I do have issues with his results: I know from significant experience that Lucene does not have good relevance out of the box, and since it appears to come out on top in that area, I'm very skeptical of the results.

After the popularity of my previous post on open source text mining tools, I thought I would follow up with a short list of the most important open source [free] information retrieval (search) libraries in use today, grouped roughly by their use in industry versus academia.

Industrial
Lucene - The de facto commercial standard for search. Lucene is the most widely used information retrieval library in industry. It is written in Java (although there are now ports in C, C#, Perl, Python, Ruby, and others). Lucene was originally written by Doug Cutting and is an Apache project. Lucene's biggest strength is its very active developer community. Its biggest weaknesses are ranking and scalability. Its ranking model uses a variation of the TF-IDF vector space model and enforces Boolean constraints. It is a bit outdated and requires significant tweaking to produce relevant results, but it can be done; see my article Lucene Relevance beats state-of-the-art for details. Lucene also has performance and scalability issues on web-scale document collections: it starts to hit its limit at approximately 5-10 million web documents per commodity web server. See the Hurricane Katrina discussion on the Lucene mailing list.
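
If you have never used it, here is roughly what driving Lucene looks like. This is a minimal sketch against the classic 2.x-era API (class names and signatures have shifted across releases, and the index path is an arbitrary choice), just to show the index-then-search flow:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class HelloLucene {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one document ("/tmp/index" is an arbitrary location).
            IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true);
            Document doc = new Document();
            doc.add(new Field("body",
                    "Lucene is an IR library, not a complete search engine.",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Search it; scores come from Lucene's TF-IDF vector space model.
            IndexSearcher searcher = new IndexSearcher("/tmp/index");
            Query query = new QueryParser("body", analyzer).parse("IR library");
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.score(i) + "\t" + hits.doc(i).get("body"));
            }
            searcher.close();
        }
    }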

It's important to realize that Lucene is an IR library, not a standalone search engine; for that you need Nutch. Because of this, the components a full search engine needs (crawlers, document converters, linguistic analysis tools, and similar plug-ins) do not come ready out of the box.

Lucene has widespread industry adoption. It is used by Technorati, Monster.com's resume search [announcement on the Lucene list], Amazon's Search Inside This Book, and many more. Lucene is the core library of the Nutch open source search engine, which powers Krugle. Lucene also powers Solr, a faceted search system donated by CNet; Solr powers CNet's product search. In short, Lucene is a mature and robust IR platform, and a great choice if you have a small- to medium-sized data set that needs indexing. It uses the Apache License.

Terrier - TERabyte RetrIEvER, from the Information Retrieval Group at the University of Glasgow. Terrier is both an academic and a commercial engine. It was originally designed as a research platform for relevance ranking methods. Specifically, it is a probabilistic engine that uses the "Divergence from Randomness" (DFR) model, although it supports many of the common relevance implementations, including the standard TF-IDF and BM25 models. There is a paper from OSIR 2006 that describes it in more detail. The latest release, Terrier 2.2, supports distributed indexing via Hadoop. Don't miss the Terrier team's blog. It is released under the Mozilla Public License.
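
For reference, here is the standard BM25 scoring function (the textbook Robertson formulation, with the usual defaults k1 ≈ 1.2 and b ≈ 0.75; not necessarily Terrier's exact parameterization):

    \mathrm{score}(d,q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1+1)}{f(t,d) + k_1\,(1 - b + b\,|d|/\mathrm{avgdl})}

where f(t,d) is the term's frequency in document d, |d| is the document length, and avgdl is the average document length in the collection.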

Hounder - Technically, this could also be grouped with Lucene. Hounder is a complete out-of-the-box search engine by Flaptor. It's written in Java and includes a distributed focused crawler (with a classifier), an indexer, and a search system. It's most similar to Solr and Nutch; see their comparison. Hounder powers WordPress.com's search capability. Flaptor also claims to have a 300 million document collection running on approximately 30 nodes. They released their cluster management system as Clusterfest.

Xapian - An engine written in C++ with a probabilistic ranking system. It began life as Open Muscat, developed at Cambridge University by Dr. Martin Porter (of Porter stemmer fame); Xapian is the distant offspring of that engine. See its history page for more on its turbulent past. Commercial support is available through two consulting firms that contribute to the project. I'm not too familiar with this engine, but it apparently has several successful deployments in the enterprise search space.

Research platforms
Galago - A Java-based search engine from Trevor Strohman, who recently graduated from UMass Amherst and now develops infrastructure at Google. Trevor wrote Galago as part of his thesis. Here is his description:
It includes a distributed computation framework called TupleFlow which is an extension of MapReduce. In addition, it can build three different kinds of indexes, two of which are used in my dissertation, and a third kind which supports a subset of the Indri query language.
From what I understand, Galago is still early in its development. However, it is being used as the platform for the new IR textbook, Search Engines: Information Retrieval in Practice, due out in early 2009.

Indri & Lemur - A joint project between UMass's CIIR and CMU. Indri is the search engine in the Lemur language modeling toolkit. It was developed as a platform for experimentation with ranking algorithms, specifically language modeling and inference networks. The two primary developers are Trevor Strohman and Paul Ogilvie. They gave a tutorial at SIGIR 2006 this past summer; the slides are online. They also have their TREC 2006 paper online, Indri at TREC 2006: Lessons Learned from Three Terabyte Tracks. It has a BSD-inspired license.

Minion - A new open source Java search engine written by Steve Green and Jeff Alexander from Sun Labs. Minion powers the search capability of Sun's portal server. The description from their recent JavaOne talk:
Minion is a capable full text search engine that provides integrated boolean, relational and proximity querying. Because Minion was developed as a research engine, it is designed to be highly configurable at runtime so that the user can decide which features and capabilities he needs for a particular job.
The closest competitor is Lucene. Steve has a whole series of articles comparing Minion and Lucene.

MG4J - Managing Gigabytes for Java, developed by Sebastiano Vigna and Paolo Boldi from the University of Milan in Italy. From their description: "MG4J is a framework for building indices of large document collections based on the classical inverted-index approach. The kind of index constructed is very configurable (e.g., you can choose your preferred coding method), and moreover some new research has gone into providing efficient skips and minimal-interval semantics." It supports flexible scoring schemes, including BM25, and a variety of posting list representations to balance performance and flexibility. It is distributed under the GNU Lesser General Public License (LGPL).

Wumpus - A project from the University of Waterloo, namely Charles Clarke and Stefan Büttcher. From their description:
One particular scenario that we are studying is file system search (aka "desktop search"), in which the underlying text collection is very dynamic and the number of expected index update operations is much greater than the number of search queries submitted by the users of the system.
However, Wumpus also seems to perform reasonably well on web documents in the TREC Terabyte track. For a good overview of some of their lessons from TREC, see Efficiency vs. Effectiveness in Terabyte-Scale Information Retrieval (TREC 2005).

Zettair - From those Aussies at RMIT down under (including Justin Zobel and Alistair Moffat). Its main emphasis is research on the performance and scalability of search systems. It is written in C. Moffat and others use it as a platform for their work on novel index compression schemes and impact-sorted indexes. It scales to very large document collections. It is released under a BSD-style license.

Search library comparisons
There is a good comparison of the performance of Zettair, Wumpus, and Indri in the TREC 2006 Terabyte Track paper.

Also, Christian Middleton of UPF and Ricardo Baeza-Yates of UPF/Yahoo! recently published A Comparison of Open Source Search Engines. It's a good start, but more details on their experimental methodology (i.e., system configurations) would be helpful. Grant Ingersoll, a Lucene committer, replied in a follow-up blog post and started an interesting discussion.

Top industrial choices
If you need a simple solution for small- to medium-scale deployments, I highly recommend Lucene, although it requires getting your hands dirty to make it work well. There are also a number of companies that provide Lucene support. Minion would be another choice to consider here, but it is less mature and lacks commercial support.

For larger-scale retrieval I would recommend a Lucene derivative or possibly Terrier. However, in my experience, if you are trying to build a Google/Yahoo/Microsoft-scale search engine, none of these will do the job in a cost-effective manner.

Top academic choices
My top choices here are Indri/Lemur, Terrier, Galago, and Zettair. Disclosure: I started attending UMass, which developed Indri and Galago.

Updated 4/19/2008: Added Xapian to the industrial list and Galago and MG4J to the list of research engines. Updated 5/18/2008: Added Minion and Hounder. Updated 1/17/2009: Updated descriptions based on recent releases.

Tuesday, March 27

Cool link: Applications of Artificial Intelligence to Information Retrieval

I ran across this neat link today:
Applications of AI: Information Retrieval from the Association for the Advancement of Artificial Intelligence. A description:

Our accustomed systems of retrieving particular bits of information no longer fill the needs of many people... Artificial intelligence may hold the key to information retrieval in an age where widely different formats contain the information being sought, and the universe of knowledge is simply too big and growing too rapidly for successful searching to proceed at a human's slow speed.
It is a great resource of links on the application of AI to information extraction, text mining, machine translation, personalization, and lots of other interesting applications worth reading about.

One particular link of note (of many) is the link to Peter Norvig's presentation Theorizing from data: Avoiding the capital mistake given at UC Berkeley last September.

Monday, March 26

Query Expansion: an alternative to static stemming

Last Friday, I wrote about the Stemming Dilemma and suggested query expansion as an alternative to indexing stems. Here are some more thoughts on query expansion.

First, my definition:
Query expansion is the process of automatically adding or suggesting new search terms in response to a query. For example, a query for motor might be expanded to (motor OR motors) to handle pluralization, or expanded with synonyms so that automobiles are included.

From the user's perspective, query expansion can be equivalent to stemming/lemmatization. Like stemming, query expansion can improve recall, sometimes at the cost of precision (by introducing non-relevant documents).
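
To be precise about that tradeoff, the two measures are defined as:

    \mathrm{precision} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{retrieved}|} \qquad \mathrm{recall} = \frac{|\mathrm{relevant} \cap \mathrm{retrieved}|}{|\mathrm{relevant}|}

OR-style expansion can only grow the retrieved set, so recall can only go up; but every non-relevant document the new terms pull in drags precision down.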

Query expansion offers greater flexibility than indexing the stem in the same position as the original term, because both the expansion itself and the query term weights can be adjusted on a per-query basis. Enabling an expansion at the per-query level is important because an expansion may be beneficial for one query but detrimental to another. One significant drawback to query expansion is that adding words to a user's query creates a more complex query that is more resource-intensive to answer. Like stemming, it can also have a detrimental effect on precision unless used with care.
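
As a concrete illustration of per-query expansion and weighting, here is a minimal sketch using Lucene's classic API (pre-5.x; newer versions use BooleanQuery.Builder and BoostQuery instead of setBoost). The field name and the 0.5 down-weight are arbitrary illustrative choices:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ExpandedQueryExample {
        // Build (original OR variant1^0.5 OR variant2^0.5 ...): the original
        // term counts fully, while expansion terms are down-weighted so
        // exact matches still rank highest.
        public static Query expand(String field, String original, String[] variants) {
            BooleanQuery expanded = new BooleanQuery();
            expanded.add(new TermQuery(new Term(field, original)),
                         BooleanClause.Occur.SHOULD);
            for (String variant : variants) {
                TermQuery clause = new TermQuery(new Term(field, variant));
                clause.setBoost(0.5f); // per-query weight, tunable for each query
                expanded.add(clause, BooleanClause.Occur.SHOULD);
            }
            return expanded;
        }

        public static void main(String[] args) {
            // motor -> (motor OR motors^0.5 OR automobile^0.5)
            Query q = expand("body", "motor", new String[] {"motors", "automobile"});
            System.out.println(q); // body:motor body:motors^0.5 body:automobile^0.5
        }
    }

Because the expansion is assembled at query time, it can be switched off or re-weighted for each individual query without re-indexing anything, which is exactly the flexibility a static stemming index lacks.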

Query Expansion in action
Google allows query expansion in their enterprise search product; here's the how-to from their blog.

LucQue - A module for doing query expansion with Lucene. It uses Google's web API to find expansion terms.

Recommended Reading:
Improving Automatic Query Expansion
Automatic Query Expansion Using SMART : TREC 3
Probabilistic Query Expansion Using Query Logs
Query Expansion Using Local and Global Document Analysis
Introduction to Information Retrieval, Chapter 9: Query Expansion and Relevance Feedback

It's interesting to note that most of these early papers deal with small corpora (at least in comparison with today's web) and expand queries with hundreds of terms. On small corpora, aggressive query expansion or stemming can significantly improve recall and therefore greatly improve effectiveness. However, as corpus size grows, precision becomes more important, and query expansion can introduce noise that reduces effectiveness (an especially important concern for web search). Also, the extra resources needed to answer expanded queries are less problematic on small corpora than on large data sets. In short, query expansion can have significant utility on smaller corpora, but is perhaps not as helpful on large ones.

Query expansion is also often closely related to relevance feedback... but that's fodder for a future post.