Thursday, April 12

Amazon CloudSearch, Elastic Search as a Service

The search division of Amazon, A9 today announced the release of CloudSearch.  Amazon CTO Werner Vogels announced it on his blog, All Thing Distributed.  The AWS service also has a new post on the announcement.

For the details and pricing, there is also the official CloudSearch details page.

CloudSearch is a fully managed search service based on Amazon's search infrastructure that provides near-realtime, faceted, scalable search.  The index is stored in memory for fast search and updates.

Dynamic Scaling
What makes A9 offering particularly interesting is it's ability to dynamically scale.  The architecture of A9's search system, with shards and replicas, is a common and well-understood model.  What makes Amazon's offering unique is the ability to easily scale your search cluster.  A9 will automatically add (and remove) search instances and index partitions as the index size grows and shrinks.  It will also dynamically add and remove replicas to respond to changes in search request traffic.    The exact details are still not clearly described technically in detail.

Right now, there is a limit to 50 search instances.  An extra large search instance can handle approximately 8 Million 1K documents. It appears that assumption is that the documents are quite small (e.g. product documents).  To put it in perspective, an rough rule of thumb for web documents is approximately 10k.  Given this, it translates into roughly 800k web documents per server * 50 servers = 40 million web documents.  This is not for building large-scale web search, yet.  However, it should be more than enough for most enterprise e-commerce and site-search applications.

The real value added by the search engine is in the ranking of results.

The control over the search index ranking is rudimentary with a few basic knobs.  You can add stopwords, perform stemming, and add synonyms.  This is very basic stuff.    How you might do more interesting (and important) IR ranking changes is vague.  From the article,
Rank expressions are mathematical functions that you can use to change how search results are ranked. By default, documents are ranked by a text relevance score that takes into account the proximity of the search terms and the frequency of those terms within a document. You can use rank expressions to include other factors in the ranking. For example, if you have a numeric field in your domain called 'popularity,' you can define a rank expression that combines popularity with the default text relevance score to rank relevant popular documents higher in your search results.
This indicators that it is possible to boost documents.  However, it is unclear how the underlying text search works in order to boost individual important fields (e.g. name, description).

For more details on the more advanced query processing needed to make search work in practice, read the post: Query Rewriting in Search Engines from Hugh Williams at EBay.  In order to employ these methods, you need log data, which brings me to my next point.

Missing Pieces
A key missing component is usage-driven framework to improve ranking that uses queries, clicks, and other user behavior indicators.  A feedback mechanism to change ranking based analysis (ideally automatic).

Overall, the most compelling aspect of this is the dynamic scaling.  It gives people a simple, platform that scales transparently for many enterprise search and ecommerce applications.