Thursday, April 28

Greplin: Personal Search for the Social Network Era

Greplin is a cloud search service that indexes your social network data and the personal information you keep in web services. It provides a central hub for searching all your online data. Greplin is a small startup with six engineers. Instead of building its own cluster, it uses Amazon EC2 for indexing capacity. A TechCrunch article today reports:
They’ve now indexed some 1.5 billion documents. And they’re indexing about 30 million new documents per day.
I think the TechCrunch article overstates the scale challenge. The more significant scale issues relate to query volume, and the article does not report those numbers. Furthermore, a large share of the documents Greplin indexes are short Facebook and Twitter updates. Greplin also has more relaxed indexing requirements than real-time search: its FAQ says it can take up to 20 minutes, or even up to a day, to index your documents.
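To put the indexing rate in perspective, here is a quick back-of-the-envelope calculation using only the two figures quoted above. It assumes documents arrive evenly throughout the day, which is my simplification and not something the article states.

```python
# Figures quoted from the TechCrunch article.
TOTAL_DOCS = 1_500_000_000
NEW_DOCS_PER_DAY = 30_000_000

SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: ingestion is spread evenly over the day (real traffic is bursty).
docs_per_second = NEW_DOCS_PER_DAY / SECONDS_PER_DAY

# How long it takes to add as many documents as are already indexed.
days_to_double = TOTAL_DOCS / NEW_DOCS_PER_DAY

print(f"{docs_per_second:.0f} documents/second on average")
print(f"{days_to_double:.0f} days to add another 1.5 billion documents")
```

That works out to roughly 347 documents per second on average, with the index doubling in about 50 days at the current rate.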

My current Greplin index has approximately 54,000 documents: about 30,000 from Gmail, 17,000 from Twitter, 7,000 from Facebook, and around 500 from LinkedIn. The basic search functionality seems reasonable enough, and search-as-you-type is very snappy. The advanced search capabilities are a bit limited. For example, search by date is missing.

Greplin is still in its infancy. The search interface could benefit from blending document results from different sources into a more unified result list. For example, see the recent work on "aggregated" and "federated" search [e.g. A Methodology for Evaluating Aggregated Search Results from ECIR 2011]. Furthermore, I would like a faceted search UI to support exploratory search. They could learn a lot by looking at the extensive research on Personal Information Management (PIM) and desktop search, like Jaime Teevan's research along with Sue Dumais' work on Landmarks and Stuff I've Seen. (For more on PIM, you can also read the work of Jinyoung Kim, one of my labmates.)
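To make the blending idea concrete, here is a minimal sketch of one simple baseline: rescale each source's scores to a common range, then merge everything into a single ranked list. This is not Greplin's method and not the approach from the ECIR paper; the `Result` type, the `normalize` and `blend` functions, and the raw scores are all hypothetical names and values I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    source: str   # e.g. "gmail" or "twitter"
    title: str
    score: float  # raw relevance score, only comparable within one source


def normalize(results):
    """Min-max scale one source's scores to [0, 1] so sources become comparable."""
    if not results:
        return []
    lo = min(r.score for r in results)
    hi = max(r.score for r in results)
    span = hi - lo
    # If every score is identical, treat them all as equally (fully) relevant.
    return [
        Result(r.source, r.title, (r.score - lo) / span if span else 1.0)
        for r in results
    ]


def blend(per_source, limit=10):
    """Merge per-source result lists into one list ranked by normalized score."""
    merged = [r for results in per_source.values() for r in normalize(results)]
    merged.sort(key=lambda r: r.score, reverse=True)
    return merged[:limit]
```

Min-max scaling is crude because it ignores how trustworthy each source's scores are. A real aggregated search system would likely learn per-source weights or decide per query which sources to show at all.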

I have significant reservations about the privacy of my data. Do I trust Greplin with my indexed data? It needs at least partial copies to show result snippets. At least it claims I can delete my index for a service at any time. However, that is a very coarse mechanism. There is no equivalent of robots.txt for my personal data that would let me specify "do not index" or "do not cache" at a granular level.
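As a rough illustration of what granular control might look like, here is a sketch of a per-item rule check. Nothing like this exists in Greplin as far as I know; the rule format, the `noindex` and `nocache` directive names, and the helper functions are entirely hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: (source, glob pattern matched against an item label, directive).
RULES = [
    ("gmail", "*@mybank.example", "noindex"),  # never index mail from my bank
    ("facebook", "*", "nocache"),              # index Facebook, but keep no copies
]


def directives_for(source, label):
    """Return the set of directives that apply to one item."""
    return {
        directive
        for rule_source, pattern, directive in RULES
        if rule_source == source and fnmatchcase(label, pattern)
    }


def may_index(source, label):
    return "noindex" not in directives_for(source, label)


def may_cache(source, label):
    # An item that may not be indexed should not be cached either.
    return not directives_for(source, label) & {"noindex", "nocache"}
```

With these example rules, `may_index("gmail", "alerts@mybank.example")` is false, while Facebook items can be indexed but not cached, so no snippets could be shown for them.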

I have a few invites. If you want to try it out, leave a request in the comments.