According to the blog post, the new search system was designed to handle over 1,000 tweets per second (TPS) and 12,000 queries per second (QPS), which works out to over 1 billion queries per day. Beyond the challenging query volume, the data needs to be available quickly: a tweet has to be searchable in less than 10 seconds.
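As a quick sanity check on those numbers, 12,000 QPS sustained over a day does come out to roughly a billion queries. A minimal sketch of the arithmetic (class and method names are mine, purely illustrative):

```java
// Back-of-the-envelope check of the throughput figures quoted from the blog:
// 12,000 queries/sec sustained for 24 hours is just over 1 billion queries.
public class ThroughputMath {
    // seconds per day = 24 * 60 * 60 = 86,400
    public static long queriesPerDay(long qps) {
        return qps * 24 * 60 * 60;
    }

    public static void main(String[] args) {
        // 12,000 * 86,400 = 1,036,800,000
        System.out.println(queriesPerDay(12_000));
    }
}
```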
They turned to Lucene, a popular open source search library. To meet their latency and query-serving requirements, they needed to make extensive modifications to Lucene's core. From their post:
That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:

- significantly improved garbage collection performance
- lock-free data structures and algorithms
- posting lists that are traversable in reverse order
- efficient early query termination

Their post says that these contributions will be rolled into Lucene and the Lucene realtime branch. It is a tantalizing overview, and I would really like to see pointers to the details (e.g. JIRA issues).
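The last two items in the list above fit together naturally: if a posting list can be read newest-first, a query for recent tweets can stop as soon as it has enough hits instead of scanning the whole list. Here is a minimal sketch of that idea; the class and method names are mine and this is not Twitter's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: documents are appended in arrival order
// (oldest -> newest), but queries walk the list in reverse so the
// newest documents come first and the scan can terminate early.
public class ReversePostingList {
    private final List<Integer> docIds = new ArrayList<>();

    // Append a document id; ids arrive in chronological order.
    public void add(int docId) {
        docIds.add(docId);
    }

    // Traverse newest-first and stop once maxHits are collected,
    // skipping the older tail of the list entirely (early termination).
    public List<Integer> newestFirst(int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (int i = docIds.size() - 1; i >= 0 && hits.size() < maxHits; i--) {
            hits.add(docIds.get(i));
        }
        return hits;
    }
}
```

For example, after adding ids 1 through 5 in order, `newestFirst(3)` returns `[5, 4, 3]` and never touches ids 2 and 1. The real system also has to deal with concurrent reads while a writer appends, which is presumably where the lock-free structures come in; that concern is omitted here.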
The main benefit to users is that the new system is much more scalable: it can support an index twice as large as previous versions, which means you can search for tweets further back in time.
An interesting parallel is that LinkedIn recently released LinkedIn Signal, a mashup of Twitter data with LinkedIn's social network data for professionals. For details on how that system works, see their Signal: Under the Hood post. One of its key components is the Zoie real-time search system, built on top of Lucene.