Thursday, October 7

Twitter Launches Lucene Real-Time Search Architecture

The Twitter Search engineering team announced that they launched a New Twitter search Architecture. The previous system was based on the original Summize search system that Twitter acquired in 2008. The old technology was a MySQL-based system that became difficult to scale. I'm really amazed that they were able to make a MySQL search work for this long.

According to the blog, the new search system was designed to handle over 1,000 TPS (Tweets/sec) and 12,000 QPS (queries/sec) = over 1 billion queries per day . Besides the challenging query volume, the data needs to be available quickly, a tweet needed to be searchable in less than 10 seconds.

They turned to Lucene, a popular open source search library. To meet their latency and query serving requirements they needed to make extensive modifications to Lucene's core,
That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. Some of the highlights of our changes include:
  • significantly improved garbage collection performance
  • lock-free data structures and algorithms
  • posting lists, that are traversable in reverse order
  • efficient early query termination
Their post says that these contributions will be rolled into Lucene and the Lucene realtime branch. It is a tantalizing overview and I would really like to see pointers to the details (e.g. JIRA issues).

The main benefit to users is that the new system is much more scalable and can support an index that is twice as large as previous versions which means that you can search for tweets further back in time.

An interesting parallel is that LinkedIn recently released LinkedIn Signal, which is a mashup of Twitter data with LinkedIn social network data for professionals. For details on how that system works, see their Signal Under the Hood post. One of the key components is the Zoie real-time search system built on top of Lucene.