There was a time (a long time ago, but I remember) when adding full-text, semi-real-time search to an application was a daunting task — something only attempted by the likes of AltaVista. The Apache Lucene project brought this capability to the masses. I had a copy of Lucene in Action in hand soon after it was released in 2004, and it served me well as my first guide to Lucene development. Fast-forward a few years to 2007: we're plotting the addition of search to our newly-minted "Ganja" publishing system. Time to dust off that book!
I've been asked, "Why didn't you just use Solr?" We tried this first, actually. At the heart of our current system is a JSON-over-HTTP API. We'd like search results to be returned in exactly the same way as other requests. Either Solr has to store enough data for the expected API response or an additional HTTP request must be made from the application to Solr to obtain the ids of matching posts. We tried the first option, but keeping Solr synchronized with new data proved to be impossible.
It made a lot of sense to integrate Lucene-driven search with our existing API service. Here's how it works.
- New and modified posts are periodically selected from the datastore. New posts are selected at a much higher frequency than modified posts. This way, bulk modifications to old posts don't delay the visibility of new content.
- Selectors put posts into a concurrent queue. It happens to be a LinkedBlockingQueue with a configurable size. Why not use a PriorityBlockingQueue to eliminate having both new and modified selectors? Because a PriorityBlockingQueue is unbounded: we really want the size to have an upper bound so "backpressure" is applied to selection if indexing falls behind.
- One or more indexer threads poll the queue for posts and update the index with documents that have a single stored field: the post id. (I'd like to explore IndexWriter concurrency in a future post. Does it even make sense to have more than one thread writing? Let's find out.)
- The "index housekeeper" periodically refreshes the searcher and occasionally performs a full commit to disk. In early versions of Lucene, indexed documents weren't visible until they were committed. This limited how "real-time" an index could be. Modern versions of Lucene support "near-real-time" search: updates happen first in-memory and are made visible by a quick "searcher refresh" call.
- The application makes queries and is rewarded (hopefully) with the ids of indexed documents. Lucene's internal ids are volatile; there's no immutable mapping between the internal id and the stored post id. Lucene provides a method to fetch a document value for an internal id, but it is random-access and potentially a performance bottleneck. (I haven't tested, but I suspect this is actually a non-issue with SSDs.) Lucene 4 introduces column-stride fields (DocValues) as a solution to the random-access problem. I hope to publish some tests related to this in a later post.
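The selector/indexer hand-off can be sketched with plain `java.util.concurrent` primitives. This is a minimal illustration, not the actual Ganja code; the class and method names are hypothetical:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the bounded queue between selectors and indexers.
// put() blocks when the queue is full, so a slow indexer applies
// backpressure to selection instead of letting the queue grow without bound.
class PostQueue {
    private final BlockingQueue<Long> queue;

    PostQueue(int capacity) {
        // A bounded LinkedBlockingQueue. A PriorityBlockingQueue could merge
        // the new/modified selectors into one, but it is unbounded, so it
        // can't push back on selection when indexing falls behind.
        this.queue = new LinkedBlockingQueue<>(capacity);
    }

    // Called by the selector threads.
    void enqueue(long postId) throws InterruptedException {
        queue.put(postId); // blocks while the queue is full -> backpressure
    }

    // Called by the indexer threads; returns null on timeout, giving the
    // indexer a chance to do other work between bursts of posts.
    Long nextPost(long timeoutMillis) throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    int size() {
        return queue.size();
    }
}
```

The key design point is the blocking `put()`: if indexing stalls, the selectors simply stop making progress rather than flooding memory with pending posts.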
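The indexer, housekeeper, and query steps above can be sketched as a single class, assuming Lucene 8.x APIs. The field names, the in-memory directory, and the class itself are illustrative assumptions, not the production setup:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;

// Sketch of the indexer + housekeeper + query path. The only stored
// field is the post id; the body is indexed but not stored.
class PostIndex implements AutoCloseable {
    private final IndexWriter writer;
    private final SearcherManager searchers;

    PostIndex() throws IOException {
        // In-memory directory for illustration; a real index lives on disk.
        writer = new IndexWriter(new ByteBuffersDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()));
        // SearcherManager handles near-real-time searcher reopening.
        searchers = new SearcherManager(writer, new SearcherFactory());
    }

    // Indexer thread: updateDocument() replaces any existing document
    // with the same post id, so re-indexing a modified post is safe.
    void index(long postId, String body) throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", Long.toString(postId), Field.Store.YES));
        doc.add(new TextField("body", body, Field.Store.NO));
        writer.updateDocument(new Term("id", Long.toString(postId)), doc);
    }

    // Housekeeper: a cheap in-memory refresh makes recent updates visible;
    // a full writer.commit() is only needed occasionally, for durability.
    void refresh() throws IOException {
        searchers.maybeRefresh();
    }

    // Query path: map volatile internal doc ids back to stored post ids.
    List<Long> search(String term, int n) throws IOException {
        List<Long> ids = new ArrayList<>();
        IndexSearcher s = searchers.acquire();
        try {
            TopDocs top = s.search(new TermQuery(new Term("body", term)), n);
            for (ScoreDoc sd : top.scoreDocs) {
                // Random-access stored-field lookup; DocValues would be
                // the column-stride alternative mentioned above.
                ids.add(Long.parseLong(s.doc(sd.doc).get("id")));
            }
        } finally {
            searchers.release(s);
        }
        return ids;
    }

    @Override
    public void close() throws IOException {
        searchers.close();
        writer.close();
    }
}
```

Note that a document indexed via `index()` is not visible to `search()` until `refresh()` runs — exactly the near-real-time behavior the housekeeper provides.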
This basic system has been in place through three back-end design iterations, mostly unchanged. With far more powerful hardware and SSDs, the index has grown from around a million posts to 52 million today.