Most search applications work by indexing a relatively stable collection of documents and then allowing users to perform ad-hoc searches to retrieve relevant documents. However, in some cases it is useful to turn this model on its head, and match individual documents against a collection of saved queries. I shall refer to this model as “streamed search”.
One example of streamed search is in media monitoring. The monitoring agency’s client’s interests are represented by stored queries. Incoming documents (e.g. from the Twitter firehose) are matched against the stored queries, and hits are returned for further processing before being reported to the client. Another example is financial news monitoring, to predict share price movements.
In both these examples, queries may be extremely complex (in order to improve the accuracy of hits). There may be hundreds of thousands of stored queries, and documents to be matched may be incoming at a rate of hundreds or thousands per second. Not surprisingly, streamed search can be a demanding task, and the computing resources required to support it a significant expense. There is therefore a need for the software to be as performant and efficient as possible.
The two leading open source streamed search implementations are Elasticsearch Percolator, and Luwak. Both depend on the Lucene search engine. As the developers of Luwak, we have an interest in how its performance compares with Percolator. We therefore carried out some preliminary testing.
Ideally, we would have used real media monitoring queries and documents. However, these are typically protected by copyright, and the queries represent a fundamental asset of monitoring companies. In order to make the tests distributable, we chose to use freely dowloadable documents from Wikipedia, and to generate random queries. These queries were much simpler in structure than the often deeply nested queries from real applications, but we believe that they still provide a useful comparison.
The tests were carried out on an Amazon EC2 r3.large VM running Ubuntu. We wrote a Python script to download, parse and store random Wikipedia articles, and another to generate random queries from the text. The query generator was designed to be somewhat “realistic”, in that each query should match more than zero documents. For Elasticsearch, we wrote scripts to index queries into the Percolator and then run documents through it. Since Luwak has a Java API (rather than Elasticsearch’s RESTful API), we wrote a minimal Java app to do the same.
10,000 documents were downloaded from Wikipedia, and 100,000 queries generated for each test. We generated four types of query:
- Boolean with 10 required terms and 2 excluded terms
- Boolean with 100 required terms and 20 excluded terms
- 20 required wildcard terms, with a prefix of 4 characters
- 2-term phrase query with slop of 5
We ran the tests independently, giving Luwak and Elasticsearch a JVM heap size of 8GB, and doing an initial pre-run in order to warm the OS cache (this did not actually have a noticable effect). For sanity, we checked that each document matched the same queries in both Luwak and Percolator.
The results are shown in the graphs below, where the y-axis represents average documents processed per second.
Luwak was consistently faster than Percolator, ranging from a factor of 6 (for the phrase query type) to 40 (for the large Boolean queries).
The reason for this is almost certainly due to Luwak’s presearcher. When a query is added to Luwak, the library generates terms to index the query. For each incoming document, a secondary query is constructed and run against the query index, which returns a subset of the entire query set. Each of these is then run against the document in order to generate the final results. The effect of this is to reduce the number of primary queries which have to be executed against the document, often by a considerable factor (at a relatively small cost of executing the secondary query). Percolator does not have this feature, and by default matches every primary query against every document (it would be possible, but not straightforward, for an application to implement a presearch phase in Percolator). Supporting this analysis, when the Luwak presearcher was disabled its performance dropped to about the same level as Percolator.
These results must be treated with a degree of caution, for several reasons. As already explained, the queries used were randomly generated, and far simpler in structure than typical hand-crafted monitoring queries. Furthermore, the tests were single-threaded and single-sharded, whereas a multithreaded, multi-shard, distributed architecture would be typical for a real-world system. Finally, Elasticsearch Percolator is a service providing a high-level, RESTful API, while Luwak is much lower level, and would require significantly more application-level code to be implemented for a real installation.
However, since both Luwak and Percolator use the same underlying search technology, it is reasonable to conclude that the Luwak presearcher can give it a considerable performance advantage over Percolator.
If you are already using Percolator, should you change? If performance is not a problem now and is unlikely to become a problem, then the effort required is unlikely to be worth it. Luwak is not a drop-in replacement for Percolator. However, if you are planning a data-intensive streaming search system, it would be worth comparing the two. Luwak works well with existing high-performance distributed computing frameworks, which would enable applications using it to scale to very large query sets and document streams.
Our test scripts are available here. We would welcome any attempts to replicate, extend or contest our results.