Luwak 1.3.0 released (17 November 2015)

The latest version of Luwak, our open-source streaming query engine, has been released on the Sonatype Nexus repository and will be making its way to Maven Central in the next few hours.  Here’s a summary of the new features and improvements we’ve made:

Batch processing

Inspired by a question raised during our talk at FOSDEM last February, you can now stream documents through the Luwak Monitor in batches, as well as one at a time. This will generally improve your throughput, at the cost of increased latency. For example, local benchmarking against a set of 10,000 queries showed an improvement from 10 documents/second to 30 documents/second when the batch size was increased from 1 document to 30 documents; however, processing latency went from ~100ms for the single document to 10 seconds for the larger batch. You’ll need to experiment with batch sizes to find the right balance for your own use.
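To make this concrete, here is a rough sketch of what batch matching looks like in code. The Monitor, query parser, presearcher and InputDocument classes are all part of the public Luwak API, but the batch construction shown (DocumentBatch.of) is an assumption based on later releases, so do check the 1.3.0 javadocs for the exact method; the field names and queries are purely illustrative.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;

import uk.co.flax.luwak.DocumentBatch;
import uk.co.flax.luwak.InputDocument;
import uk.co.flax.luwak.Matches;
import uk.co.flax.luwak.Monitor;
import uk.co.flax.luwak.MonitorQuery;
import uk.co.flax.luwak.QueryMatch;
import uk.co.flax.luwak.matchers.SimpleMatcher;
import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

public class BatchMatchingSketch {

    public static void main(String[] args) throws Exception {
        // A Monitor with a simple Lucene query parser and the default term-filtered presearcher
        Monitor monitor = new Monitor(new LuceneQueryParser("text"), new TermFilteredPresearcher());
        monitor.update(new MonitorQuery("query1", "text:luwak"));

        // Two documents to be matched together as a single batch
        InputDocument doc1 = InputDocument.builder("doc1")
                .addField("text", "luwak is a streaming query engine", new StandardAnalyzer())
                .build();
        InputDocument doc2 = InputDocument.builder("doc2")
                .addField("text", "nothing of interest here", new StandardAnalyzer())
                .build();

        // Match the whole batch in one call. DocumentBatch.of(...) is assumed here;
        // the exact factory method may differ in 1.3.0, so check the javadocs.
        Matches<QueryMatch> matches = monitor.match(DocumentBatch.of(doc1, doc2), SimpleMatcher.FACTORY);

        // Inspect 'matches' for the queries that matched each document
        System.out.println(matches);
    }
}
```

In practice you would tune the number of documents per batch against the latency and throughput trade-off described above.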

Presearcher performance improvements

Luwak speeds up document matching by filtering out queries that we can detect won’t match a given document or batch, a process we call presearching. Profiling revealed that creating the presearcher query was a serious performance bottleneck, particularly for presearchers using the WildcardNGramPresearcherComponent, so this has been largely rewritten in 1.3.0. We’ve seen improvements of up to 400% in query build times after this rewrite.
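For reference, the wildcard-aware presearcher is plugged in when the Monitor is constructed. A minimal sketch follows (the field name is illustrative; the class and constructor names follow the public Luwak API as we understand it):

```java
import java.io.IOException;

import uk.co.flax.luwak.Monitor;
import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
import uk.co.flax.luwak.presearcher.WildcardNGramPresearcherComponent;
import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

public class PresearcherSetup {

    // Build a Monitor whose presearcher extracts ngrams from wildcard terms,
    // so that queries containing wildcards can still be filtered out before matching.
    public static Monitor createMonitor() throws IOException {
        return new Monitor(
                new LuceneQueryParser("text"),
                new TermFilteredPresearcher(new WildcardNGramPresearcherComponent()));
    }
}
```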

Concurrent query loading

Luwak now ships with a ConcurrentQueryLoader helper class to speed up Monitor startup. The loader uses multiple threads to add queries to the index, allowing you to make use of all your CPUs when parsing and analyzing queries. Note that this requires your MonitorQueryParser implementations to be thread-safe!
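Used in a loading job, it might look something like the sketch below. We’ve assumed the loader lives in the util package and can be used in a try-with-resources block; check the 1.3.0 javadocs for the exact package, constructors and shutdown behaviour.

```java
import java.util.List;

import uk.co.flax.luwak.Monitor;
import uk.co.flax.luwak.MonitorQuery;
// Package location assumed -- check the 1.3.0 javadocs for where the class actually lives
import uk.co.flax.luwak.util.ConcurrentQueryLoader;

public class QueryLoading {

    // Load a large set of queries into the Monitor using several threads.
    // The MonitorQueryParser used by the Monitor must be thread-safe.
    public static void loadQueries(Monitor monitor, List<MonitorQuery> queries) throws Exception {
        try (ConcurrentQueryLoader loader = new ConcurrentQueryLoader(monitor)) {
            for (MonitorQuery mq : queries) {
                loader.add(mq);
            }
        } // closing the loader is assumed to block until all queries have been indexed
    }
}
```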

Easier configuration and state monitoring

In 1.2.0 and earlier, clients had to extend the Monitor itself in order to configure the internal query caches or get state update information. Configuration has now been extracted into a QueryIndexConfiguration class, passed to the Monitor at construction, and you can get notified about updates to the query index by registering QueryIndexUpdateListeners.

For more information, see the CHANGES file for 1.3.0. We’ll also be re-running our comparison with the Elasticsearch Percolator soon, as it too has been improved as part of Elasticsearch’s recent 2.0 release.

Faster bulk indexing in Elasticsearch (28 September 2015)

We recently did some work for Arachnys, who provide data on a wide range of emerging markets. Their data, gathered by a complex process of web crawling, is stored in HBase and served out of a 10-node Elasticsearch cluster. Periodically, better ways of extracting data from the raw crawl are implemented, and the entire dataset then needs to be reprocessed and reindexed. Data was being extracted from HBase using a map scan, and written to Elasticsearch using the elasticsearch-hadoop plugin (maintained by Elastic). Flax was called in when it was discovered that with their existing configuration it would take about a month to reindex all of the data (over 1 billion documents, around 73 terabytes on disk). Indexing was also throwing large numbers of errors and progressing at only about 700 documents per second.

Running some local tests with similar data showed Elasticsearch indexing about 400 documents per second on a single node – a 10-node cluster on dedicated server hardware ought to be doing a lot better than 700 documents per second!

Our first step was to look at the errors being produced. Most were simple timeouts, but some included a ‘too many open files’ error message. This normally happens when the server Elasticsearch is running on has its ulimit set too low, but in this case the ulimit was set at 65535 file handles, which should be more than enough for a server with a single index.

To investigate this further we used the very nice Segment Spy plugin, which shows you the structure of Elasticsearch’s underlying Lucene index. A Lucene index is divided into multiple immutable segments, and a normal index will probably have around 30 segments of varying sizes. Segment Spy revealed that our index contained several thousand very small segments. Each open segment can use between 10 and 15 file handles, which would explain our ‘too many open files’ error.

To understand how this could happen we need to consider how Lucene manages incremental indexing. As documents are added to a Lucene index, they are buffered in memory until a flush is triggered (either explicitly, or by exceeding memory or document-count limits), at which point they are written out as a new segment. Lucene also allows concurrent indexing: your indexing process can have multiple threads, each with its own document writer, and when a writer’s buffer is filled it is swapped out for a new, empty writer while the filled one is flushed to disk.
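These flush triggers correspond to settings on Lucene’s IndexWriterConfig. Here’s a small sketch using the Lucene 5.x API of the time; the index path and the limits are illustrative only:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class WriterSetup {

    public static IndexWriter openWriter() throws IOException {
        Directory dir = FSDirectory.open(Paths.get("/tmp/index"));   // illustrative path

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // Flush a new segment once the in-memory buffer reaches 256MB...
        config.setRAMBufferSizeMB(256.0);
        // ...or once this many documents have been buffered, whichever comes first.
        config.setMaxBufferedDocs(100_000);

        return new IndexWriter(dir, config);
    }
}
```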

Once a segment has been written another process will check the state of the index as a whole and decide if certain segments can be merged together. This merging process usually happens concurrently (in a background thread) and should prevent the scenario we were seeing above by ensuring that multiple small segments are quickly merged together into larger ones.

Merging can be I/O intensive, so Elasticsearch has a feature called merge throttling that tries to ensure that enough system resources remain available to serve queries. In our case, it seems that the throttling meant that merging simply couldn’t keep up with the number of indexing threads.
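On the Elasticsearch 1.x cluster in question, store-level merge throttling can be switched off with a transient cluster settings update, along these lines. This is a sketch for the 1.x Java client; later Elasticsearch releases manage merge I/O differently, so check the documentation for your version.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class MergeThrottling {

    // Switch off store-level merge throttling for the duration of the reindex.
    // The setting name applies to Elasticsearch 1.x; later releases handle
    // merge I/O throttling differently.
    public static void disable(Client client) {
        client.admin().cluster().prepareUpdateSettings()
                .setTransientSettings(ImmutableSettings.settingsBuilder()
                        .put("indices.store.throttle.type", "none")
                        .build())
                .get();
    }
}
```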

There were two overlapping issues here: too much concurrent indexing, and merges not keeping up. The HBase scan was divided into 256 separate mappers, and as no use was being made of document routing, each mapper job would end up sending documents to each node. This level of concurrency was clearly too high.

By reducing the number of map jobs to 10 and switching off merge throttling, the indexing rate jumped to approximately 4000 documents per second. This was good – but we wanted to see if we could improve it still further! There was still a fair amount of idle CPU time across the cluster, suggesting that further gains in document throughput were available.

The elasticsearch-hadoop plugin is designed to be very small and self-contained, with as few dependencies as possible. It uses a single-threaded model for indexing, batching records up, sending them in bulk and then blocking until Elasticsearch responds. We replaced this with our own Hadoop OutputFormat implementation using an Elasticsearch TransportClient and BulkProcessor. The BulkProcessor uses a threaded model that sends bulk requests to Elasticsearch using a configurable number of background threads, meaning that there is little or no time spent waiting for responses. This makes error handling a little trickier (although doable), but we found that it doubled indexing speed again, and this time CPU was pretty much maxed out across the cluster.
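The core of the replacement looked roughly like the sketch below: a BulkProcessor with several concurrent bulk requests in flight, fed by the OutputFormat’s record writer. The class name, batch sizes and thread counts here are illustrative, not the exact values we used.

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

public class BulkIndexer {

    private final BulkProcessor processor;

    public BulkIndexer(Client client) {
        this.processor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // e.g. log request.numberOfActions()
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                if (response.hasFailures()) {
                    // collect failed items here for retry or reporting
                }
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // the whole bulk request failed, e.g. the node was unreachable
            }
        })
                .setBulkActions(5000)                                // flush after this many documents...
                .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // ...or this much data
                .setConcurrentRequests(4)                            // background bulk requests in flight
                .build();
    }

    // Called once per record by the custom OutputFormat's record writer
    public void index(String index, String type, String id, String json) {
        processor.add(new IndexRequest(index, type, id).source(json));
    }

    public void close() {
        // flushes any remaining buffered requests and shuts the processor down
        processor.close();
    }
}
```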

Turning off replication gave an added boost, and our final tests showed a reindex time of approximately 17 hours for plain indexing – a marked improvement on the initial 30-day turnaround time!
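Dropping replicas for the duration of the load is a single index settings change; with the 1.x Java client it looks roughly like this sketch, with the normal replica count restored once indexing has finished:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class ReplicaSettings {

    // Set the replica count for an index: 0 while bulk indexing,
    // then back to the usual value (e.g. 1) afterwards so the
    // replicas are rebuilt from the freshly indexed primaries.
    public static void setReplicas(Client client, String index, int replicas) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.number_of_replicas", replicas)
                        .build())
                .get();
    }
}
```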
