Running out of disk space with Elasticsearch and Solr: a solution

We recently did a proof-of-concept project for a customer which ingested log events from various sources into a Kafka → Logstash → Elasticsearch → Kibana stack. This was configured with Ansible and hosted on about a dozen VMs inside the customer’s main network.

For various reasons, resources were tight. One problem we ran into several times was running out of disk space on the Elasticsearch nodes, despite setting up Curator to delete older indexes and increasing the available storage as much as possible. Like most software, Elasticsearch does not always handle this situation gracefully, and we often had to ssh in and manually delete index files to get the system working again.

As a result of this experience, we have written a simple proxy server which can detect when an Elasticsearch or Solr cluster is close to running out of storage, and reject any further updates with a configurable error (503 Service Unavailable would seem to be the most appropriate) until enough space is freed up for indexing to continue. We call this Hara Hachi Bu, after the Confucian teaching to only eat until you are 80% full. It is available on GitHub under the Apache 2.0 license. This is a very early release and we would welcome feedback or contributions. Although we have tested it with Elasticsearch and Solr, it should be adaptable to any data store with a RESTful API.

Technical stuff

The server is implemented using DropWizard (version 0.9.2), a framework we’ve used a lot for its ease of use and configurability. It is intended to sit between an indexer and your search engine (or a similar disk-based data store), and checks the available disk space when certain endpoints are requested. If the free space falls below a configured threshold, the request is rejected with a configurable HTTP status code.
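To make the configuration side concrete, here is a hypothetical sketch of the kind of DropWizard configuration class such a proxy might use; the field names (thresholdMb, rejectStatusCode) are assumptions for illustration and may well differ from the project’s actual configuration:

```java
import com.fasterxml.jackson.annotation.JsonProperty;

import io.dropwizard.Configuration;

public class ProxyConfiguration extends Configuration {

    // Hypothetical field: reject updates when free space drops below this many megabytes.
    @JsonProperty
    private long thresholdMb = 1024;

    // Hypothetical field: HTTP status code returned when a request is rejected, e.g. 503.
    @JsonProperty
    private int rejectStatusCode = 503;

    public long getThresholdMb() {
        return thresholdMb;
    }

    public int getRejectStatusCode() {
        return rejectStatusCode;
    }
}
```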

There are disk space checkers for Elasticsearch (using the /_cluster/stats endpoint), a local Solr installation, or a cluster of hosts. If using a cluster, each machine must regularly post its free disk space to the application. Custom implementations can be added by implementing the DiskSpaceChecker interface.
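The actual interface is defined in the project on GitHub; as a minimal sketch, the contract for a DiskSpaceChecker-style implementation could look like this (the method name is an assumption):

```java
public interface DiskSpaceChecker {

    // True if the backing store still has at least the configured amount of free disk space.
    boolean isSpaceAvailable();
}
```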

The trickiest part of the implementation was allowing DropWizard’s own endpoints through without them being proxied. We did this by implementing both a filter and a servlet – the filter looks out for locally known endpoints and passes them straight through, while unknown endpoints have a /proxy prefix added to the URL path and are then caught by the proxy servlet. The filter also carries out the disk space check on URLs in the check list, allowing them to be rejected before they reach the servlet. (If you’ve come up with a different solution to this problem, we’d be interested to hear about it.)
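Here is a hedged sketch of that filter idea, not the project’s actual code: the path lists, the constructor, and the re-dispatch via a RequestDispatcher are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ProxyRoutingFilter implements Filter {

    // Hypothetical path lists -- the real project configures these differently.
    private static final Set<String> LOCAL_PATHS =
            new HashSet<>(Arrays.asList("/healthcheck", "/setSpace"));
    private static final Set<String> CHECKED_PATHS =
            new HashSet<>(Arrays.asList("/_bulk"));

    private final DiskSpaceChecker checker;
    private final int rejectStatus;

    public ProxyRoutingFilter(DiskSpaceChecker checker, int rejectStatus) {
        this.checker = checker;
        this.rejectStatus = rejectStatus;
    }

    @Override
    public void init(FilterConfig filterConfig) {
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        HttpServletResponse response = (HttpServletResponse) res;
        String path = request.getRequestURI();

        // Locally known DropWizard endpoints are passed straight through.
        if (LOCAL_PATHS.contains(path)) {
            chain.doFilter(req, res);
            return;
        }

        // Requests on the check list are rejected if disk space is too low.
        if (CHECKED_PATHS.contains(path) && !checker.isSpaceAvailable()) {
            response.sendError(rejectStatus, "Insufficient disk space");
            return;
        }

        // Everything else is re-dispatched under /proxy for the proxy servlet.
        request.getRequestDispatcher("/proxy" + path).forward(req, res);
    }

    @Override
    public void destroy() {
    }
}
```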

The proxy was implemented by extending the Jetty ProxyServlet (http://www.eclipse.org/jetty/documentation/current/proxy-servlet.html) – this allowed us to override a single method in order to implement our proxy, stripping off the /proxy prefix and redirecting the request to the configured host and port.
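A sketch of that override is below, assuming a Jetty 9.2-era ProxyServlet (the version DropWizard 0.9.2 ships with), where rewriteURI is the hook; newer Jetty versions use rewriteTarget instead. The hard-coded target host and port stand in for the configured values and are not the project’s actual code.

```java
import java.net.URI;

import javax.servlet.http.HttpServletRequest;

import org.eclipse.jetty.proxy.ProxyServlet;

public class RewritingProxyServlet extends ProxyServlet {

    // Hypothetical hard-coded target; the real proxy reads these from its configuration.
    private final String targetHost = "localhost";
    private final int targetPort = 9200;

    @Override
    protected URI rewriteURI(HttpServletRequest request) {
        // Strip the /proxy prefix added by the filter, then forward the
        // remaining path (and query string, if any) to the backend.
        String path = request.getRequestURI().replaceFirst("^/proxy", "");
        String query = request.getQueryString();
        StringBuilder target = new StringBuilder("http://")
                .append(targetHost).append(':').append(targetPort).append(path);
        if (query != null) {
            target.append('?').append(query);
        }
        return URI.create(target.toString());
    }
}
```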

Internally, the application builds the DiskSpaceChecker defined in the configuration. DropWizard resources (endpoints) and health checks are added depending on the implementation, along with a default, generic health check which simply reports whether or not disk space is currently available. For example, the /setSpace resource is only available when using the clustered configuration.
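For illustration, a generic DropWizard health check along those lines might look like the following sketch, reusing the hypothetical DiskSpaceChecker interface from above:

```java
import com.codahale.metrics.health.HealthCheck;

public class DiskSpaceHealthCheck extends HealthCheck {

    private final DiskSpaceChecker checker;

    public DiskSpaceHealthCheck(DiskSpaceChecker checker) {
        this.checker = checker;
    }

    @Override
    protected Result check() {
        // Report healthy only while the checker still sees enough free space.
        return checker.isSpaceAvailable()
                ? Result.healthy("Disk space available")
                : Result.unhealthy("Disk space below the configured threshold");
    }
}
```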

2 thoughts on “Running out of disk space with Elasticsearch and Solr: a solution”

  1. How is this used in conjunction with low and high watermarks? Couldn’t you just set those to manage your disk space? When the low watermark is breached, new shards are unable to be created and a 503 is thrown IIRC. Is your solution designed to prevent new data being added to existing shards over disk utilization percentage?

    • Hi David, thanks for the comment. We had automatic shard reallocation turned off, because this project used a hot/warm search architecture. Other than that, we had the default values for the watermarks (85% and 90% I believe). In any case, this never threw any 503s, and we kept hitting 100%. The ES documentation is unfortunately not very clear on the precise behaviour when hitting a watermark. – Tom
