Principles of Solr application design – part 2 of 2
We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two-part series – if you have any feedback we’d welcome comments! Here’s the second part; you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% of the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate: the OS fills its disk cache as blocks are read, so some minutes of searches under load may be required to warm it. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather than opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which, as already described, is a crucial resource for search performance).
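For reference, the directory implementation is chosen by the directoryFactory setting in solrconfig.xml. The stock configuration uses solr.NRTCachingDirectoryFactory, which on a 64-bit JVM wraps the same memory-mapped implementation, so there is normally nothing to do; a sketch of pinning MMapDirectory explicitly, should you ever need to be certain, looks like this:

<!-- solrconfig.xml: force memory-mapped index access explicitly -->
<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>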

11. Tune the Solr caches

The OS disk cache improves performance at a low level. At a higher level, Solr has a number of built-in caches stored in the JVM heap, which can improve performance still further: the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method. Each entry in the filter cache takes up (number of docs on shard / 8) bytes of space, so a cache limit of 4,000 entries will require (numDocs * 500) bytes to hold all of them; for example, 10 million documents per shard means about 1.25MB per entry, or roughly 5GB for a full cache. However, tuning all of these caches has the potential to improve performance.

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.
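For reference, here is roughly what these settings look like in solrconfig.xml; the sizes shown are illustrative starting points rather than recommendations, and should be tuned against the statistics described above:

<!-- solrconfig.xml: illustrative cache settings -->
<filterCache class="solr.FastLRUCache" size="4000" initialSize="1000" autowarmCount="500"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
<documentCache class="solr.LRUCache" size="1024" initialSize="1024"/>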

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.

See here for more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.
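For the Jetty-based example distribution of Solr 4.x, the heap limits are simply passed to the JVM on startup; the values here are purely illustrative:

java -Xms512m -Xmx2g -jar start.jar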

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting multiple languages in an index with per-language term processing is outlined below. Note that it depends on knowing in advance what language each section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_fr" type="text_fr" indexed="true" stored="true"/>
<field name="content_jp" type="text_jp" indexed="true" stored="true"/>

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.
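To illustrate what lies behind one of these fieldtypes, here is a cut-down version of the text_en analysis chain along the lines of the one shipped in the Solr 4.x example schema.xml (the exact filters vary between versions); stopwords and stemming are per-language, which is why each language needs its own field:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>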

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qf">content_en content_fr content_jp</str>
    <str name="pf">content_en content_fr content_jp</str>
    ...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.
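One caveat, which the elided defaults above may already cover: with a plain solr.SearchHandler, the qf and pf parameters only take effect once the eDisMax parser is actually selected, either by adding

<str name="defType">edismax</str>

to the same defaults block, or by passing defType=edismax on each request.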

Principles of Solr application design – part 1 of 2
We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two-part series – if you have any feedback we’d welcome comments! So without further ado, here’s the first part:

1. Use the latest release of Solr

Unless there are compelling reasons not to, such as reliance on a discontinued feature (which is rare), it is best to use the latest release of Solr, downloaded from http://lucene.apache.org/solr/. Every minor release in the 4.x series has brought both functional and performance enhancements, and revision releases have fixed known bugs. Since the API (as a rule) remains backwards compatible, the potential gains in performance and utility should outweigh the minor inconvenience of the upgrade.

2. Use SolrCloud for scaling and robustness

Before the Solr 4 release, support for sharding (distributing a single search over many Solr instances) and replication (for robustness and scaling search load) involved a significant amount of manual configuration and development. The introduction of SolrCloud means that sharding and replication are now built into the core product, and can be used with simple configuration and no extra coding.

For trivial applications, SolrCloud may not be required, but it is the simplest way to build in robustness and scalability. There’s more about SolrCloud here.
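Once a SolrCloud cluster is up, sharding and replication are requested at collection-creation time through the Collections API. A hypothetical example (the collection name and counts are illustrative):

http://localhost:8983/solr/admin/collections?action=CREATE&name=documents&numShards=2&replicationFactor=2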

3. Don’t expose the Solr API

Although Solr is not inherently insecure, it is not designed to be exposed to end-users (and emphatically not to the internet at large). Anyone with access to the root Solr endpoint could delete indexes and modify or insert items at will. Restricting external access to search handlers only (e.g. /solr/select) avoids this possibility, but is nonetheless a bad idea, since it may allow users to construct arbitrary queries which could degrade performance or provide access to unauthorised data. Furthermore, there remains the slim possibility of security holes in the Solr API.

For these reasons, any external access to search should be through a proxy interface which is restricted to the functionality required by the application. Access to the Solr API should be restricted by network design and/or firewalls. This applies equally to AJAX UIs, which should talk to Solr via an intermediary web application rather than directly.

The intermediary code should perform at least some basic validation of parameters before sending them to Solr, for example checking their type and ensuring that query strings are under a certain length (depending on the search interface). This allows attempts at compromising the system to be detected at an early stage and blocked.
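As a sketch of the sort of checks we mean, here is a hypothetical validation function for the intermediary layer (Python, with names and limits purely illustrative; a real application will have its own rules):

# Basic checks before a user query string is forwarded to Solr.
MAX_QUERY_LENGTH = 200  # illustrative limit

def validate_query(q):
    if not q or not q.strip():
        raise ValueError("empty query")
    if len(q) > MAX_QUERY_LENGTH:
        raise ValueError("query too long")
    # Reject syntax the application never needs, e.g. Solr local
    # parameters ({!...}), which could switch query parsers.
    if "{!" in q:
        raise ValueError("disallowed query syntax")
    return q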

4. Don’t use third-party Solr client libraries

The problem with third-party client libraries is that they create a tight coupling between the application and Solr. The Solr XML and JSON APIs are simple, and a wide range of client libraries for these formats are readily available for most programming languages. Third-party libraries are an unnecessary additional dependency and a potential source of bugs and unexpected behaviour. Another risk is that development may be discontinued for various reasons, meaning that future Solr features are not easily accessible.

The one exception to this rule is the SolrJ Java client library, since it is part of the general Solr release and is therefore fully compliant with and tested against the corresponding version of Solr.
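For completeness, a minimal SolrJ sketch against the 4.x API (the URL and query string are placeholders for illustration only):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // HttpSolrServer was the standard client in the 4.x series
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        QueryResponse response = solr.query(new SolrQuery("example"));
        System.out.println("Found " + response.getResults().getNumFound() + " documents");
        solr.shutdown();
    }
}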

5. Specify interfaces

All interfaces between components in the application must be agreed between sys ops and developers before development is started. Interfaces should be treated as contracts which software components adhere to. Early documentation of interfaces will reduce the risk of unexpected dependencies leading to problems in deployment.

As far as possible, interfaces should be RESTful web APIs and use standard formats such as JSON and XML. This creates loose coupling between components and also makes it easy to test functionality from the command line or a browser.

6. Put apps live early, on isolated systems

Development should be iterative, with short development cycles (no more than a few weeks). Code should be tested and deployed at the end of each cycle. By using isolated systems, fake data and/or limiting access to authorised testers, functionality and performance may be tested as soon as possible on a ‘live’ system, avoiding the risk of unexpected problems if deployment were postponed until the end of the whole project.

7. Do realistic performance tests early and often

Except for very small indexes, search performance is often unpredictable, particularly under load. To ensure that performance meets requirements, testing a full index under load with realistic queries should be scheduled as early as possible in development. If you don’t have the data available to create a full index, simulate it (e.g. using freely available text such as Wikipedia).

As new functions (e.g. facets) are added, performance characteristics may change significantly, so it is important that performance tests are part of every development cycle. JMeter is a popular tool for load testing; alternatively, test scripts can easily be written in a language like Python.
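As an indication of how simple such a script can be, here is a hypothetical Python 3 sketch that times a list of queries against a Solr server (the URL and queries are placeholders, and a realistic load test would also run many concurrent clients):

import json
import time
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr/collection1/select"  # placeholder
QUERIES = ["example", "another query"]  # replace with realistic query logs

for q in QUERIES:
    url = SOLR_URL + "?wt=json&q=" + urllib.parse.quote(q)
    start = time.time()
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    elapsed = (time.time() - start) * 1000
    print("%-30s %7.1f ms  %d hits" % (q, elapsed, data["response"]["numFound"]))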

More to come next week!
