We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.
The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, and maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.
The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which is fine for many applications but in this case was not allowing flexibility of weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.
Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.
What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MJ” etc.) The exact fields should provide precision, the phonetic fields, retrieval. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
In practice, this worked very well – the new server can handle search loads 5 times or greater compared with the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!
Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, in less than 2 days of effort.