Something revolutionary has been happening recently in the UK government with regard to open source software, standards and data. Change has been promised before and some commentators have been (entirely correctly) cynical about the eventual result, but it seems that finally we have some concrete results. Not content with a public policy and procurement toolkit for open source software, the Cabinet Office today released a policy on open standards – and contrary to what many had feared, they have got it right.
Why do open standards matter? Anyone who has attempted to open a Word document of recent vintage in an older version of the same software will know how painful it can be. In the world of search we often have to be creative in how we extract data from proprietary, badly documented and inconsistent formats (get thee behind me, PDF!) – at Flax we came up with a novel method involving a combination of Microsoft’s IFilters and running Open Office as a server (you can download our Flax Filters as open source if you’d like to see how this works). If all else fails it is sometimes possible to extract strings from the raw binary data. However, we generally don’t have to preserve paragraphs, formatting and other specifics – and that is the kind of fine detail that often matters, especially in the government or legal arena. Certain companies have been downright obstructive in how they define their ‘standards’ (and I use that word extremely loosely in this case). The same companies have been accused by many of trying to influence the Cabinet Office consultation process by introducing the badly defined FRAND concept. However, the consultation process has been carefully and correctly run and the eventual policy is clear and well written.
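For the curious, the last-resort approach mentioned above (pulling strings out of raw binary data) can be sketched in a few lines of Python. This is only an illustration and not the code in Flax Filters: the function name `extract_strings` and the minimum run length of four characters are assumptions made for the example, and it only looks for printable ASCII, so text stored in other encodings would be missed.

```python
import re
from typing import List


def extract_strings(data: bytes, min_length: int = 4) -> List[str]:
    """Return runs of printable ASCII that are at least min_length bytes long.

    Hypothetical helper for illustration: similar in spirit to the Unix
    `strings` tool, with no knowledge of the underlying file format.
    """
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    sample = b"\x00\x01Hello world\x00\xff\xfeab\x00Flax search\x02"
    for text in extract_strings(sample):
        print(text)
```

Note how everything that gives a document its structure (paragraphs, formatting, ordering) is lost, which is exactly why this is no substitute for a properly documented open format.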
It will be very interesting to see how commercial closed source companies react to this policy – but in the meantime those of us in the open source camp should be cheered by the news that finally, after many false starts and setbacks, ‘open’ really does mean, well, ‘open’.
The exciting GOV.UK project is getting close to its first release date of October 17th and we were asked by them to help with some search tuning as they migrate from Apache Solr to elasticsearch. Although elasticsearch has some great features there are still some areas where it lags Solr, such as the lack of spelling suggestion and proximity boost features. Alan from Flax spent a couple of days working with the GDS team and has blogged about how proximity boosting in particular can be implemented – at least for terms that are relatively close to each other rather than being separated by a page or so.
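As a rough idea of what a proximity boost can look like, here is a sketch in Python that builds an elasticsearch query body: every term must match, and an optional phrase clause with some ‘slop’ adds extra score when the terms appear near each other. This is not necessarily what Alan implemented for GOV.UK; the function name, the `content` field, and the slop and boost values are all assumptions chosen for the example.

```python
from typing import Any, Dict


def proximity_boost_query(text: str, field: str = "content",
                          slop: int = 10, boost: float = 2.0) -> Dict[str, Any]:
    """Build a query body that rewards documents whose terms sit close together.

    Hypothetical helper: the field name, slop and boost are illustrative only.
    """
    return {
        "query": {
            "bool": {
                # Documents must match the query terms to be returned at all.
                "must": [{"match": {field: {"query": text}}}],
                # Optional clause: adds to the score only when the terms
                # occur within `slop` positions of each other.
                "should": [
                    {
                        "match_phrase": {
                            field: {"query": text, "slop": slop, "boost": boost}
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    import json
    print(json.dumps(proximity_boost_query("council tax"), indent=2))
```

Because the phrase clause only fires within the slop distance, terms separated by a page or so get no extra credit, which matches the limitation described above.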
If you’re interested in more details of how we fixed this and a few other elasticsearch issues, you may want to take a look at the code we worked on – one of the best things about working with the GOV.UK team is that it was already up as open source software within a day (yes, you read that right – code paid for by the taxpayer is open source, as it should be!). We’re looking forward to launch day!
Update: changed ‘proximity search’ to ‘proximity boost’ – thanks Alan!
It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.
When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.
So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.
We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.
The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, and maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.
The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which is fine for many applications but in this case was not allowing flexibility of weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.
Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.
What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MRJ”). The exact fields should provide precision; the phonetic fields, retrieval. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
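To make the two-level idea concrete, here is a toy sketch in plain Python rather than Solr configuration. It is an illustration only: `crude_phonetic_key` is a deliberately simple stand-in and is not Double Metaphone, the stemming step is left out, and the weights of 3.0 and 1.0 are invented for the example rather than taken from the real eDisMax setup.

```python
import re
from typing import Dict, List, Set, Tuple


def exact_key(term: str) -> str:
    """Normalise for the 'exact' field (lowercasing only; no stemming here)."""
    return term.lower()


def crude_phonetic_key(term: str) -> str:
    """Stand-in phonetic key, NOT Double Metaphone: keep the first letter,
    drop later vowels (plus h, w, y) and collapse repeated letters."""
    term = term.lower()
    if not term:
        return ""
    kept = [term[0]]
    for ch in term[1:]:
        if ch in "aeiouhwy":
            continue
        if ch != kept[-1]:
            kept.append(ch)
    return "".join(kept).upper()


def build_index(docs: Dict[str, str]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Index every term twice, much as copyField feeds two differently analysed fields."""
    exact: Dict[str, Set[str]] = {}
    phonetic: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            exact.setdefault(exact_key(term), set()).add(doc_id)
            phonetic.setdefault(crude_phonetic_key(term), set()).add(doc_id)
    return exact, phonetic


def search(query: str, exact: Dict[str, Set[str]], phonetic: Dict[str, Set[str]],
           exact_weight: float = 3.0, phonetic_weight: float = 1.0) -> List[Tuple[str, float]]:
    """Score exact matches more heavily than phonetic ones (weights are made up)."""
    scores: Dict[str, float] = {}
    for term in re.findall(r"[a-z]+", query.lower()):
        for doc_id in exact.get(exact_key(term), ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + exact_weight
        for doc_id in phonetic.get(crude_phonetic_key(term), ()):
            scores[doc_id] = scores.get(doc_id, 0.0) + phonetic_weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    petitions = {
        "p1": "Equal marriage for all",
        "p2": "Protect marraige law",
        "p3": "Leasehold reform",
    }
    exact_index, phonetic_index = build_index(petitions)
    print(search("marriage", exact_index, phonetic_index))
```

A correctly spelled query still finds the misspelled petition through the phonetic key, but the petition that matches exactly ranks above it, which is the precision and retrieval split described above.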
In practice, this worked very well – the new server can handle search loads at least five times greater than the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!
Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, for less than two days of effort.
There have been some very encouraging noises recently about increased use of open source software by the UK Government: for example we’ve seen the creation of an Open Source Procurement Toolkit by the Cabinet Office, which lists Xapian and Apache Lucene/Solr as alternatives to the usual closed source options. The CESG, the “UK Government’s National Technical Authority for Information Assurance”, has clarified its position on open source software, which has led to the Cabinet Office dispelling some of the old myths about security and open source. We know that the Cabinet Office’s ‘skunkworks’, the Government Digital Service, are using Solr for several of their projects. Francis Maude MP was recently in the USA with some of the GDS team and visited amongst others our US partners Lucid Imagination.
The British Computer Society have helped organise a series of Awareness Events for civil servants and I’m glad to be speaking at the first of these next Tuesday 21st February on open source search – hopefully this will further increase the momentum and make it even more clear that a modern Government needs to consider this modern, flexible and economically scalable approach to software.
I spent yesterday evening at the British Computer Society on the panel of an event organised by the Open Source Specialist Group, nominally discussing the skills required to build Content Management Systems (CMS) using open source software (OSS). We heard a lot about the features and advantages of CMS such as Joomla, Drupal and Plone and the document management system Alfresco, and I contributed some details of Apache Lucene/Solr and Xapian which can be used in concert with all of these systems (and are usually available as plug-in modules).
We also considered how best to encourage the further use of OSS within the UK government, and I’ve tried to list some of the suggestions that were made – this is in no way a complete list, but it’s a start.
- Look at what has been done with OSS in other countries in the government sector – e.g. the PloneGov initiative. A lot of this knowledge and expertise should be transferable.
- Publicise current use within government – we all know that OSS is already being used on government websites and intranets, but if this can be more widely known it will encourage further use of OSS within the sector. We hear that there are already ‘skunkworks’ teams in government using open source and open standards – make sure we hear more about what they build.
- Support the open source projects themselves – this could be by contributing code developed within government back to OSS projects, or by supporting the open source community in other ways – for example, funding the creation of better documentation, or making it easier to run open source conferences (perhaps with the help of local government).
- Improve the procurement process to better understand open source as a viable alternative and to ease its adoption (for example, many open source companies are smaller than closed source vendors and thus less able to engage in lengthy and expensive procurement rounds).
- Understand that comparing OSS to a closed source product is often like comparing apples to oranges – OSS provides a highly flexible toolkit where the user chooses what features they want, as opposed to a closed source product where feature sets are fixed by the vendor. During procurement, simple ‘check box’ lists of required features are thus not always applicable.
- Listen more to OSS experts and bring them into government to help educate and inform.
We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss, who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London discussing the skills necessary for building content management systems (search being an important part of this).
Guildfoss are also organising the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.
We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.
We’ll be attending the Guardian’s Public Procurement Show on June 14th & 15th as part of the Open Government stand – with the recent release by the UK government Cabinet Office of a new IT strategy (here are some industry reactions) it will be interesting to see whether anyone still believes the FUD about open source in the face of the evidence.
We’re also organising another search meetup in Cambridge on April 5th, this time featuring two perspectives on learning, and will also be at a more informal gathering of open source search people on May 3rd.
There’s a lot of buzz currently around the UK government and its approach to IT projects (which has been historically rather poor in terms of delivery, schedules and cost). We’ve written before about an Action Plan that recommends open source and open standards, but it seems that actually implementing these is more of a problem, especially when you consider (flexible and more agile) smaller suppliers such as ourselves who may not even get a chance to compete for the business.
There’s an inquiry running currently that promises to look at this, and they have invited various people to put their views across. Unfortunately with one laudable exception these people were from (or mainly represent) very large IT companies who already supply the government and whose interest lies in maintaining the status quo.
As Mark Taylor of Sirius has already pointed out, this situation isn’t going to change until government procurement itself becomes an open process, so that we can all see how much could be wasted on outdated project management methods and overpriced closed source software.
I’ve been reading the revised Open Source, Open Standards and ReUse: Government Action Plan – it’s surprising (and heartening) to see this has existed in one form or another since as far back as 2004.
The key changes for this version are:
- suppliers have to show evidence they’ve considered open source options – hopefully this will be more than a quick trawl through SourceForge
- ‘shadow license costs’ have to be shown in calculations to take account of previous purchases of ‘perpetual’ licenses – apparently in some cases this could make software license fees for a project appear as zero!
- all purchases have to be on the basis of re-use across the government sector – so no need to pay again if a system moves to the cloud in the future
This all sounds great for the open source community; let’s also hope that increased openness in government means that we’ll be able to check the Action Plan is actually being followed!
By the way, a great example of open source in action on government data is They Work For You, which cleans up Hansard and makes it more accessible – search is powered by Xapian.