Posts Tagged ‘government’

How we built a search engine for UK MP tweets with Solr, Python & StanfordNLP

Matt Pearce writes:

We recently released UKMP, a search application built on work done on last year’s Enterprise Search hack day. This presents the tweets of UK Members of Parliament with search options including filtering by party, retweet and favourite count, and entities (people, locations and organisations) extracted from the tweet text. This is obviously its first incarnation, so there are still a number of features in development, but I thought I would comment on some of the decisions taken while developing the site.

I started off by deciding which bits of the hack day code would be most useful, covering both the Solr set-up and the web application we hoped to build. During the hack day the group had split into a number of smaller teams, two of which worked on a set of data downloaded from Twitter containing the original set of UK MP tweets. I took the basic Solr setup and indexing code from one group, and the initial web application from the other.

Obviously we couldn’t work with a completely static data set, so I set about putting together a Python script to grab the tweets. This was where I met the first hurdle: I was trying to grab tweets from individual MPs’ feeds, but kept getting blocked by the Twitter API, even though I didn’t think I was over-stepping the limits set on the calls. With 200-plus MPs to track, a different approach would be required to avoid being blocked, so eventually I switched to using the lists compiled by Tweetminster, who track politicians’ tweets themselves. This worked much better, and I could soon start building a useful data set.
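
For the curious, here’s a minimal sketch of the list-based approach using the Tweepy library – the credentials are placeholders and the list slug is an assumption, so check Tweetminster’s actual list name before running it:

    import tweepy

    # Placeholder credentials - substitute your own Twitter API keys.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)

    def fetch_mp_tweets(since_id=None):
        # One lists/statuses call covers every MP on the list, instead of
        # polling 200-plus individual feeds and tripping the rate limits.
        return api.list_timeline(owner_screen_name="tweetminster",
                                 slug="ukmps",  # assumed list slug
                                 since_id=since_id,
                                 count=200)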

I chose the second group’s web application because it already used the Stanford NLP software to extract entities from the tweet text. The indexer script, also written in Python, calls the web app to extract the entities before indexing the tweets. We spent some time trying to incorporate the Stanford sentiment analysis as well, but found it wasn’t practical – the response time was too slow, and we didn’t have time to train a model on our data to provide a more useful analysis of the content (almost all tweets were rated as either “negative” or “neutral”, which didn’t accurately reflect the sentiments in the data).
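
The round trip looks roughly like this – a sketch only, since the endpoint URL, core name and field names here are assumptions rather than the project’s actual ones:

    import pysolr
    import requests

    NER_ENDPOINT = "http://localhost:8080/ner"  # hypothetical extraction URL
    solr = pysolr.Solr("http://localhost:8983/solr/ukmp")  # assumed core name

    def index_tweet(tweet):
        # Ask the Stanford NER web app for entities first...
        entities = requests.post(NER_ENDPOINT,
                                 data={"text": tweet["text"]}).json()
        # ...then index the enriched document alongside the raw text.
        solr.add([{
            "id": tweet["id_str"],
            "text": tweet["text"],
            "people": entities.get("PERSON", []),
            "locations": entities.get("LOCATION", []),
            "organisations": entities.get("ORGANIZATION", []),
        }], commit=True)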

Since this was an entirely new project, and because it was being done outside the main client workflow, I took the opportunity to try out AngularJS, an MVC-oriented JavaScript front-end framework. This runs on top of, and calls back to, the DropWizard web application, which provides the Model part of the Model-View-Controller system. AngularJS itself provides the Controller, while the Views are all written in fairly standard HTML, with some AngularJS frosting to fill in the content.

AngularJS itself generally made development very easy and fast, and I was pleased by how little JavaScript I had to write to build a working application (there is also a Bootstrap crossover module, providing AngularJS directives that work with Bootstrap’s UI layout tools). As this is a small site, there are only two controllers in play: one for each page. AngularJS also makes it very easy to plug in other script modules, such as the one used to generate the word cloud on the About page. However, I did come across a few sticking points as I built the app, as one might expect from a first-time user. The principal one was handling the search box at the top of the page, which had to be independent of the view while still being able to modify it to display the search results. I am still not sure that I ended up with the best approach: the search form fires an event when submitted, which then percolates up the AngularJS control hierarchy until caught and dealt with. Within the search page, the search is handled normally; from other pages, we redirect to the search page and pass in the term. It doesn’t feel as smooth as it should, which is why I remain unconvinced this is the best solution.
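
To make the plumbing concrete, here is roughly what that event handling looks like – controller and event names are invented for illustration, not taken from the UKMP code:

    var app = angular.module('ukmp', []);

    // Search-box controller: emit the query up the scope hierarchy on submit.
    app.controller('SearchBoxCtrl', function ($scope) {
      $scope.submit = function () {
        $scope.$emit('search:submitted', $scope.term);
      };
    });

    // Search-page controller: catch the event and run the query; on other
    // pages the equivalent handler redirects here, passing the term along.
    app.controller('SearchPageCtrl', function ($scope, $rootScope, $location) {
      var unbind = $rootScope.$on('search:submitted', function (event, term) {
        $location.search('q', term);
        // ...call back to the DropWizard service and update the results model.
      });
      $scope.$on('$destroy', unbind);  // tidy the listener away with the page
    });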

All in all, this was an interesting sideline project, and provided a good excuse to try out some new technology. The code itself, along with some notes on how to get the system up and running, is in our github repository – feel free to try it out, and make suggestions for improvements or better ways to use the code.

G-Cloud and open file formats, a cautionary tale

We’re lucky enough to have our services available on the G-Cloud, a new initiative by the UK Government’s Cabinet Office with the aim of breaking the sometimes monopolistic practices of ‘big IT’ when supplying government clients. We’ve recently had a couple of contracts procured via the G-Cloud iii framework and one of the requirements is to report whenever a client is invoiced. This is done via a website called Management Information Systems Online (MISO).

Part of the process is to input various mysterious Product Codes, and to find out what these were I downloaded a file from the MISO website. I use the Firefox browser and OpenOffice so I had assumed that opening this file would be a relatively simple process…perhaps unwisely.

Firstly, due to some quirk of the website and/or browser the file arrives with no file extension. I’m assuming it’s some kind of Microsoft Office document so I try renaming it to .xls as an Excel spreadsheet, and open it in OpenOffice Calc. This doesn’t work, as I end up with a load of XML in the spreadsheet cells. As it’s XML I wonder if it’s a newer, XML-powered Office format, so rename to .xlsx, but no, it seems that doesn’t work either. Opening up the file in a text editor shows it’s some kind of XML with Microsoft schemas abounding. At this point I tried contacting the MISO technical support department but they weren’t able to help.

A quick Google and I’ve discovered that the file is probably SpreadsheetML, a file format used before 2007 when Microsoft finally went the whole hog and embraced (well, forced everyone else to embrace) their own XML-based standard for Office documents. The latter format is something OpenOffice can easily read, so I try renaming the file as .xml and importing it. OpenOffice now tells me "OpenOffice.org requires a Java runtime environment (JRE) to perform this task. The selected JRE is defective."
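
Faced with a similar mystery download, a few lines of Python can identify it – SpreadsheetML files announce themselves with an mso-application processing instruction near the top (the filename here is hypothetical):

    # Sniff the first few hundred bytes of the extensionless download.
    with open("miso_download", "rb") as f:  # hypothetical filename
        head = f.read(512)
    if b"mso-application" in head and b"Excel.Sheet" in head:
        print("SpreadsheetML: rename to .xml and import as a spreadsheet")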

This is now taking far too long. After some more research I discover that what this actually means is that OpenOffice needs a version of Java 6 (now discouraged by Oracle), and I have to register for an Oracle account just to download it. Finally, OpenOffice is able to read the file and I can fill in the original form.

If anything, this process shows that central government still has a long way to go in adopting open standards and using plain, widely adopted file formats. The G-Cloud framework is a great step forward – but some of the details still need work.

An open approach to tuning search for gov.uk

Roo Reynolds from the GDS team has written a great blog post about the ongoing process of tuning the search for gov.uk which I can highly recommend.

We regularly see situations where a search project has been set up as ‘fire and forget’ – which is never a good idea: not only does content grow, but user needs change and search requirements evolve, whatever the application. Search should be a living project: monitoring user behaviour should reveal not just which searches ‘work’ (i.e. the user gets some results which they then click on) but more importantly which ones don’t. For example, common misspellings or acronyms might be a useful addition to a synonym list (see the example below); if average search response times are lengthening, it might be time to consider performance tuning or even scaling out; and constant use of the ‘Next 10 Results’ button might indicate a problem with relevance ranking.
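
As a concrete example, a handful of entries in a Solr synonyms.txt – these particular ones are invented, but they are the sort of thing search logs tend to suggest – can quietly rescue queries that would otherwise fail:

    # Map common misspellings onto the correct term...
    accomodation => accommodation
    seperation => separation
    # ...and expand acronyms so either form matches.
    hmrc, hm revenue and customs
    dvla, driver and vehicle licensing agency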

Luckily any improvements to gov.uk made by the GDS team should appear in their GitHub repository at some point – as I mentioned before, the GDS team are (very sensibly) committed to an open source approach.


A revolution in open standards in government

Something revolutionary has been happening recently in the UK government with regard to open source software, standards and data. Change has been promised before and some commentators have been (entirely correctly) cynical about the eventual result, but it seems that finally we have some concrete results. Not content with a public policy and procurement toolkit for open source software, the Cabinet Office today released a policy on open standards – and, contrary to what many had feared, they have got it right.

Why do open standards matter? Anyone who has attempted to open a Word document of recent vintage in an older version of the same software will know how painful it can be. In the world of search we often have to be creative in how we extract data from proprietary, badly documented and inconsistent formats (get thee behind me, PDF!) – at Flax we came up with a novel method involving a combination of Microsoft’s IFilters and running OpenOffice as a server (you can download our Flax Filters as open source if you’d like to see how this works). If all else fails it is sometimes possible to extract strings from the raw binary data. However, we generally don’t have to preserve paragraphs, formatting and other specifics – and that is the kind of fine detail that often matters, especially in the government or legal arena. Certain companies have been downright obstructive in how they define their ‘standards’ (and I use that word extremely loosely in this case). The same companies have been accused by many of trying to influence the Cabinet Office consultation process, introducing the badly defined FRAND concept. However, the consultation process has been carefully and correctly run and the eventual policy is clear and well written.

It will be very interesting to see how commercial closed source companies react to this policy – but in the meantime those of us in the open source camp should be cheered by the news that finally, after many false starts and setbacks, ‘open’ really does mean, well, ‘open’.


Tuning and improving elasticsearch for the Government Digital Service

The exciting GOV.UK project is getting close to its first release date of October 17th, and we were asked to help with some search tuning as the team migrates from Apache Solr to elasticsearch. Although elasticsearch has some great features, there are still some areas where it lags behind Solr, such as the lack of spelling suggestion and proximity boost features. Alan from Flax spent a couple of days working with the GDS team and has blogged about how proximity boosting in particular can be implemented – at least for terms that are relatively close to each other, rather than separated by a page or so.
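
Alan’s post has the details; the usual shape of such a fix – a sketch, not necessarily the exact query GDS used – is a bool query in which a sloppy, boosted match_phrase clause rewards documents whose terms occur near each other:

    {
      "query": {
        "bool": {
          "must": {
            "match": { "body": "income tax" }
          },
          "should": {
            "match_phrase": {
              "body": { "query": "income tax", "slop": 50, "boost": 2 }
            }
          }
        }
      }
    }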

If you’re interested in more details of how we fixed this and a few other elasticsearch issues, you may want to take a look at the code we worked on – one of the best things about working with the GOV.UK team is that it was already up as open source software within a day (yes, you read that right – code paid for by the taxpayer is open source, as it should be!). We’re looking forward to launch day!

Update: changed ‘proximity search’ to ‘proximity boost’ – thanks Alan!


Eleven years of open source search

It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and, like today, the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new, and had only started selling advertising a year before), it wasn’t always easy for us to find a market for our services.

When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in clients considering applications and techniques outside traditional site or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.

So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.

Better search for e-petitions – handling misspelled content with a Solr phonetic filter

We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.

The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, and maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.

The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former used the standard Sunspot field mappings, which is fine for many applications but in this case did not allow flexible field weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.

Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.

What we came up with was two levels of fields: the first normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding the same phonetic code). The exact fields should provide precision, the phonetic fields, retrieval. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface, and removed the custom query string construction from the client code.
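
By way of illustration, the relevant parts of the configuration look something like this – the field and handler names are invented for the sketch, not taken from the live schema:

    <!-- schema.xml: exact(ish) and phonetic analysis chains (Solr 3.6) -->
    <fieldType name="text_exact" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_phonetic" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PhoneticFilterFactory"
                encoder="DoubleMetaphone" inject="false"/>
      </analyzer>
    </fieldType>

    <!-- Populate the phonetic field from the exact one at index time. -->
    <field name="title_exact" type="text_exact" indexed="true" stored="true"/>
    <field name="title_phonetic" type="text_phonetic" indexed="true"
           stored="false"/>
    <copyField source="title_exact" dest="title_phonetic"/>

    <!-- solrconfig.xml: eDisMax handler weighting exact above phonetic. -->
    <requestHandler name="/search" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="qf">title_exact^5 title_phonetic</str>
      </lst>
    </requestHandler>

With inject="false" the phonetic field holds only the Double Metaphone codes, which is why it can safely be given a much lower weight than the exact field in the qf parameter.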

In practice this worked very well – the new server can handle at least five times the search load of the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” now return relevant petitions!

Phonetic algorithms are never going to catch every misspelling, and had Solr 4.0 been released at the time (with its very fast fuzzy engine) it would have been the obvious approach to try. For now, though, the search is much better – for less than two days of effort.


Searching for (and finding) open source in the UK Government

There have been some very encouraging noises recently about increased use of open source software by the UK Government: for example we’ve seen the creation of an Open Source Procurement Toolkit by the Cabinet Office, which lists Xapian and Apache Lucene/Solr as alternatives to the usual closed source options. The CESG, the “UK Government’s National Technical Authority for Information Assurance”, has clarified its position on open source software, which has led to the Cabinet Office dispelling some of the old myths about security and open source. We know that the Cabinet Office’s ‘skunkworks’, the Government Digital Service, are using Solr for several of their projects. Francis Maude MP was recently in the USA with some of the GDS team and visited, amongst others, our US partners Lucid Imagination.

The British Computer Society have helped organise a series of Awareness Events for civil servants, and I’m glad to be speaking at the first of these, next Tuesday 21st February, on open source search – hopefully this will further increase the momentum and make it even clearer that Government needs to consider this modern, flexible and economically scalable approach to software.


Encouraging the use of open source software in government

I spent yesterday evening at the British Computer Society on the panel of an event organised by the Open Source Specialist Group, nominally discussing the skills required to build Content Management Systems (CMS) using open source software (OSS). We heard a lot about the features and advantages of CMS such as Joomla, Drupal and Plone and the document management system Alfresco, and I contributed some details of Apache Lucene/Solr and Xapian, which can be used in concert with all of these systems (and are usually available as plug-in modules).

We also considered how best to encourage the further use of OSS within the UK government, and I’ve tried to list some of the suggestions that were made – this is in no way a complete list, but it’s a start.

  • Look at what has been done with OSS in other countries in the government sector – e.g. the PloneGov initiative. A lot of this knowledge and expertise should be transferable.
  • Publicise current use within government – we all know that OSS is already being used on government websites and intranets, but if this can be more widely known it will encourage further use of OSS within the sector. We hear that there are already ‘skunkworks’ teams in government using open source and open standards – make sure we hear more about what they build.
  • Support the open source projects themselves – this could be by contributing code developed within government back to OSS projects, or by supporting the open source community in other ways – for example, funding the creation of better documentation, or making it easier to run open source conferences (perhaps with the help of local government).
  • Improve the procurement process to better understand open source as a viable alternative and to ease its adoption (for example, many open source companies are smaller than closed source vendors and thus less able to engage in lengthy and expensive procurement rounds).
  • Understand that comparing OSS to a closed source product is often like comparing apples to oranges – OSS provides a highly flexible toolkit where the user chooses what features they want, as opposed to a closed source product where feature sets are fixed by the vendor. During procurement, simple ‘check box’ lists of required features are thus not always applicable.
  • Listen more to OSS experts and bring them into government to help educate and inform.

Tags: , ,

Posted in events

June 10th, 2011

No Comments »

Open source in the UK

We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss, who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London, discussing the skills necessary for building content management systems (search being an important part of this).

Guildfoss are also organising the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.

We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.