website – Flax

Why GCloud search is badly broken & how to fix it

Charlie Hull — Thu, 26 Jun 2014 15:26:23 +0000

The GCloud initiative and the associated CloudStore are a great idea – hoping to level the field of UK government IT supply, take advantage of flexible and agile delivery of software and services and help SMEs like ourselves compete against the large System Integrators (SIs) that dominate this market. GCloud sales have now reached £154m, although this is still a fraction of what the UK government spends on IT. We’re on GCloud 5 ourselves, by the way, so I have a vested interest in helping potential customers find us, and we’ve helped with government systems before.

Unfortunately the CloudStore itself has a search facility that is badly broken. There are several obvious issues. Many of the entries created by the larger suppliers have been keyword stuffed – here’s a particularly egregious example from Atos, which seems to include most of the terms used in software in the last few years. I found this using the search terms ‘enterprise search’, which produce very few relevant-looking results. The online guidance for CloudStore search suggests putting double quotes around my terms (sadly I think few users will think of this), which improves things a little, but there are still a lot of irrelevant results – an online conferencing system is fifth, for example.

Fortunately all is not lost and in the next iteration of GCloud we are promised major improvements to the search engine. I’m hoping this will include phrase boosting. However, if the big SIs and others are allowed to create the sort of bad-quality content I have shown above, no search engine in the world will be able to sort the wheat from the chaff. It is essential that CloudStore entries are subject to some kind of curation and that keyword stuffing is banned and/or heavily penalised, otherwise SMEs like ourselves will still find it very hard to compete with the big SIs.

Update: it seems there is a new system under construction, and the search works a lot better. Let’s hope it comes out of alpha soon and can be used by purchasers!

The post Why GCloud search is badly broken & how to fix it appeared first on Flax.

Better search for e-petitions – handling misspelled content with a Solr phonetic filter

Tom — Thu, 24 May 2012 09:40:41 +0000

We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.

The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily small enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.

The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which are fine for many applications but in this case did not allow flexible weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.
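
To give a feel for what such a hand-constructed query looks like, here is a minimal sketch in Python (the site itself is Ruby on Rails, and its real query-building code is not shown in this post). The field names `title` and `description` and the boost values are hypothetical; only the `term~0.8` fuzzy operator comes from the example above. The point is that the number of fuzzy clauses grows as fields × terms, and each clause has to be expanded against the index.

```python
def build_fuzzy_query(terms, field_boosts, similarity=0.8):
    """Build a Lucene-syntax query string with one fuzzy clause per term per field.

    field_boosts maps a (hypothetical) field name to its weighting.
    """
    clauses = []
    for field, boost in field_boosts.items():
        for term in terms:
            # term~0.8 requests fuzzy matching; ^boost weights the field.
            clauses.append(f"{field}:{term}~{similarity}^{boost}")
    return " OR ".join(clauses)


# Hypothetical fields and boosts, for illustration only.
query = build_fuzzy_query(["leasehold", "reform"], {"title": 2, "description": 1})
print(query)
```

Even this two-word query against two fields produces four fuzzy clauses, which hints at why the approach struggled under load.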

Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.

What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MRJ”). The exact fields should provide precision; the phonetic fields, recall. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
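
The two-level idea can be sketched outside Solr in a few lines of Python. This is an illustration under stated assumptions, not the configuration used on the site: Soundex stands in for Double Metaphone (which is not in the Python standard library), lowercasing stands in for the KStem normalisation, and the class name, weights and sample petitions are all made up. Each token is indexed twice, once exactly and once under its phonetic key, and exact matches are weighted higher.

```python
from collections import defaultdict

# Soundex letter groups; a simple stand-in for Double Metaphone.
_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word):
    """Return the 4-character Soundex key of a word ('' if it has no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = code
    return (key + "000")[:4]


class TwoTierIndex:
    """Toy index with an exact tier (precision) and a phonetic tier (recall)."""

    def __init__(self, exact_weight=2.0, phonetic_weight=1.0):
        self.exact = defaultdict(set)
        self.phonetic = defaultdict(set)
        self.exact_weight = exact_weight
        self.phonetic_weight = phonetic_weight

    def add(self, doc_id, text):
        # Roughly what copyField achieves: one source text, two analysed fields.
        for token in text.lower().split():
            self.exact[token].add(doc_id)
            key = soundex(token)
            if key:
                self.phonetic[key].add(doc_id)

    def search(self, query):
        scores = defaultdict(float)
        for token in query.lower().split():
            for doc_id in self.exact.get(token, ()):
                scores[doc_id] += self.exact_weight
            for doc_id in self.phonetic.get(soundex(token), ()):
                scores[doc_id] += self.phonetic_weight
        # Highest score first; ties broken by document id.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = TwoTierIndex()
index.add(1, "Equal marriage for all")
index.add(2, "Allow same sex marraige")
index.add(3, "Reform leasehold law")

results = index.search("marriage")
print(results)
```

The correctly spelled petition matches in both tiers and ranks first, while the misspelled one is still found through the phonetic tier alone. In Solr the same weighting would be expressed through per-field boosts in the query handler rather than in code.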

In practice, this worked very well – the new server can handle search loads at least 5 times greater than the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!

Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, for less than 2 days of effort.


Cambridge Search Meetup review – Two different kinds of university search

Charlie Hull — Thu, 08 Dec 2011 10:38:18 +0000

James Alexander of the Open University talked first on the Access to Video Assets project, a prototype system that looked at preservation, digitisation and access to thousands of TV programmes originally broadcast by the BBC. James’ team have worked out an approach based on open source software – storing programme metadata and video assets in a Fedora Commons repository, indexing and searching using Apache Solr, authentication via Drupal – that is testament to the flexibility of these packages (some of which are being used in non-traditional ways – for example Drupal is used in a ‘nodeless’ fashion). He showed the search interface, which allowed you to find the exact points within a long video where particular words are mentioned and play the video directly in a pop-up window. I’d seen this talk before (here’s a video and slides from Lucene Eurocon) but what I hadn’t grasped is how Solr is used as a mediation layer between the user and what can be some very complex data around the video asset itself (subtitles, rights information, format information, scripts etc.). As he mentioned, search is being used as a gateway technology to effective re-use of this huge archive.

Udo Kruschwitz was next with a brief treatment of his ongoing work on automatically extracting domain knowledge and using this to improve search results (for example see the ‘Suggestions’ on the University of Essex website) – he showed us some of the various methods his team have tried to analyze query logs, including Ant Colony Optimisation (modelling ‘trails’ of queries that can be reinforced by repeat visits, or ‘fade’ over time as they are less used). I liked the concept of developing a ‘community’ search profile where individual search profiles are hard to obtain – and how this could be simply subdivided (so for example searchers from inside a university might have a different profile to those outside). The key idea here is that all these techniques are automatic, so the system is continually evolving to give better search suggestions and hints. Udo and his team are soon to release an open source adaptive search framework to be called “Sunny Aberdeen” which we look forward to hearing about.
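
The ‘trail’ idea is easy to sketch. The following Python is a minimal illustration of reinforcement and fading only, not the method Udo’s team actually uses: the class name, the deposit and evaporation values, and the sample queries are all assumptions made for the example. Each time a query is followed by another in the logs, the trail between them is strengthened; at regular intervals every trail fades by a fixed fraction.

```python
from collections import defaultdict


class QueryTrails:
    """Toy pheromone model: query -> follow-up query trails that grow and fade."""

    def __init__(self, deposit=1.0, evaporation=0.1):
        self.deposit = deposit          # weight added per observed follow-up
        self.evaporation = evaporation  # fraction of weight lost per fade step
        self.pheromone = defaultdict(dict)  # query -> {follow_up: weight}

    def reinforce(self, query, follow_up):
        """Strengthen the trail from a query to the query that followed it."""
        edges = self.pheromone[query]
        edges[follow_up] = edges.get(follow_up, 0.0) + self.deposit

    def evaporate(self):
        """Fade every trail, so little-used suggestions gradually drop away."""
        for edges in self.pheromone.values():
            for follow_up in edges:
                edges[follow_up] *= 1.0 - self.evaporation

    def suggest(self, query, limit=3):
        """Return the strongest follow-up queries for a given query."""
        edges = self.pheromone.get(query, {})
        ranked = sorted(edges.items(), key=lambda item: (-item[1], item[0]))
        return [follow_up for follow_up, _ in ranked[:limit]]


# Hypothetical query log entries, for illustration only.
trails = QueryTrails()
trails.reinforce("library", "opening hours")
trails.reinforce("library", "opening hours")
trails.reinforce("library", "book loans")
trails.evaporate()

suggestions = trails.suggest("library")
print(suggestions)
```

A ‘community’ profile could then be as simple as keeping one such structure per group of searchers, for example one for queries from inside the university and one for those outside.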

The evening ended with networking and a pint or two in traditional fashion – thanks to both our speakers and to all who came, from as far afield as Milton Keynes, Essex and Luton. The group now has 70 members and we’re building an active and friendly local community of search enthusiasts.


Website Redesign

Charlie Hull — Wed, 12 May 2010 15:15:19 +0000

We’ve now completely redesigned the Flax website – we hope you like it. We’ve tried to focus more on explaining exactly what we do and how the Flax open source search platform might be able to help your business.

Of course, there are sure to be teething problems – if you find anything that doesn’t work do let us know!


Migrating from lemurconsulting.com

Tom — Wed, 06 May 2009 15:32:51 +0000

We finally decided to move entirely to flax.co.uk. The one page remaining is the news archive.


More technical details now available

Charlie Hull — Thu, 05 Feb 2009 16:22:44 +0000

Based on some feedback, we’ve made some more technical details about Flax available on our Features page. You can download the PDF here.
