Archive for June, 2011

How to remove a stored field in Lucene

While working on a customer project recently we found a very large field that was stored unnecessarily in the Lucene index, taking up a lot of space. As it would have taken a very long time to re-index (there are tens of millions of complex documents in this case) we looked for a way to remove the stored field in-place.

There’s an interesting set of slides from last year’s Apache Lucene Eurocon which discuss this kind of Lucene index post-processing, but we didn’t find any tools to do this particular task (although this doesn’t mean they don’t exist – for example Luke may be helpful). So we wrote our own, based on some examples in the ‘contrib’ directory of Solr 4. We override the document() methods of FilterIndexReader to remove the required field from each returned Document’s field list. Terms aren’t interfered with, so it really is like changing the field from being stored to not being stored; it’s still indexed.

The code is available here. It’s written against Lucene 2.9.3 (which is contained in Solr 1.4.1).
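
The linked code isn't reproduced here, but the wrapping idea described above can be sketched in plain Java. This is only an illustration and makes some assumptions: StoredFieldReader, InMemoryReader, FieldRemovingReader and copyAll are hypothetical stand-ins invented for this sketch, not Lucene classes, and a document's stored fields are modelled as a simple map. The real tool overrides document() on FilterIndexReader in the same spirit.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StoredFieldFilterSketch {

    /** Hypothetical stand-in for an index reader: returns the stored fields of document n. */
    public interface StoredFieldReader {
        int maxDoc();
        Map<String, String> document(int n);
    }

    /** Trivial in-memory reader, present only so there is something to wrap. */
    public static class InMemoryReader implements StoredFieldReader {
        private final List<Map<String, String>> docs;

        public InMemoryReader(List<Map<String, String>> docs) {
            this.docs = docs;
        }

        public int maxDoc() {
            return docs.size();
        }

        public Map<String, String> document(int n) {
            // Return a copy so callers cannot modify the underlying data.
            return new LinkedHashMap<>(docs.get(n));
        }
    }

    /** Wraps another reader and drops one stored field from every document it returns. */
    public static class FieldRemovingReader implements StoredFieldReader {
        private final StoredFieldReader in;
        private final String fieldToRemove;

        public FieldRemovingReader(StoredFieldReader in, String fieldToRemove) {
            this.in = in;
            this.fieldToRemove = fieldToRemove;
        }

        public int maxDoc() {
            return in.maxDoc();
        }

        public Map<String, String> document(int n) {
            Map<String, String> doc = new LinkedHashMap<>(in.document(n));
            doc.remove(fieldToRemove); // only the stored value is dropped
            return doc;
        }
    }

    /** Reads every document through the reader, standing in for writing out a new index. */
    public static List<Map<String, String>> copyAll(StoredFieldReader reader) {
        List<Map<String, String>> out = new ArrayList<>();
        for (int i = 0; i < reader.maxDoc(); i++) {
            out.add(reader.document(i));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "1");
        doc.put("body", "a very large stored value");
        List<Map<String, String>> docs = new ArrayList<>();
        docs.add(doc);

        StoredFieldReader filtered = new FieldRemovingReader(new InMemoryReader(docs), "body");
        System.out.println(copyAll(filtered)); // prints [{id=1}]
    }
}
```

Because the wrapper only touches what document() returns, anything else the underlying reader exposes (the terms, in Lucene's case) passes through unchanged, which is why the field stays searchable.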


Posted in Technical

June 24th, 2011


Cambridge Search Meetup – Flow in Search UX and TrueKnowledge

The Cambridge Enterprise Search Meetup last night featured Francis Rowland of the European Bioinformatics Institute and Rob Stacey of TrueKnowledge, in a newly refurbished venue. Thanks to all those who came; it was good to meet some new faces.

Francis talked about how search user interfaces should try not to restrict the user’s ‘flow’ of activity, as search is after all only a means to an end. Among the wealth of material he mentioned was the Endeca User Interface Design Pattern Library and what is sure to be a very useful upcoming book, Search Analytics for Your Site.

Rob told us about how TrueKnowledge provides a semantic question answering system – trying to understand the goal(s) of someone asking the system a question such as “is Madonna single?”. He also mentioned how this kind of technology might be applied to an enterprise environment, for example to answer questions like “has the invoice for last Thursday’s job been paid?”. Rob’s talk sparked off a very active Q&A session, with the audience raising issues such as how TrueKnowledge’s method might be applied to languages other than English and how to model the trustworthiness of their sources, which include Wikipedia.

Francis’ slides are now online – with some great sketchnotes of Rob’s talk as well! Thanks to both our speakers.

Whitepaper – Why you should be considering open source search

I’ve uploaded a whitepaper I wrote a short while ago:

“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”

It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.

Encouraging the use of open source software in government

I spent yesterday evening at the British Computer Society on the panel of an event organised by the Open Source Specialist Group, nominally discussing the skills required to build Content Management Systems (CMS) using open source software (OSS). We heard a lot about the features and advantages of CMS such as Joomla, Drupal and Plone and the document management system Alfresco, and I contributed some details of Apache Lucene/Solr and Xapian, which can be used in concert with all of these systems (and are usually available as plug-in modules).

We also considered how best to encourage the further use of OSS within the UK government, and I’ve tried to list some of the suggestions that were made – this is in no way a complete list, but it’s a start.

  • Look at what has been done with OSS in other countries in the government sector – e.g. the PloneGov initiative. A lot of this knowledge and expertise should be transferable.
  • Publicise current use within government – we all know that OSS is already being used on government websites and intranets, but if this can be more widely known it will encourage further use of OSS within the sector. We hear that there are already ‘skunkworks’ teams in government using open source and open standards – make sure we hear more about what they build.
  • Support the open source projects themselves – this could be by contributing code developed within government back to OSS projects, or by supporting the open source community in other ways – for example, funding the creation of better documentation, or making it easier to run open source conferences (perhaps with the help of local government).
  • Improve the procurement process to better understand open source as a viable alternative and to ease its adoption (for example, many open source companies are smaller than closed source vendors and thus less able to engage in lengthy and expensive procurement rounds).
  • Understand that comparing OSS to a closed source product is often like comparing apples to oranges – OSS provides a highly flexible toolkit where the user chooses what features they want, as opposed to a closed source product where feature sets are fixed by the vendor. During procurement, simple ‘check box’ lists of required features are thus not always applicable.
  • Listen more to OSS experts and bring them into government to help educate and inform.


Posted in events

June 10th, 2011


Open source in the UK

We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss, who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London, discussing the skills necessary for building content management systems (search being an important part of this).

Guildfoss are also organising the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.

We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.