Posts Tagged ‘open source’

Flax’s 10th birthday!

Today marks 10 years since we formed Flax (originally as Lemur Consulting Ltd.). We had an idea that search based on open source software was going to be increasingly important (indeed, our original business model was consultancy based on Xapian) and I think we’ve been proved right over the decade. Today, in the depths of a recession, we’re seeing significant growth in the business and some fascinating opportunities: the sector is still going through rapid change and it will be very interesting to see what the next few years bring.

Thanks to all of those who have worked with us and for us over the last decade – we look forward to the next ten years in this exciting field!

Posted in events

July 27th, 2011

Enterprise Search Europe & a SuperSized Search Meetup

We’ve been helping to organise a new conference to be held in London this October, Enterprise Search Europe. This two-day event promises to give a ‘European perspective on the technology, selection, implementation and optimisation of enterprise-scale search’ and features speakers from 3i plc, Logica, The Guardian and a number of search providers such as Findwise, Funnelback and ourselves (I’ll be talking on ‘Building a Strong Business Foundation with Open Source Search’ on the second day).

It’s going to be a busy time as I’m also chairing a panel on the first day and helping run the evening reception, which is co-hosted by the London and Cambridge Search Meetups – this is likely to be one of the largest Search Meetups ever and is sure to be a fascinating evening, featuring speakers from the conference in an informal setting (i.e., a pub!).

Hope to see some of you there.

How to remove a stored field in Lucene

While working on a customer project recently we found a very large field that was stored unnecessarily in the Lucene index, taking up a lot of space. As it would have taken a very long time to re-index (there are tens of millions of complex documents in this case) we looked for a way to remove the stored field in-place.

There’s an interesting set of slides from last year’s Apache Lucene Eurocon which discuss this kind of Lucene index post-processing, but we didn’t find any tools to do this particular task (although this doesn’t mean they don’t exist – for example Luke may be helpful). So we wrote our own, based on some examples in the ‘contrib’ directory of Solr 4. We override the document() methods of FilterIndexReader to remove the required field from each returned Document’s field list. Terms aren’t interfered with, so it really is like changing the field from being stored to not being stored; it’s still indexed.

The code is available here. It’s written against Lucene 2.9.3 (which is contained in Solr 1.4.1).
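For readers who want the shape of the idea without the Lucene dependency, here is a minimal, self-contained sketch of the same decorator pattern. The types `StoredDoc`, `DocReader`, `InMemoryReader` and `FieldStrippingReader` are hypothetical stand-ins invented for this illustration, not Lucene classes and not the code linked above. In Lucene 2.9 the corresponding pieces would be `FilterIndexReader.document(int, FieldSelector)` and `Document.removeFields(String)`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a stored document: an ordered list of name/value pairs.
final class StoredDoc {
    private final List<String[]> fields = new ArrayList<>();

    void add(String name, String value) { fields.add(new String[] {name, value}); }

    // Drop every stored value for the given field name.
    void removeFields(String name) { fields.removeIf(f -> f[0].equals(name)); }

    // Return the first stored value for the field, or null if it is not stored.
    String get(String name) {
        for (String[] f : fields) {
            if (f[0].equals(name)) return f[1];
        }
        return null;
    }

    int fieldCount() { return fields.size(); }

    // Shallow copy, so a filtering reader never mutates the underlying data.
    StoredDoc copy() {
        StoredDoc c = new StoredDoc();
        for (String[] f : fields) c.add(f[0], f[1]);
        return c;
    }
}

// Hypothetical stand-in for the part of an index reader that returns stored fields.
interface DocReader {
    int maxDoc();
    StoredDoc document(int n);
}

// A trivial reader over an in-memory list, playing the role of the original index.
final class InMemoryReader implements DocReader {
    private final List<StoredDoc> docs;

    InMemoryReader(List<StoredDoc> docs) { this.docs = docs; }

    public int maxDoc() { return docs.size(); }

    public StoredDoc document(int n) { return docs.get(n).copy(); }
}

// The filter: delegates everything to the wrapped reader, but removes one
// field from each document on its way out. Nothing else is touched, which
// mirrors the point above that indexed terms are left alone.
final class FieldStrippingReader implements DocReader {
    private final DocReader in;
    private final String fieldToDrop;

    FieldStrippingReader(DocReader in, String fieldToDrop) {
        this.in = in;
        this.fieldToDrop = fieldToDrop;
    }

    public int maxDoc() { return in.maxDoc(); }

    public StoredDoc document(int n) {
        StoredDoc doc = in.document(n);
        doc.removeFields(fieldToDrop);
        return doc;
    }
}
```

With real Lucene classes, a wrapped reader of this kind would presumably be fed to a fresh `IndexWriter` (for example via `addIndexes(IndexReader[])`) so that the slimmed-down documents are written to a new index; that step is an assumption on our part and is not shown here.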

Posted in Technical

June 24th, 2011

Whitepaper – Why you should be considering open source search

I’ve uploaded a whitepaper I wrote a short while ago:

“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”

It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.

Encouraging the use of open source software in government

I spent yesterday evening at the British Computer Society on the panel of an event organised by the Open Source Specialist Group, nominally discussing the skills required to build Content Management Systems (CMS) using open source software (OSS). We heard a lot about the features and advantages of CMS such as Joomla, Drupal and Plone and the document management system Alfresco, and I contributed some details of Apache Lucene/Solr and Xapian, which can be used in concert with all of these systems (and are usually available as plug-in modules).

We also considered how best to encourage the further use of OSS within the UK government, and I’ve tried to list some of the suggestions that were made – this is in no way a complete list, but it’s a start.

  • Look at what has been done with OSS in other countries in the government sector – e.g. the PloneGov initiative. A lot of this knowledge and expertise should be transferable.
  • Publicise current use within government – we all know that OSS is already being used on government websites and intranets, but if this can be more widely known it will encourage further use of OSS within the sector. We hear that there are already ‘skunkworks’ teams in government using open source and open standards – make sure we hear more about what they build.
  • Support the open source projects themselves – this could be by contributing code developed within government back to OSS projects, or by supporting the open source community in other ways – for example, funding the creation of better documentation, or making it easier to run open source conferences (perhaps with the help of local government).
  • Improve the procurement process to better understand open source as a viable alternative and to ease its adoption (for example, many open source companies are smaller than closed source vendors and thus less able to engage in lengthy and expensive procurement rounds).
  • Understand that comparing OSS to a closed source product is often like comparing apples to oranges – OSS provides a highly flexible toolkit where the user chooses what features they want, as opposed to a closed source product where feature sets are fixed by the vendor. During procurement, simple ‘check box’ lists of required features are thus not always applicable.
  • Listen more to OSS experts and bring them into government to help educate and inform.

Posted in events

June 10th, 2011

Open source in the UK

We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss, who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London, discussing the skills necessary for building content management systems (search being an important part of this).

Guildfoss are also organising the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.

We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.

ECIR 2011 Industry day – part 2 of 2

Here’s the second writeup.

We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on the former, with a decoupled approach – i.e. the cache doesn’t automatically know what’s changed in the index. The approach is to give data in the cache a ‘time to live’ (TTL), after which it is refreshed – an acceptable approach as search engines don’t have a ‘perfect’ view of the web at any one point in time. As he mentioned, this method is less useful for ‘real-time’ data such as news.
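To make the TTL idea concrete, here is a minimal sketch of a decoupled query-result cache. It is our own illustration, not anything shown in the talk: the class name `TtlCache`, the `backend` function standing in for the search engine and the injectable `clock` are all hypothetical. The cache never asks the index whether anything has changed; it simply re-runs the query once an entry is older than its TTL.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical TTL cache: entries are trusted until they expire, then refreshed.
final class TtlCache<K, V> {

    // A cached value together with the time at which it stops being trusted.
    private static final class Entry<V> {
        final V value;
        final long expiresAt;

        Entry(V value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long ttlMillis;
    private final Function<K, V> backend; // stands in for the real search engine
    private final LongSupplier clock;     // injectable, so expiry can be tested

    TtlCache(long ttlMillis, Function<K, V> backend, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.backend = backend;
        this.clock = clock;
    }

    // Return the cached value if it is still fresh; otherwise query the
    // backend again and restart the entry's time to live.
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> e = entries.get(key);
        if (e == null || now >= e.expiresAt) {
            e = new Entry<>(backend.apply(key), now + ttlMillis);
            entries.put(key, e);
        }
        return e.value;
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. The trade-off is visible in `get`: a short TTL means more backend queries, while a long one risks serving stale results, which is why the approach suits slowly changing web pages better than news.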

Francesco Calabrese followed, talking about his work in the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices his group has looked at ‘digital footprints’ and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say a football match) from the points of origin of the attendees. This talk wasn’t really about search, although the data gathered would be useful in search applications with geolocation features.

Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.

I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved onto a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).

Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.

ECIR 2011 Industry Day – part 1 of 2

As promised here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related, queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.

After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.

Events: Open source for government and search in Cambridge

We’ll be attending the Guardian’s Public Procurement Show on June 14th & 15th as part of the Open Government stand – with the recent release by the UK government Cabinet Office of a new IT strategy (here are some industry reactions) it will be interesting to see whether anyone still believes the FUD about open source in the face of the evidence.

We’re also organising another search meetup in Cambridge on April 5th, this time featuring two perspectives on learning, and will also be at a more informal gathering of open source search people on May 3rd.

Posted in events

April 1st, 2011

UK Government IT – a closed shop to SMEs and OSS?

There’s a lot of buzz currently around the UK government and its approach to IT projects (which has been historically rather poor in terms of delivery, schedules and cost). We’ve written before about an Action Plan that recommends open source and open standards, but it seems that actually implementing these is more of a problem, especially when you consider (flexible and more agile) smaller suppliers such as ourselves who may not even get a chance to compete for the business.

There’s an inquiry running currently that promises to look at this, and they have invited various people to put their views across. Unfortunately, with one laudable exception, these people were from (or mainly represent) very large IT companies who already supply the government and whose interest lies in maintaining the status quo.

As Mark Taylor of Sirius has already pointed out, this situation isn’t going to change until government procurement itself becomes an open process, so that we can all see how much could be wasted on outdated project management methods and overpriced closed source software.

Posted in News

March 18th, 2011
