Archive for December, 2010

The year open source search got serious

It’s been an interesting and busy twelve months here at Flax – we’ve worked on some fantastic customer projects, spoken at conferences at home and abroad and made some great alliances and partnerships. We are talking to more people than ever before about the advantages of open source search and we’ve even started a local Meetup group.

This has been the year when open source search moved out of the shadows and became a force to be reckoned with – whether handling billions of queries or millions of customers, powering innovative new APIs for open content from forward-looking media companies or simply making it easier for search applications to be developed. Commercial support is now available to rival anything offered by the closed source world and there are now fully packaged solutions built on open source. In some sectors open source may even become the default choice (see what IDC said about the embedded/OEM market).

There’s still significant change to come in the search sector – I expect a few vendors will be in trouble by this time next year as they realise their business models (often built on per-document charges) are out-of-date, and we might also see further acquisitions by the usual behemoths. All this leads to reduced choice and increased costs for customers, and this is where open source can help – you can build your search solution in-house, or engage companies like ours to help, but you’re no longer locked in to a vendor’s roadmap and shackled to their business plan (or the consequences of its failure!).

I’ll leave the final word to Matt Asay of Canonical, who says: “Open source is how we do business 10 years into this new millennium.”

Next-generation media monitoring with open source search

Media monitoring is not a traditional search application: for a start, instead of searching a large number of documents with a single query, a media monitoring application must search every incoming news story with potentially thousands of queries, searching for words and terms relevant to client requirements. This can be difficult to scale, especially when accuracy must be maintained – a client won’t be happy if their media monitors miss relevant stories or send them news that isn’t relevant.
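
The core idea – matching each incoming story against many stored queries, rather than one query against many documents – can be sketched in a few lines of Python. This is a toy illustration only, not the system described below; the query structure (required and excluded terms per client) is invented for the example:

```python
import re

def tokenize(text):
    """Lowercase word tokens from an incoming news story."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

# Hypothetical stored client queries: terms that must all appear
# in a story, and terms that must not appear.
QUERIES = {
    "client-a": {"must": {"solar", "energy"}, "must_not": {"eclipse"}},
    "client-b": {"must": {"takeover", "bid"}, "must_not": set()},
}

def match_story(story):
    """Tokenize the story once, then test it against every stored query."""
    tokens = tokenize(story)
    return [
        qid for qid, q in QUERIES.items()
        if q["must"] <= tokens and not (q["must_not"] & tokens)
    ]
```

A real system would of course support full query syntax (phrases, proximity, boolean logic) and index the stored queries themselves so that each story only needs to be tested against a small candidate subset – checking thousands of queries linearly, as above, is exactly what doesn’t scale.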

We’ve been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant ‘false positives’, the new system is now 95% correct).

The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. The new system scales easily and cost effectively.
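
Coping with scanning (OCR) errors is worth a word: one simple approach – shown here as a toy sketch, not a description of the Durrants system – is to accept a term if any token in the story is close enough to it by edit similarity, using Python’s standard library:

```python
from difflib import get_close_matches

def fuzzy_contains(term, tokens, cutoff=0.8):
    """True if some token is close enough to `term` to count as a match,
    tolerating character-level errors such as 'rn' scanned for 'm'."""
    return bool(get_close_matches(term, tokens, n=1, cutoff=cutoff))
```

The cutoff trades recall against precision – set it too low and unrelated words start matching, which is precisely the false-positive problem the new system was built to avoid.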

As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.

Posted in News

December 13th, 2010

Chalk and cheese – the difficulty of analysing open source options

David Fishman of Lucid Imagination has blogged on how open source search is treated by the analyst community (you can even use his links to get hold of some of the reports mentioned for the usual price of your contact details). We can add to his list a report from the Real Story Group – and I hear Ovum will shortly release an updated report.

What I find most interesting about these analyst reports is how various vendors are subdivided – either by target market, or by size, or by how ‘complex’ their platform is. Open source solutions don’t always fit the categories – for example Real Story Group list ‘Apache Project’ as a ‘specialised vendor’ – which it really isn’t. Perhaps it’s time for some new categories in these analyst reports – maybe a list of specialist open source integrators, linked with the available technologies such as Lucene, Xapian or Sphinx, combined with some data about likely costs.

Posted in Reference

December 9th, 2010

How not to make the same mistake twice

We’ve been aware for a while now that some FAST customers are considering migration – but Autonomy have finally caught up.

However, if you migrate from one closed source solution to another, how can you guarantee that the same sort of events that led to the current situation won’t happen again? With open source, there’s no vendor lock-in, a wide choice of companies to assist you with development and integration, a wealth of different support options and of course no license fees to pay. Migrating from FAST is a common topic at conferences at the moment – read Jan Høydahl’s presentation, or see Michael McIntosh’s video. There are even open source document processing pipeline frameworks to replace the popular FAST one, and we’ve been evaluating some alternative language processing frameworks. Scaling isn’t an issue, and in some cases you could significantly reduce your hardware budget.

Posted in Uncategorized

December 6th, 2010

Online Information 2010 – it’s quiet, too quiet

We dropped in to the Online 2010 event at Olympia this week, and were immediately struck by how quiet the event was: yes, there’s been some terrible weather recently in the UK, but there were fewer stalls than last year, a smaller exhibition space and very few exhibitors in the enterprise search space – no Autonomy, Google, Vivisimo or Endeca for example. Unlike previous years there was no dedicated ‘search’ area on the exhibition floor, and we did see a few unmanned stands from mid-afternoon. Is this a sign of difficult times, or of an event that needs a rethink about its focus?

We didn’t attend the conference that runs next to the exhibition hall this year. This report on the closing panel shows that one question to the panel was about the rise of open source search – not surprisingly, the panel members (all being from closed source companies) weren’t very enthusiastic about this. According to Autonomy open source is only for the commodity end of the market, which is the smallest part. I’m not sure Twitter (1 billion queries a day), LinkedIn (30 million users), The Guardian (innovative open platform) or the Financial Times would agree…

Intranet search event

Intranet Search was the theme for a small gathering last night at the (rather imposing) Ministry of Justice in London. We heard from Luke Oatham on intranet search at the Ministry itself, powered by Google over a reasonably small set of static and hand-published HTML. Simon Thompson continued with a neat way of enhancing Sharepoint search, using JQuery to create an auto-complete tool for his company intranet, which interestingly displayed both ‘people’ and ‘page’ results in the same drop-down menu. Tyler Tate couldn’t make it to the event due to bad weather, but bravely volunteered to present over Skype on a (surprisingly good) 3G connection, and talked about handling diverse data (video, slides). Next up was our very own Tom Mortimer talking about indexing security information (of which more later) and we finished up with a quick talk from Rangi Robinson on the intranet at Framestore, with search powered by the open source Sphinx project.

Thanks to Simon Thompson and Angel Brown for organising the event and inviting us to speak.

Posted in events

December 3rd, 2010

Enterprise Search Meetup: exploratory search, TravelMatch and Stephen Arnold

Last night I went to another excellent Enterprise Search London Meetup, at Skinkers near London Bridge. I’d been at the Online show all day, which was rather tiring, so it was great to sit down with beer and nibbles and hear some excellent speakers.

Max Wilson kicked off with a talk on exploratory search and ‘searching for leisure’. His Search Interface Inspector looks like a fascinating resource, and we heard about how he and his team have been constructing a taxonomy for the different kinds of search people do, using Twitter as a data source.

Martina Schell was next with details of Travel Match, a holiday search engine that’s trying to do for holidays what our customer Mydeco is doing for interior design: scrape/feed/gather as much holiday data as you can, put it all into a powerful search engine and build innovative interfaces on top. They’ve tried various interfaces including a ‘visual search’, but after much user testing have reined back their ambitions somewhat – however they’re still unique in allowing some very complex queries of their data. Interestingly, one challenge they identified is how to inform users that one choice (say, airport to fly from) may affect the available range of other choices (say, destinations) – apparently users often click repeatedly on ‘greyed-out’ options, unsure as to why they’re not working…
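
The underlying mechanism here is standard faceted navigation: after applying the user’s current selections, count which values of each remaining facet still lead to results, and grey out the rest. A minimal Python sketch (with invented holiday data – this is not TravelMatch’s implementation):

```python
from collections import Counter

# Hypothetical holiday records, each tagged with facet values.
HOLIDAYS = [
    {"airport": "LHR", "destination": "Malaga"},
    {"airport": "LHR", "destination": "Nice"},
    {"airport": "STN", "destination": "Malaga"},
]

def available(facet, selections):
    """Count the values still reachable for `facet` after applying the
    user's other selections; values missing here would be greyed out."""
    matches = [
        h for h in HOLIDAYS
        if all(h[f] == v for f, v in selections.items() if f != facet)
    ]
    return Counter(h[facet] for h in matches)
```

Showing these counts next to each option – rather than silently greying things out – is one way to explain to users *why* a choice of airport has narrowed their choice of destination.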

The inimitable Stephen Arnold concluded the evening with a realistic treatment of the current fashion for ‘real-time’ search. His point was that unless you’re Google, with their fibre-connected, hardware-accelerated gigascale architecture, you’re not going to be able to do real-time web search or anything close to it; on a smaller scale, for financial trading, military and other serious applications you again need to rely on the hardware – so for proper real-time (meaning very close to zero latency), your engineering capability, not your software capability, is what counts. I’m inclined to agree – I trained as an electronic engineer and worked on digital audio, back when this was also only possible with clever hardware design. Of course, eventually commodity hardware gets fast enough to move away from specialised devices, and at that point even the laziest coder can create responsive systems, but we’re far away from that point. Perhaps the marketing departments of some search companies should take note – if you say you can do real-time indexing, we’re not going to believe you.

Thanks again to Tyler Tate and all at TwigKit for continuing to organise and support this excellent event.