Archive for October, 2010

Search briefing

Just a quick post to note I’ll be presenting at Ovum’s ‘Search Across the Enterprise’ event on Wednesday 3rd November in London. I’ll be talking about open source search for enterprise applications, dispelling some myths and describing some case studies.


Posted in events

October 27th, 2010


Building a new press cuttings service for the Financial Times

Those of you who read my slides from Search Solutions 2010 will have spotted a case study on our work for the Financial Times, one of the world’s leading business news organisations.

When the Financial Times decided to bring their digital press cuttings in-house in summer 2010, they asked us to build a powerful ‘search server’ that they could easily integrate into their existing product offerings.

We built an indexer for the XML source data and a RESTful Web Service API, offering search features including Boolean operators, phrase searches, area specifiers (search whole article, body, headline, byline or any combination), date range restrictions, similarity search (“articles like this one”) and faceted search. Also available are spelling correction and synonyms, along with detailed logging of indexing and all searches.
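To give a flavour of how features like these might be exposed over a RESTful API, here is a minimal Python sketch. It is not the actual Financial Times API: the endpoint URL, the parameter names (`q`, `area`, `from`, `to`, `facet`) and the helper function are all assumptions made up for illustration.

```python
from datetime import date
from urllib.parse import urlencode

# Hypothetical endpoint: the real service URL and parameter names are not
# given in this post, so everything here is an illustrative assumption.
BASE_URL = "http://search.example.com/api/search"

# The search areas mentioned in the post.
VALID_AREAS = {"article", "body", "headline", "byline"}


def build_search_url(query, areas=("article",), date_from=None,
                     date_to=None, facets=()):
    """Build a GET URL for a hypothetical press cuttings search API.

    query     -- query string, may contain Boolean operators and "phrases"
    areas     -- which parts of the article to search (any combination)
    date_from -- optional datetime.date, start of the date range
    date_to   -- optional datetime.date, end of the date range
    facets    -- names of fields to return facet counts for
    """
    unknown = set(areas) - VALID_AREAS
    if unknown:
        raise ValueError(f"unknown search areas: {sorted(unknown)}")

    # A list of pairs lets a parameter such as 'area' repeat.
    params = [("q", query)]
    params += [("area", area) for area in areas]
    if date_from is not None:
        params.append(("from", date_from.isoformat()))
    if date_to is not None:
        params.append(("to", date_to.isoformat()))
    params += [("facet", facet) for facet in facets]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # A phrase combined with a Boolean operator, restricted to headline
    # and body, for articles from 2010 onwards.
    print(build_search_url('"interest rates" AND bank',
                           areas=("headline", "body"),
                           date_from=date(2010, 1, 1)))
```

Keeping every option as a plain query parameter means a client can combine area, date and facet restrictions in a single GET request, which is one reason this style of API tends to be easy to integrate into existing products.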

This might sound like a complex task, but using open source technology we created this system in less than a fortnight. Initially designed as a small-scale prototype, the system scaled easily to indexing hundreds of thousands of pages. You can use the service at


Posted in News, Technical

October 25th, 2010


Search Solutions 2010 – a brief review

I spent yesterday at Search Solutions 2010, hosted by the British Computer Society. They’d been kind enough to ask me to speak (Update: my slides are available here, the rest are available at the event website above), but there were plenty of other people to listen to as well. There’s a great blow-by-blow account from Tyler Tate already, but here are some personal highlights:

Google’s Behshad Behzadi spoke about freshness for web content and how Google’s usual ranking strategy favours older results over new ones – as the new ones don’t have as many links. Vishwa Vinay from Microsoft talked about what to do with click data in enterprise search – he listed lots of papers on the subject, and hopefully his slides will be published so we can follow them up. He made the point that any ‘adaptive’ ranking based on click data must still work well out of the box, before any clicks have happened. This section of the event finished with Vivian Lin Dufour of Yahoo!, talking about some ways of guiding searchers from within the UI, with auto-suggest and similar techniques. Apparently the research the Yahoo team are doing on trending has let them spot news stories 12-24 hours before they hit the papers. I wondered afterwards, is this current fad for ‘trendspotting’ turning search engines into just a media channel? I don’t care much about the X-Factor TV show myself, so why should this current trend influence my search results?

Nick Patience started the next session talking about trends in the Enterprise Search market: he acknowledged the rapid rise of open source solutions and talked about how search-based applications will become increasingly important, with a huge market for ‘information governance’ solutions opening up. Chirag Ghandhi of Mphasis, a search integrator, talked about how customers are disillusioned with enterprise search, and how difficult it is to build solutions that cope with data from a range of different sources and in different languages. Dusan Rnic of Endeca stressed the importance of being able to handle the ‘long tail’ of search results (the ones that aren’t the most popular) and showed us his favourite website – strangely enough, an Endeca customer.

Greg Lindahl talked about how Blekko have built an innovative web crawling/indexing framework, which has enabled them to build up a 3 billion page index very efficiently – I’m looking forward to seeing more of this. As he said, what they’re doing isn’t necessarily better than Google, but it’s certainly different. My talk on open source search for news content followed, and then Roberto Cornacchia showed us Spinque’s approach to building search platforms – encapsulating search expert knowledge into logical ‘blocks’ that can be combined by domain experts into the solutions they actually need.

The last session began with Till Kinstler of GBV Common Library Network, a self-described ‘library hacker’, on building a search system using the open source engine Solr over 25 million library records – they’re now aiming for 120 million, taken from 400 different libraries, in source formats going all the way back to tape and paper library cards! We then heard about the Information Retrieval Facility, an open IR research institution – I liked their three principles of ‘open science, open source, open market’. The talks finished with Rob Stacey on True Knowledge’s ways of checking the veracity of facts gathered from the internet.

We then moved on to an open panel – some great themes here including the rise of search as a platform for new applications, what exciting (or scary) things Facebook might bring to the world of search, and how we should all work harder to bring good information retrieval mechanisms to those who cannot currently access them due to poverty, language barriers or disability.

Thanks to the BCS IRSG and in particular to Udo Kruschwitz for a very interesting and enlightening day.


Posted in News, events

October 22nd, 2010


When search isn’t just search at The Guardian

A fascinating event last night as the Guardian team told us more about how they’ve used open source search technology to build their new open platform. The presentations were brief and to-the-point, and covered how the team have created a detailed, rich API to their news content, all built on the open source engine Apache Solr – opening up Guardian Media Group content to the world for mashups, repurposing and innovative new business models.

The Guardian have an existing Oracle database with J2EE web applications to serve content, but discovered that certain operations, such as returning content with multiple tags or dynamically generated ‘related’ content, were very database-intensive and difficult to scale. The use of Solr effectively flattens the cost of these complex queries, and also allows them to scale up capacity on demand by simply spinning up more Solr instances on the Amazon EC2 cloud. Interestingly, site search for the Guardian website doesn’t yet use Solr, although they hope to move this across soon.
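As a rough illustration of why a ‘content with multiple tags’ lookup is cheap in Solr, here is a small Python sketch that builds such a query using Solr’s standard `q`, `fq`, `rows` and `wt` parameters. This is not the Guardian’s code: the Solr URL and the `tag` field name are assumptions for the sake of the example.

```python
from urllib.parse import urlencode

# Assumed location of a Solr select handler; adjust for a real installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def tagged_content_url(tags, rows=10):
    """Build a Solr query URL for documents carrying ALL of the given tags.

    Each tag becomes its own filter query (fq). Solr intersects the
    filters, so only documents matching every tag are returned.
    The field name 'tag' is hypothetical.
    """
    params = [("q", "*:*"), ("rows", str(rows)), ("wt", "json")]
    for tag in tags:
        # Escape backslashes and double quotes so the tag is a safe phrase.
        escaped = tag.replace("\\", "\\\\").replace('"', '\\"')
        params.append(("fq", f'tag:"{escaped}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(tagged_content_url(["politics", "environment"]))
```

Because each tag is a separate filter query, Solr can cache the set of matching documents per tag and simply intersect the cached sets, rather than performing the multi-way joins a relational database would need.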

What we’re seeing here is a change in how search technology is used, especially by forward-looking organisations: from being a bolt-on to an existing website or application, search is now the platform for new developments. I’ll be talking about other ways open source search has been used for news content at the British Computer Society this coming Thursday 21st October – I believe there are still a few places available.


Posted in Technical, events

October 19th, 2010


Further revolutions

Back for the second day of Lucene Revolution, with some great talks on migrating to Solr from FAST ESP, the new flexible indexing features coming to Lucene ‘real soon now’, and finishing off with a panel discussion. I felt privileged to sit as part of this panel between Eric Gries, CEO of Lucid Imagination, and Paul Doscher of Exalead – the discussion was lively and (I hope!) interesting to the audience.

I’m looking forward to returning to the UK with all I’ve learnt from this event, and to following up on some of the ideas generated – for example, it would be great to be able to demonstrate LucidWorks Enterprise to interested parties in London.

Thanks to Stephen Arnold’s team and all at Lucid Imagination for organising such a great conference. It won’t be the last I’m sure!


Posted in events

October 8th, 2010


A revolution indeed

I’m at the Lucene Revolution conference in Boston, USA for the next few days – and it’s aptly named. If there’s anyone out there who still doubts that open source search is a serious alternative to a commercial engine, the numbers and other information coming out of this event will be convincing. Twitter are now using Lucene to handle a billion queries a day; LinkedIn and are already veterans with similarly huge installations. The conversations I’m having and overhearing are about billions of documents, tens of thousands of users, all easily handled by open source search.

The other big news here is that Lucid Imagination have released software to fill in most if not all of the gaps between Lucene/Solr and the closed-source competition – it’s called LucidWorks Enterprise and adds a detailed administration UI, a REST API, crawlers, scaling functionality and much more. I’m looking forward to getting my hands on a demo and showing it off when back in the UK.

There’s an optimistic, buzzing energy at this event – a real feeling that we’re here at the beginning of something big. More revolutionary news to come!


Posted in News, events

October 7th, 2010


Flax partners with Lucid Imagination

We’re very happy to announce that we’ve been selected as an Authorized Partner by Lucid Imagination, the commercial company for Lucene and Solr. You can read the press release as a PDF here.

Apache Lucene and Solr, available as open source software from the Apache Software Foundation, are powerful, scalable, reliable and fully-featured search technologies. Solr is the Lucene Search Server, making it easy to build search applications for the enterprise.

With our long experience of customising, installing and supporting open source search engines, this partnership is a natural fit for us, and we’re excited by the opportunities it presents. In addition to our current offerings, Flax will now offer installation, integration and commercial support packages for Lucene and Solr, backed by Lucid Imagination.


Posted in Business, News

October 4th, 2010
