Posts Tagged ‘guardian’

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades and even more simple configuration changes was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then to turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregrations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it but creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than used the proven Zookeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

The Fall and rise of search in a world of Big Data – part 2

The theme of Big Data continued at the next conference I attended, the first Enterprise Search Europe held in London. There was a good mix of presentations ranging from the academic to the practical, my favourite probably being Martin Belam and colleague’s talk about using Solr to dynamically generate content for the new Guardian Books site. I was lucky enough to be able to talk about the real business benefits of open source search along with one of our customers, Stephen Wicks, CTO of Gorkana Group, which drew some interesting questions. We also ran a combined Meetup on the Monday evening, combining Enterprise Search Cambridge with Enterprise Search London.

There did seem to be a rather negative spin on search from many presenters – saying that search technology is misunderstood, more costly than expected, rarely works and hasn’t seen much recent innovation. Some of this is true – but I see this as an opportunity rather than a problem. There is more focus on the world of search now than before due to some high-profile acquisitions; people are questioning the value and capability of search technology. Those of us working at the cutting edge, delivering real working solutions, should perhaps take this opportunity to say that yes, it can be done, at a sensible cost, and it can deliver real business benefit. Perhaps as we move further into the world of Big Data we’ll realise the true value of effective search.

Tags: , , , ,

Posted in events

October 31st, 2011

No Comments »

Whitepaper – Why you should be considering open source search

I’ve uploaded a whitepaper I wrote a short while ago :

“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”

It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.

Open source in the UK

We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London on discussing the skills necessary for building content management systems (search being an important part of this).

Guildfoss are also organising the the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.

We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.

Online Information 2010 – it’s quiet, too quiet

We dropped in to the Online 2010 event at Olympia this week, and were immediately struck by how quiet the event was: yes, there’s been some terrible weather recently in the UK but there were fewer stalls than last year, a smaller exhibition space and very few exhibitors in the enterprise search space – no Autonomy, Google, Vivisimo or Endeca for example. Unlike previous years there was no dedicated ’search’ area on the exhibition floor, and we did see a few unmanned stands from mid afternoon. Is this is a sign of difficult times or of an event that needs a rethink about its focus?

We didn’t attend the conference that runs next to the exhibition hall this year. This report on the closing panel shows that one question to the panel was about the rise of open source search – not surprisingly, the panel members (all being from closed source companies) weren’t very enthusiastic about this. According to Autonomy open source is only for the commodity end of the market, which is the smallest part. I’m not sure Twitter (1 billion queries a day), LinkedIn (30 million users), The Guardian (innovative open platform) or the Financial Times would agree…

When search isn’t just search at The Guardian

A fascinating event last night as the Guardian team told us more about how they’ve used open source search technology to build their new open platform. The presentations were brief and to-the-point, and covered how the team have created a detailed, rich API to their news content, all built on the open source engine Apache Solr – opening up Guardian Media Group content to the world for mashups, repurposing and innovative new business models.

The Guardian have an existing Oracle database with J2EE web applications to serve content, but discovered that certain operations such as returning content with multiple tags, or dynamically generated ‘related’ content, were very database-intensive and difficult to scale. The use of Solr effectively flattens the cost of these complex queries, and also allows them to scale up capacity on demand by simply spinning up more Solr instances on the Amazon EC2 cloud . Interestingly, site search for the Guardian website doesn’t yet use Solr, although they hope to move this across soon.

What we’re seeing here is a change in how search technology is used especially by forward-looking organisations – from being a bolt-on to an existing website or application, search is now the platform for new developments. I’ll be talking about other ways open source search has been used for news content at the British Computer Society this coming Thursday 21st October – I believe there are still a few places available.

Tags: , , , ,

Posted in Technical, events

October 19th, 2010