Posts Tagged ‘database’

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software can do, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but also internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch, with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process. Oversharding led to problems caused by Java garbage collection. Some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process. They were also keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades, and even for simpler configuration changes, was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you to submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it by creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.
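
To make the ‘tuplewise transform’ idea a little more concrete, here is a rough sketch of how a nested document might be restructured into ‘field name, field value’ pairs before indexing. This is my own illustration, not Orchestrate’s code: the function name `tuplewise_transform`, the dotted path convention and the `name`/`value` keys are all assumptions made for the example.

```python
from typing import Any


def tuplewise_transform(doc: dict[str, Any], prefix: str = "") -> list[dict[str, Any]]:
    """Flatten a nested document into a list of {"name", "value"} pairs.

    Hypothetical helper: the dotted-path naming and the output keys are
    assumptions for illustration, not Orchestrate's actual format.
    """
    pairs: list[dict[str, Any]] = []
    for key, value in doc.items():
        # Build a dotted path such as "author.name" for nested fields.
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects.
            pairs.extend(tuplewise_transform(value, path))
        elif isinstance(value, list):
            # Emit one pair per list element, recursing into nested objects.
            for item in value:
                if isinstance(item, dict):
                    pairs.extend(tuplewise_transform(item, path))
                else:
                    pairs.append({"name": path, "value": item})
        else:
            pairs.append({"name": path, "value": value})
    return pairs


# Two documents whose "id" fields have different types end up with the
# same outer structure: a list of name/value pairs.
doc_a = {"id": 7, "tags": ["search", "nosql"], "author": {"name": "Ian"}}
doc_b = {"id": "seven", "author": {"name": "Jenny"}}
pairs_a = tuplewise_transform(doc_a)
pairs_b = tuplewise_transform(doc_b)
```

Every document now presents the same small set of field names to the search engine, whatever fields the original data contained. A real system would presumably also need to deal with values of differing types, which this sketch does not attempt. It also hints at why term statistics change: all values share one field rather than each having their own.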

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than using the proven ZooKeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now that Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks to Yann Cluchey for organising, and to all the speakers and the Elasticsearch team. See you again soon I hope.

London Enterprise Search Meetup – Databases vs. Search and Taxonomies

Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.

Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.

Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. There were some interesting points here, including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.

Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.

Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.

Posted in events

April 14th, 2011

Replacing relational databases with search engines for simple lookups

One of the things we often notice about existing systems based on relational databases (RDB) is that as they scale to millions of items, simple lookup tasks become slow and inefficient. These tasks don’t usually require complicated database operations, so in most cases it is possible to relocate the data from the RDB into a search engine like Flax.

Consider a system where a search engine has already been implemented to search textual product information, but numerical data on each product, such as price, is still being stored in an RDB. Users will often need filters on search results such as ‘show me items under £10’ and so an RDB operation similar to ‘SELECT productID FROM products WHERE price < 10’ will be needed, in addition to the search engine query. Modern search engines like Flax implement range search functions, so that numerical information can be added to documents, and it is thus possible to carry out this operation in the search engine as part of the full-text search for the product information.
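
As a toy illustration of the idea (and emphatically not how Flax or any real search engine is implemented), the sketch below keeps the price alongside a tiny in-memory inverted index, so the text match and the ‘under £10’ filter are answered in one place. The class and method names (`Product`, `TinyIndex`, `search`, `max_price`) are invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Product:
    product_id: int
    description: str
    price: float


class TinyIndex:
    """Toy inverted index that also stores one numeric value per document."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = {}  # term -> product ids
        self._prices: dict[int, float] = {}       # product id -> price

    def add(self, product: Product) -> None:
        self._prices[product.product_id] = product.price
        # Naive tokenisation: lowercase and split on whitespace.
        for term in product.description.lower().split():
            self._postings.setdefault(term, set()).add(product.product_id)

    def search(self, query: str, max_price: Optional[float] = None) -> list[int]:
        terms = query.lower().split()
        if not terms:
            return []
        # All query terms must match (AND semantics).
        matches = set.intersection(*(self._postings.get(t, set()) for t in terms))
        # Apply the price filter in the same step, with no separate
        # database lookup needed.
        if max_price is not None:
            matches = {pid for pid in matches if self._prices[pid] < max_price}
        return sorted(matches)


index = TinyIndex()
index.add(Product(1, "red ceramic mug", 7.50))
index.add(Product(2, "red ceramic mug large", 12.00))
index.add(Product(3, "blue glass vase", 9.00))

cheap_red_mugs = index.search("red mug", max_price=10)  # text match plus filter
all_red_mugs = index.search("red mug")                   # text match only
```

A production engine would use a proper range index rather than checking each match in turn, but the point stands: once the numeric data lives with the documents, the second query against the RDB disappears.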

We’ve noticed with several clients that it is now possible to move all their data from the original RDB into the search engine. This can obviously lead to cost savings, as only one system must be hosted, maintained and backed up, and scaling out can be far simpler.

Another way to look at this is to consider a search engine as an example of a document-oriented database.

Posted in Technical

August 27th, 2009