database – Flax

London Lucene/Solr Meetup: Query Pre-processing & SQL with Solr

Charlie Hull — Fri, 02 Jun 2017 14:31:32 +0000

Bloomberg kindly hosted the London Lucene/Solr Meetup last night and we were lucky enough to have two excellent speakers for the thirty or so attendees. René Kriegler kicked off with a talk about the Querqy library he has developed to provide a pre-processing layer for Solr (and soon, Elasticsearch) queries. This library was originally developed during a project for Germany’s largest department store Galeria Kaufhof and allows users to add a series of simple rules in a text file to raise or lower results containing certain words, filter out certain results, add synonyms and decompound words (particularly important for German!). We’ve seen similar rules-based systems in use at many of our e-commerce clients, but few of these work well with Solr (Hybris in particular has a poor integration with Solr and can produce some very strange Solr queries). In contrast, Querqy is open source and designed by someone with expert Solr knowledge. With the addition of a simple UI or an integration with a relevancy-testing framework such as Quepid, this could be a fantastic tool for day-to-day tuning of search relevance – without the need for Solr expertise. You can find Querqy on GitHub.
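To make the idea of a rules-based pre-processing layer a little more concrete, here is a minimal Python sketch. It is not Querqy's code, rule syntax or API: the `Rule` and `RewrittenQuery` classes, the `rewrite` function and the example field names are all hypothetical, and decompounding is left out entirely. It only shows the general shape of the technique, where a rule fires when its trigger word appears in the query and then adds synonyms, boosts or filters before the query reaches the search engine.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical rule that fires when `trigger` appears in the query."""
    trigger: str
    synonyms: list = field(default_factory=list)    # extra terms matching like the trigger
    boost_up: dict = field(default_factory=dict)    # term -> weight to raise results
    boost_down: dict = field(default_factory=dict)  # term -> weight to lower results
    filters: list = field(default_factory=list)     # filter expressions to apply


@dataclass
class RewrittenQuery:
    terms: list    # one list of alternatives per original query token
    boosts: dict   # term -> net weight (negative means lowered)
    filters: list  # filter expressions collected from the rules that fired


def rewrite(query, rules):
    """Apply every matching rule to a whitespace-tokenised, lowercased query."""
    tokens = query.lower().split()
    result = RewrittenQuery(terms=[[t] for t in tokens], boosts={}, filters=[])
    for rule in rules:
        if rule.trigger not in tokens:
            continue
        # Synonyms become alternatives for the token that triggered the rule.
        for alternatives in result.terms:
            if alternatives[0] == rule.trigger:
                alternatives.extend(rule.synonyms)
        for term, weight in rule.boost_up.items():
            result.boosts[term] = result.boosts.get(term, 0) + weight
        for term, weight in rule.boost_down.items():
            result.boosts[term] = result.boosts.get(term, 0) - weight
        result.filters.extend(rule.filters)
    return result


# Example rule set (all values are made up for illustration).
example_rules = [
    Rule(
        trigger="laptop",
        synonyms=["notebook"],
        boost_down={"case": 50},
        filters=["category:computers"],
    )
]
example = rewrite("cheap laptop", example_rules)
```

In a real deployment the rewritten structure would then be translated into the search engine's own query syntax. The point of keeping the rules as plain data is that someone without Solr expertise can edit them.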

Michael Suzuki of Alfresco talked next about the importance of being bilingual (actually he speaks 4 languages!) and how new features in Solr version 6 allow one to use either Solr syntax, SQL expressions or a combination of both. This helps hide Solr’s complexity and also allows easy integration with database administration and reporting tools, while allowing use of Solr by the huge number of developers and database administrators familiar with SQL syntax. Using a test set from the IMDB movie archive he demonstrated how SQL expressions can be used directly on a Solr index to answer questions such as ‘what are the highest grossing film actors’. He then used the visualisation tool Apache Zeppelin to produce various graphs based on these queries and also showed DbVisualizer, a commonly used database administration tool, connecting directly to Solr via JDBC and showing the index contents as if they were just another set of SQL tables. He finished by talking briefly about the new statistical programming features in Solr 6.6 – a powerful new development with features similar to the R language.
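As a rough illustration of what a SQL query against Solr looks like over HTTP, the Python sketch below builds (but does not send) a request for Solr's `/sql` request handler, which takes the statement in a `stmt` parameter. The host, the `movies` collection and the `actor` and `gross` fields are assumptions made up for this example, not details from the talk, and which SQL constructs are accepted depends on the Solr version in use.

```python
from urllib.parse import urlencode


def build_solr_sql_request(base_url, collection, statement):
    """Return the URL and form-encoded body for a POST to Solr's /sql handler."""
    url = f"{base_url.rstrip('/')}/{collection}/sql"
    body = urlencode({"stmt": statement})
    return url, body


# Hypothetical collection and field names, for illustration only.
statement = (
    "SELECT actor, sum(gross) FROM movies "
    "GROUP BY actor ORDER BY sum(gross) desc LIMIT 10"
)
url, body = build_solr_sql_request("http://localhost:8983/solr/", "movies", statement)
```

The pair could then be sent with any HTTP client as a form-encoded POST. Tools such as Zeppelin or a database administration client would normally go through the JDBC driver instead.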

We continued with a brief Q&A session. Thanks to both our speakers – we’ll be back again soon!

The post London Lucene/Solr Meetup: Query Pre-processing & SQL with Solr appeared first on Flax.

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Charlie Hull — Tue, 16 Dec 2014 14:22:30 +0000

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, and some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process. They were also keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades, and even for simpler configuration changes, was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you to submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it by creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.
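The following Python sketch shows one way a transform of this kind could work. It is a guess at the general idea, not Orchestrate's implementation: the function name, the `field_name` and `field_value` keys, the dotted path convention and the choice to store every value as a string are all assumptions. The motivation, as I understand it, is that dynamic mapping fixes a field's type from the first value it sees, so a later document with a different type for the same field can be rejected. With name/value pairs, every document has the same fixed shape whatever fields it started with.

```python
def tuplewise_transform(document, prefix=""):
    """Flatten a nested dict into a list of field name / field value pairs."""
    pairs = []
    for key, value in document.items():
        # Nested fields get a dotted path such as "name.first".
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            pairs.extend(tuplewise_transform(value, path))
        elif isinstance(value, list):
            # Each list element becomes its own pair under the same name.
            for item in value:
                if isinstance(item, dict):
                    pairs.extend(tuplewise_transform(item, path))
                else:
                    pairs.append({"field_name": path, "field_value": str(item)})
        else:
            # Values are stored as strings so their type never varies.
            pairs.append({"field_name": path, "field_value": str(value)})
    return pairs


# Two documents that disagree about the type of "age" end up with the same shape.
first = tuplewise_transform({"age": 42, "name": {"first": "Ada"}})
second = tuplewise_transform({"age": "unknown", "tags": ["a", "b"]})
```

The cost is the one raised in the Q&A: all values now share a small number of indexed fields, so per-field term statistics are no longer what they would be with one field per attribute.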

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than using the proven ZooKeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks to Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

The post Elasticsearch London user group – The Guardian & Orchestrate test the limits appeared first on Flax.

London Enterprise Search Meetup – Databases vs. Search and Taxonomies

Charlie Hull — Thu, 14 Apr 2011 08:45:47 +0000

Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.

Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.

Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. There were some interesting points here, including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.

Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.

Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.

The post London Enterprise Search Meetup – Databases vs. Search and Taxonomies appeared first on Flax.

Replacing relational databases with search engines for simple lookups

Charlie Hull — Thu, 27 Aug 2009 16:34:41 +0000

One of the things we often notice about existing systems based on relational databases (RDB) is that as they scale to millions of items, simple lookup tasks become slow and inefficient. These tasks don’t usually require complicated database operations, so in most cases it is possible to relocate the data from the RDB into a search engine like Flax.

Consider a system where a search engine has already been implemented to search textual product information, but numerical data on each product, such as price, is still being stored in an RDB. Users will often need filters on search results such as ‘show me items under £10’ and so an RDB operation similar to ‘SELECT productID FROM products WHERE price<£10’ will be needed, in addition to the search engine query. Modern search engines like Flax implement range search functions, so that numerical information can be added to documents, and it is thus possible to carry out this operation in the search engine as part of the full-text search for the product information.
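A toy Python sketch of this idea follows. It is not Flax's API: the `TinyIndex` class and its methods are invented for illustration, and a real engine would use a proper numeric range structure rather than checking each candidate's price in turn. What it shows is the price being stored alongside the indexed text, so that the ‘under £10’ restriction is applied as part of the same search call rather than through a second lookup in an RDB.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Hypothetical in-memory index holding text and a price per document."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> ids of documents containing it
        self.prices = {}                  # document id -> price

    def add(self, doc_id, text, price):
        for token in re.findall(r"\w+", text.lower()):
            self.postings[token].add(doc_id)
        self.prices[doc_id] = price

    def search(self, query, max_price=None):
        """Return sorted ids of documents containing every query token,
        optionally restricted to those priced below max_price."""
        tokens = re.findall(r"\w+", query.lower())
        if not tokens:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        if max_price is not None:
            matches = {d for d in matches if self.prices[d] < max_price}
        return sorted(matches)


# Made-up product data for illustration.
index = TinyIndex()
index.add(1, "Red ceramic mug", 8.50)
index.add(2, "Red ceramic mug, large", 12.00)
index.add(3, "Blue mug", 6.00)
cheap_red_mugs = index.search("red mug", max_price=10)
```

Here one call answers both the text query and the price restriction, which is the step that otherwise needs the separate SELECT.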

We’ve noticed with several clients that it is now possible to move all their data from the original RDB into the search engine. This can obviously lead to cost savings, as only one system must be hosted, maintained and backed up, and scaling out can be far simpler.

Another way to look at this is to consider a search engine as an example of a document-oriented database.

The post Replacing relational databases with search engines for simple lookups appeared first on Flax.