Elasticsearch Meetup – Spark, postcodes and Couchbase

Three speakers for this month’s Elasticsearch Meetup (slides now up), kindly hosted by JustEat’s technical department. Neil Andrassy kicked us off with a talk about how TheFilter (which you may know counts Peter Gabriel as an investor) use Apache Spark to load data into their Elasticsearch cluster. Neil described how Spark and Elasticsearch have superseded both Microsoft SQL Server and MongoDB – Spark in particular being described as ‘speedy, flexible and componentized’, with Spark’s RDDs (Resilient Distributed Datasets) mapping cleanly to Elasticsearch shards. He then showed a demo of UK road accident data being imported into Spark as CSV files, indexed automatically in Elasticsearch and then queried both through Elasticsearch and through Spark’s SQL-like facility. Interestingly, this allows a powerful combination of free text search and relational JOINs to be applied to data in a highly scalable fashion. Spark also features machine learning and streaming data components.

After a quick plug for ElastiCON in London in November, Matt Jones of JustEat described how they have used Elasticsearch’s geolocation search function to improve their handling of restaurant delivery areas. Their previous system only handled the first part of postcodes (e.g. ‘SE1’) and they needed finer-grained control of the areas that restaurants were able to deliver to. By indexing polygons representing UK postcode areas and combining these with custom shapes (e.g. a circle representing a maximum delivery distance) they have created a powerful and extendable way to restrict search results. Matt has blogged about this in more detail.
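
To make the delivery-area idea concrete, here is a minimal sketch of the kind of query body that could restrict results to restaurants whose indexed shape covers a customer’s location. It is not JustEat’s actual query: the field name `delivery_area` and the helper `delivery_query` are hypothetical, and the syntax follows the `geo_shape` query in current Elasticsearch documentation, which may differ from the version in use at the time.

```python
def delivery_query(lon, lat, field="delivery_area"):
    """Build a query body matching documents whose shape covers a point.

    `field` is a hypothetical geo_shape field holding each restaurant's
    delivery area (postcode polygons and/or custom shapes).
    """
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_shape": {
                        field: {
                            # GeoJSON order is [longitude, latitude].
                            "shape": {"type": "point", "coordinates": [lon, lat]},
                            "relation": "intersects",
                        }
                    }
                }
            }
        }
    }


# Example: a customer location in central London (illustrative coordinates).
query = delivery_query(-0.0877, 51.5045)
```

The resulting dictionary would be sent as the body of a search request; any free text or cuisine filters could sit alongside the shape filter in the same `bool` clause.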

The last talk was by Tom Green of Couchbase, who described how this powerful NoSQL platform is architected and how it can be connected directly to Elasticsearch using its own Cross Data Centre Replication (XDCR) feature. We finished with the usual Q&A, during which Mark Harwood responded to my own question on exact facet counts in Elasticsearch with a plea to the industry to be more honest about the limitations of distributed systems – perhaps, as with the CAP theorem, we need a triangle with vertices of Big Data, Speed and Accuracy – pick two!
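
The point about exact facet counts can be illustrated with a toy simulation. This is a simplified sketch, not Elasticsearch’s actual implementation: it assumes each shard returns only its local top terms and a coordinator sums whatever it receives. The data and the function names `shard_top_terms` and `merged_top_terms` are invented for illustration.

```python
from collections import Counter


def shard_top_terms(shard_docs, shard_size):
    """Each shard counts its own terms and returns only its local top few."""
    return Counter(shard_docs).most_common(shard_size)


def merged_top_terms(shards, size, shard_size):
    """The coordinator sums the partial lists, then keeps the top `size`."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical data: one term per document, spread over three shards.
shards = [
    ["pizza"] * 5 + ["curry"] * 4 + ["sushi"] * 3,
    ["pizza"] * 5 + ["sushi"] * 4 + ["curry"] * 3,
    ["curry"] * 5 + ["sushi"] * 4 + ["pizza"] * 1,
]

# Exact counts need every shard to report every term.
exact = Counter(term for docs in shards for term in docs)

# Approximate counts: each shard reports only its top two terms.
approx = dict(merged_top_terms(shards, size=3, shard_size=2))
```

With these numbers the exact totals put ‘curry’ first, while the merged partial lists undercount every term and put ‘pizza’ first. Asking each shard for more terms restores accuracy at the cost of moving more data, which is the trade-off the triangle describes.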

Thanks as ever to all the speakers and the hosts, and to Yann Cluchey for organising the Meetup.
