samza – Flax

Apache Kafka London Meetup – Real time search and insights

Charlie Hull — Thu, 14 Apr 2016 09:50:05 +0000

We’ve been watching the rise of Apache Kafka as a streaming data solution for a while – as part of a collection of Big Data tools, it provides a ‘TiVo for data’ feature. We’ve begun to use it in client projects covering both search and log analysis, and we’ve recently partnered with Confluent, the company founded by the creators of Kafka.

Last night we spoke at the Apache Kafka London Meetup, hosted by British Gas Connected Homes. It was well supplied with drinks, pizza and snacks and very well attended – there was a great buzz of conversation before the talks had even started! Alan Woodward of Flax started with an updated talk about our proof-of-concept integration of Kafka, Apache Samza and our own Luwak streaming search library (slides are available here). This allows full-text search within a Kafka stream, with the search queries supplied as another stream, for a truly real-time solution, as opposed to the more usual (and much higher latency) approach of indexing the endpoint of a stream. Alan has also tried the very new Kafka Streams feature, which can be used as an alternative to Apache Samza. There is some very early code available, although note that this still needs some work! (We’ll update this blog when it’s finished.)
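To make the idea of searching a stream with queries that arrive as another stream more concrete, here is a minimal sketch in plain Python. It is not Luwak's API or the Samza integration described in the talk: the event format, the `match_stream` function and the simple "all terms must appear" matching rule are assumptions made for illustration only.

```python
import re
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical event shape: (kind, id, text), where kind is "query" or "doc".
Event = Tuple[str, str, str]


def tokenize(text: str) -> Set[str]:
    """Lower-case the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_stream(events: Iterable[Event]) -> Iterator[Tuple[str, str]]:
    """Yield (doc_id, query_id) for every document matching a stored query.

    Queries and documents arrive interleaved on one iterable. A query is
    stored when it arrives and applies to every later document. A query
    matches when all of its terms occur in the document (AND semantics).
    """
    stored: Dict[str, Set[str]] = {}
    for kind, ident, text in events:
        if kind == "query":
            stored[ident] = tokenize(text)
        elif kind == "doc":
            terms = tokenize(text)
            for query_id, query_terms in stored.items():
                if query_terms and query_terms <= terms:
                    yield ident, query_id


events = [
    ("query", "q1", "kafka samza"),
    ("doc", "d1", "Samza runs on top of Kafka"),
    ("query", "q2", "luwak"),
    ("doc", "d2", "Luwak matches stored queries"),
    ("doc", "d3", "Kafka and Samza with Luwak"),
]
matches = list(match_stream(events))
```

The point of the sketch is the inversion: documents are never indexed, they are simply checked against whatever queries are registered at the moment they arrive, which is why results can be produced with very little delay.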

The second talk was by one of our hosts, Josep Casals, on how British Gas have used Kafka, Spark Streaming and Apache Cassandra to build a platform for analysing data from smart meters, boilers and thermostats. Over 2 million smart meters are installed across the UK and there are also over 300,000 connected thermostats, plus many other data sources. The meters can report every 30 minutes and the thermostats every 2 minutes, so their system has to cope with around 30,000 messages/second. One interesting feature for me was how machine learning is used to disaggregate power consumption data, so that the consumption of, say, a fridge can be split out from the overall figure. Apache Samza is also used in this system to provide estimates of consumption and interpolate between readings, allowing data to be fed back to an app on the customer’s mobile device. Further use cases include spotting outlier events, which might indicate failing heating devices, or even unusual patterns in an elderly person’s home, which could be used to alert relatives or carers.
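As a rough back-of-envelope check on those device numbers, the sketch below computes the average message rate implied by the meters and thermostats alone. It assumes exactly 2 million meters and 300,000 thermostats (the talk gave these as lower bounds) reporting evenly at the quoted intervals. The gap between this average and the quoted 30,000 messages/second presumably reflects the other data sources and peak load, but that is our assumption and was not broken down in the talk.

```python
def average_rate(devices: int, interval_seconds: float) -> float:
    """Average messages per second if each device reports once per interval."""
    return devices / interval_seconds


# Figures from the talk, treated here as exact for the sake of the estimate.
meter_rate = average_rate(2_000_000, 30 * 60)    # one reading every 30 minutes
thermostat_rate = average_rate(300_000, 2 * 60)  # one reading every 2 minutes
combined_rate = meter_rate + thermostat_rate

print(f"meters:      {meter_rate:,.0f} msg/s")
print(f"thermostats: {thermostat_rate:,.0f} msg/s")
print(f"combined:    {combined_rate:,.0f} msg/s")
```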

Both talks were live streamed and you can watch them here.

We concluded with some informal discussion and a chance to meet some of Confluent’s UK-based team. Thanks to the organisers and hosts – we look forward to returning! If you have a Kafka project and you’d like any help or advice, do let us know.

The post Apache Kafka London Meetup – Real time search and insights appeared first on Flax.

Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem

Charlie Hull — Thu, 03 Mar 2016 10:01:00 +0000

We’ve been working on a number of projects recently involving open source software often described as ‘Big Data’ solutions – here’s a quick overview of them.

The grandfather of them all is, of course, Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly, Hadoop was originally created by Doug Cutting, who also wrote Lucene (the search library used by Apache Solr and Elasticsearch) and the Nutch web crawler. We’ve been helping clients distribute processing tasks using Hadoop’s MapReduce framework and speed up their indexing from Hadoop into Elasticsearch. Other projects we’ve used in the Hadoop ecosystem include Apache ZooKeeper (used to coordinate lots of Solr servers into a distributed SolrCloud) and Apache Spark (for distributed processing).

We’re increasingly using Apache Kafka (a message broker) for handling large volumes of streaming data, for example log files. Kafka provides persistent storage of these streams, which might be ingested and pre-processed using Logstash, then indexed with Elasticsearch and visualised with Kibana to build high-performance monitoring systems. Throughput of thousands of items a second is not uncommon, and these open source systems can easily match the performance of proprietary monitoring engines such as Splunk at a far lower cost. Apache Samza, a stream processing framework, is based on Kafka and we’ve used it to build a powerful system for full-text search of streams. Note that Elasticsearch has a similar ‘stored search’ feature called Percolator, but this is quite a lot slower (as others have confirmed).
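The persistent storage mentioned above is what lets consumers rewind and replay a stream. The toy model below illustrates that idea in plain Python: an append-only list of records, with each consumer tracking its own read offset. `ToyLog` and its methods are hypothetical names invented for this sketch and bear no relation to the real Kafka client API, which also deals with partitions, retention and durability.

```python
from typing import Dict, List


class ToyLog:
    """An append-only log where each named consumer keeps its own offset."""

    def __init__(self) -> None:
        self._records: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, record: str) -> int:
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[str]:
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Rewind (or skip ahead) so the next poll starts at this offset."""
        self._offsets[consumer] = offset


log = ToyLog()
for line in ["GET /a 200", "GET /b 500", "GET /c 200"]:
    log.append(line)

first_read = log.poll("indexer")    # reads everything written so far
log.seek("indexer", 1)              # rewind to the second record
replayed = log.poll("indexer")      # re-reads from offset 1 onwards
late_reader = log.poll("dashboard") # a new consumer still sees the full history
```

Because reading does not remove anything from the log, an indexer can be rebuilt from scratch, or a new consumer added later, without asking the producers to resend their data.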

Most of the above systems are written in Java or, if not, run on the Java Virtual Machine (JVM), so our experience building large, performant and resilient systems on this platform has been invaluable. We’ll be writing in more detail about these projects soon. I’ve always said that search experts have been dealing with Big Data since well before it gained popularity as a concept – so if you’re serious about Big Data, ask us how we could help!


Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara

Charlie Hull — Thu, 18 Feb 2016 11:42:07 +0000

Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations, and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture:

Elasticsearch provides querying capabilities, with visualisations using Kibana. Joseph’s team have focused on making the system (and the tasks that run on it) easy to deploy and use. Deployment is currently managed using Ansible and TeamCity, although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.
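For readers unfamiliar with the windowed aggregations mentioned among the use cases above, here is a small sketch of the simplest kind, a tumbling (fixed, non-overlapping) window count. It is plain Python rather than Samza code, and the event names and the `tumbling_counts` function are invented for illustration; we don't know what Skyscanner's actual jobs look like.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def tumbling_counts(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Count events per key within fixed, non-overlapping time windows.

    Each event is (timestamp_in_seconds, key). The result maps
    (window_start, key) to the number of events seen in that window.
    """
    counts: Counter = Counter()
    for timestamp, key in events:
        # Round the timestamp down to the start of the window it falls in.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical events: (seconds since start, event type).
events = [
    (0.5, "search"),
    (12.0, "search"),
    (59.9, "booking"),
    (60.0, "search"),
    (61.2, "search"),
]
per_minute = tumbling_counts(events, 60)
```

A real stream processor would emit each window's result as the window closes instead of waiting for the input to end, and would have to decide how to treat late-arriving events. The sketch ignores both concerns.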

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With this short a timescale, Bruno took a pragmatic approach, combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can be done either at ingestion time (which can end up taking huge amounts of storage, as you must store all the data you might need in separate chunks) or at query time (which is far more flexible, especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather uses an in-memory key-value store. For resiliency, this creates a log which is emitted as a Kafka stream. His approach seems to have huge performance implications – he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system.

Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see whether our own work integrating our stored query library Luwak with Samza could be combined with Samsara.
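The state-handling approach Bruno described (an in-memory key-value store made resilient by emitting a log of its changes) can be sketched in a few lines of Python. This is only an illustration of the general pattern, not Samsara's code: `ChangelogStore` is a hypothetical class, and the plain list standing in for the changelog would be a Kafka stream in the system he described.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# One changelog entry: (key, new value). A value of None records a deletion.
Change = Tuple[str, Optional[int]]


class ChangelogStore:
    """In-memory key-value store that records every change in a log."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self.changelog: List[Change] = []  # stand-in for an emitted stream

    def put(self, key: str, value: int) -> None:
        self._data[key] = value
        self.changelog.append((key, value))

    def delete(self, key: str) -> None:
        self._data.pop(key, None)
        self.changelog.append((key, None))

    def get(self, key: str, default: int = 0) -> int:
        return self._data.get(key, default)

    @classmethod
    def restore(cls, changelog: Iterable[Change]) -> "ChangelogStore":
        """Rebuild a store by replaying a changelog from the beginning."""
        store = cls()
        for key, value in changelog:
            if value is None:
                store.delete(key)
            else:
                store.put(key, value)
        return store


store = ChangelogStore()
store.put("clicks", 1)
store.put("clicks", 2)
store.put("views", 5)
store.delete("views")

# Simulate losing the process and recovering its state from the log alone.
recovered = ChangelogStore.restore(store.changelog)
```

Reads and writes only ever touch memory, which is where the speed comes from, while the log provides a way to rebuild the state elsewhere after a failure.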

Thanks to Alex Dean of Snowplow for organising a very interesting evening and, of course, to both speakers.


Enterprise Search Europe 2015: Fishing the big data streams – the future of search

Charlie Hull — Wed, 28 Oct 2015 12:09:52 +0000

Enterprise Search Europe 2015: Fishing the big data streams – the future of search from Charlie Hull
