stream processing – Flax http://www.flax.co.uk The Open Source Search Specialists Thu, 10 Oct 2019 09:03:26 +0000 en-GB hourly 1 https://wordpress.org/?v=4.9.8 Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/ http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/#comments Thu, 18 Feb 2016 11:42:07 +0000 http://www.flax.co.uk/?p=3026 Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at … More

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

]]>
Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture: IMAG0507
Elasticsearch provides querying capabilities and visualisations using Kibana. Joseph’s team have focused on making the system (and tasks that run on it) easy to deploy and use, with this currently managed using Ansible and TeamCity although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With this short a timescale, Bruno took a pragmatic approach combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can either be done at ingestion time (which as you must store all the data you might need in separated chunks can end up taking up huge amounts of storage) or query time (which is far more flexible especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather uses an in-memory key-value store. For resiliency, this creates a log which is emitted as a Kafka stream. His approach seems to have huge performance implications – he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see if our own work combining our stored query library Luwak with Samza could be combined with Samsara.

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/feed/ 2
Searching & monitoring the Unified Log http://www.flax.co.uk/blog/2014/12/05/searching-monitoring-the-unified-log/ http://www.flax.co.uk/blog/2014/12/05/searching-monitoring-the-unified-log/#respond Fri, 05 Dec 2014 11:10:25 +0000 http://www.flax.co.uk/blog/?p=1331 This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject … More

The post Searching & monitoring the Unified Log appeared first on Flax.

]]>
This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

The post Searching & monitoring the Unified Log appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2014/12/05/searching-monitoring-the-unified-log/feed/ 0