streaming – Flax

Activate 2018 day 1 – AI and Search in Montreal

Charlie Hull — Tue, 30 Oct 2018 13:34:53 +0000

Activate is the successor to the Lucene/Solr Revolution conference that our partner Lucidworks runs every Autumn and was held this year in Montreal, Canada. After running a successful Lucene Hackday on the Monday before the conference, we joined hundreds of others to hear Will Hayes, the CEO of Lucidworks, explain the new name and direction of the event – it was nice to hear he agrees with me that search is the key to AI. Yoshua Bengio of local AI laboratory MILA followed Will and described some recent breakthroughs in AI including speech recognition, image recognition and went on to talk about Creative AI which can ‘imagine’ new faces after sufficient training. He listed five necessary ingredients for successful machine learning: lots of data, flexible models, enough compute power, computationally efficient inference and powerful prior assumptions to deflect the ‘curse of dimensionality’. These are hard to get right – he told us how even cutting-edge AI is still far from human-level intelligence but can be used to extend human cognitive power. MILA is the greatest concentration of academics working in deep learning in the world and heavily funded by the Canadian government.

I was also pleased to notice our Luwak stored search library mentioned in the handout Bloomberg had placed on every seat!

The talks I attended after the keynote were generally focused on open source, Solr or search topics, but the theme of AI was everywhere. The first talk I went to was about Accenture’s Content Analytics Studio – which looks like a useful tool for building search and analytics applications using a library of widgets and a Python code editor. Unfortunately it wasn’t very clear how one might use this platform, with the presenter eventually admitting that it was a proprietary product but not giving any idea of the price or business model. I would much prefer if presenters were up-front about commercial products, especially as many attendees were from an open source background.

David Smiley‘s talk on Querying Hundreds of Fields at Scale was a lot more interesting: he described how Salesforce run millions of Solr cores and index extremely diverse customer data (as each one can customise their field structure). Using the usual Solr qf operator across possibly 150 fields can lead to thousands of subqueries being generated which also need to be run across each segment. His approach to optimising performance included analysing the input data per field type rather than per field, building a custom segment merge policy and encoding the field type as a term suffix in the term dictionary. Although this uses more CPU time, it improves performance by at least a factor of 10. David hopes to contribute some of this work back to Solr as open source, although much is specific to Salesforce’ use case. This was a fascinating talk about some very clever low-level Lucene techniques.

Next was my favourite talk of the conference – Kevin Watters on the Intersection of Robotics, Search & AI, featuring a completely 3D-printed humanoid robot based on the open source InMoov platform and MyRobotLab software. Kevin has used hundreds of open source projects to add capabilities such as speech recognition, question answering (based on Wikipedia), computer vision, deep learning etc. using a pub/sub architecture. The robot’s ‘memory’ – everything it does, sees, hears and how the various modules interact – is stored in a Solr index. Kevin’s engaging talk showed us examples of how the robot’s search engine powered memory can be used for deep learning, for example for image recognition – in his demo it could be trained to recognise pictures of some Solr commmitters. This really was the crossover between search and AI!

Joel Bernstein then took us through Applied Mathematical Modelling with Apache Solr – describing the ongoing work to integrate the Apache Commons Math library. In particular he showed how these new features can be used for anomaly detection (e.g. an unusually slow network connection) using a simple linear regression model. Solr’s Streaming API can be used to run a constant prediction of the likely response times for sending files of a certain size and any statistically significant differences noted. This is just one example of the powerful features now available for Solr-based analytics – there was more to come in Amrit Sarkar‘s talk afterwards on Building Analytics Applications with Streaming Expressions. Amrit showed a demo (code available here) using Apache Zeppelin where Solr’s various SQL-style operations can be run in parallel for better performance, splitting the job up over a number of worker collections. As the demo imported data directly from a database using a JDBC connector, some of us in the room wondered whether this might be a higher-performing alternative to the venerable (and slow) Data Import Handler…

That was the last talk I saw on Wednesday: that evening was the conference party in a nearby bar, which was a lot of fun (although the massive TV screen showing that night’s hockey game was a little distracting!). I’ll write about day 2 soon: videos of the talks are likely to be available soon on Lucidworks’ Youtube channel and I’ll update this post when they appear.

The post Activate 2018 day 1 – AI and Search in Montreal appeared first on Flax.

Flax announces partnership with Apache Kafka creators Confluent

Charlie Hull — Thu, 07 Apr 2016 10:22:14 +0000

We’re very happy to announce our partnership with Confluent, which was founded by the creators of Apache Kafka, a stream data platform and the central component of their Confluent Platform. Flax has been aware of Kafka since its inception at LinkedIn, where it is used as the messaging backbone for a wide array of technical and business data, like click stream events, ad impressions, social network change events, systems monitoring, messaging, analytics and logging applications.

Kafka has been described as ‘TiVo for data’ – you can put pretty much any streaming data into Kafka, store it in a distributed and resilient way and then play it out again from any point. It’s highly scalable and integrates well with other Big Data tools such as Apache Hadoop. We’ve used Kafka and its sister project Apache Samza to develop prototype high-performance media monitoring systems and we’re also using it along with Elasticsearch, Logstash and Kibana (the ELK stack) to develop log monitoring and analysis systems. We’re hearing about many other potential uses of Kafka in the Big Data and Internet of Things ecosystems.

Our partnership with Confluent will allow us to work more closely together to provide a foundation for delivering better solutions faster for our customers based on Kafka and Confluent Platform, a complete and fully supported streaming data system based on Kafka and Hadoop.

“Kafka is creating a new paradigm for organizations and allowing businesses across industries to make informed, timely decisions from their data in real time” said Jabari Norton, VP Business Development at Confluent. “We are excited to include Flax among the ranks of a growing landscape of diverse partners and systems integrators committed to unlocking the potential of streaming data for their customers.”

We’ll be talking at the London Kafka meetup on April 13th if you’d like to find out more or discuss a potential Kafka project – if you can’t make it do get in touch.

The post Flax announces partnership with Apache Kafka creators Confluent appeared first on Flax.

Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara

Charlie Hull — Thu, 18 Feb 2016 11:42:07 +0000

Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture:
Elasticsearch provides querying capabilities and visualisations using Kibana. Joseph’s team have focused on making the system (and tasks that run on it) easy to deploy and use, with this currently managed using Ansible and TeamCity although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With this short a timescale, Bruno took a pragmatic approach combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can either be done at ingestion time (which as you must store all the data you might need in separated chunks can end up taking up huge amounts of storage) or query time (which is far more flexible especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather uses an in-memory key-value store. For resiliency, this creates a log which is emitted as a Kafka stream. His approach seems to have huge performance implications – he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see if our own work combining our stored query library Luwak with Samza could be combined with Samsara.

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

Enterprise Search Europe 2015: Fishing the big data streams – the future of search

Charlie Hull — Wed, 28 Oct 2015 12:09:52 +0000

Enterprise Search Europe 2015: Fishing the big data streams – the future of search from Charlie Hull

The post Enterprise Search Europe 2015: Fishing the big data streams – the future of search appeared first on Flax.

Searching for opportunities in Real-Time Analytics

Charlie Hull — Mon, 02 Feb 2015 17:18:22 +0000

I spent a day last week at a new event from UNICOM, a conference on Real-Time Analytics. Mike Ferguson chaired the event and was kind enough to spend time with me over lunch exploring how search software might fit into the mix, something that has been on my mind since hearing about the Unified Log concept a few weeks ago.

Real-Time Analytics is a field where sometimes vast amounts of data in motion is gathered, filtered, cleaned and analysed to trigger various actions to benefit a business: building on earlier capabilities in Business Intelligence, the endgame is a business that adapts automatically to changing conditions in real-time – for example, automating the purchasing of extra stock based on changing behaviour of customers. The analysis part of this chain is driven by complex models, often based on sets of training data. Complex Event Processing or CEP is an older term for this kind of process (if you’re already suffering from buzzword overflow, Martin Kleppman has put some of these terms in context for those more familiar with web paradigms). Tools mentioned included Amazon Kinesis and from the Apache stable Cassandra, Hadoop, Kafka, Yarn, Storm and Spark. I particularly enjoyed Michael Cutler‘s presentation on Tumra’s Spark-based system.

One of the central problems identified was due to the rapid growth of data (including from the fabled Internet of Things) it will shortly be impossible to store every data point produced – so we must somehow sort the wheat from the chaff. Options for the analysis part include SQL-like query languages and more complex machine learning algorithms. I found myself wondering if search technology, using a set of stored queries, could be used somehow to reduce the flow of this continuous stream of data, using something like this prototype implementation based on Apache Samza. One could use this approach to transform unstructured data (say, a stream of text-based customer comments) into more structured data for later timeline analysis, split streams of events into several parts for separate processing or just to watch for sets of particularly interesting and complex events. Although search platforms such as Elasticsearch are already being integrated into the various Real-Time Analytics frameworks, these seem to be being used for offline processing rather than acting directly on the stream itself.

One potential advantage is that it might be a lot easier for analysts to generate a stored search than to learn SQL or the complexities of machine learning – just spend some time with a collection of past events and refine your search terms, facets and filters until your results are useful, and save the query you have generated.

This was a very interesting introduction to a relatively new field and thanks to UNICOM for the invitation. We’re going to continue to explore the possibilities!

The post Searching for opportunities in Real-Time Analytics appeared first on Flax.