kibana – Flax http://www.flax.co.uk The Open Source Search Specialists Thu, 10 Oct 2019 09:03:26 +0000 en-GB hourly 1 https://wordpress.org/?v=4.9.8 Worth the wait – Apache Kafka hits 1.0 release http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/ http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/#respond Thu, 02 Nov 2017 09:50:20 +0000 http://www.flax.co.uk/?p=3629 We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple … More

The post Worth the wait – Apache Kafka hits 1.0 release appeared first on Flax.

]]>
We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple but powerful concepts – everything it deals with is a stream of data (like a messaging system), streams can be combined for processing and stored reliably in a highly fault-tolerant way. It’s also massively scalable.

For search applications, Kafka is a great choice for the ‘wiring’ between source data (databases, crawlers, flat files, feeds) and the search index and other parts of the system. We’ve used other message passing systems (like RabbitMQ) in projects before, but none have the simplicity and power of Kafka. Combine the search index with analysis and visualisation tools such as Kibana and you can build scalable, real-time systems for ingesting, storing, searching and analysing huge volumes of data – for example, we’ve already done this for clients in the financial sector wanting to monitor log data using open-source technology, rather than commercial tools such as Splunk.

The development of Kafka has been masterminded by our partners Confluent, and it’s a testament to this careful management that the milestone 1.0 version has only just appeared. This doesn’t mean that previous versions weren’t production ready – far from it – but it’s a sign that Kafka has now matured to be a truly enterprise-scale project. Congratulations to all the Kafka team for this great achievement.

We look forward to working more with this great software – and if you need help with your Kafka project do get in touch!

The post Worth the wait – Apache Kafka hits 1.0 release appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/feed/ 0
Elasticsearch, Kibana and duplicate keys in JSON http://www.flax.co.uk/blog/2017/08/03/inconsistent-json-semantics-headache/ http://www.flax.co.uk/blog/2017/08/03/inconsistent-json-semantics-headache/#respond Thu, 03 Aug 2017 11:05:14 +0000 http://www.flax.co.uk/?p=3576 JSON has been the lingua franca of data exchange for many years. It’s human-readable, lightweight and widely supported. However, the JSON spec does not define what parsers should do when they encounter a duplicate key in an object, e.g.: { "foo": … More

The post Elasticsearch, Kibana and duplicate keys in JSON appeared first on Flax.

]]>
JSON has been the lingua franca of data exchange for many years. It’s human-readable, lightweight and widely supported. However, the JSON spec does not define what parsers should do when they encounter a duplicate key in an object, e.g.:

{
  "foo": "spam",
  "foo": "eggs",
  ...
}

Implementations are free to interpret this how they like. When different systems have different interpretations this can cause problems.

We recently encountered this in an Elasticsearch project. The customer reported unusual search behaviour around a boolean field called draft. In particular, documents which were thought to contain a true value for draft were being excluded by the query clause

{
  "query":
    "bool": {
      "must_not": {
        "term": { "draft": false }
      },
      ...

The version of Elasticsearch was 2.4.5 and we examined the index with Sense on Kibana 4.6.3. The documents in question did indeed appear to have the value

{
  "draft": true,
  ...
}

and therefore should not have been excluded by the must_not query clause.

To get to the bottom of it, we used Marple to examine the terms in the index. Under the bonnet, the boolean type is indexed as the term “T” for true and “F” for false. The documents which were behaving oddly had both “T” and “F” terms for the draft field, and were therefore being excluded by the must_not clause. But how did the extra “F” term get in there?

After some more experimentation we tracked it down to a bug in our indexer application, which under certain conditions was creating documents with duplicate draft keys:

{
  "draft": false,
  "draft": true
  ...
}

So why was this not appearing in the Sense output? It turns out that Elasticsearch and Sense/Kibana interpret duplicate keys in different ways. When we used curl instead of Sense we could see both draft items in the _source field. Elasticsearch was behaving consistently, storing and indexing both draft fields. However, Sense/Kibana was quietly dropping the first instance of the field and displaying only the second, true, value.

I’ve not looked at the Sense/Kibana source code, but I imagine this is just a consequence of being implemented in Javascript. I tested this in Chrome (59.0.3071.115 on macOS) with the following script:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <script>
      var o = {
        s: "this is some text",
        b: true,
        b: false
      };

      console.log("value of o.b", o.b);
      console.log("value of o", JSON.stringify(o, "", 2));
    </script>
  </body>
</html>

which output (with no warnings)

value of o.b true
test.html:13 value of o {
 "s": "this is some text",
 "b": true
}

(in fact it turns out that order of b doesn’t matter, true always overrides false.)

Ultimately this wasn’t caused by any bugs in Elasticsearch, Kibana, Sense or Javascript, but the different way that duplicate JSON keys were being handled made finding the ultimate source of the problem harder than it needed to be. If you are using the Kibana console (or Sense with older versions) for Elasticsearch development then this might be a useful thing to be aware of.

I haven’t tested Solr’s handling of duplicate JSON keys yet but that would probably be an interesting exercise.

The post Elasticsearch, Kibana and duplicate keys in JSON appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2017/08/03/inconsistent-json-semantics-headache/feed/ 0
Flax announces partnership with Apache Kafka creators Confluent http://www.flax.co.uk/blog/2016/04/07/flax-announces-partnership-apache-kafka-creators-confluent/ http://www.flax.co.uk/blog/2016/04/07/flax-announces-partnership-apache-kafka-creators-confluent/#respond Thu, 07 Apr 2016 10:22:14 +0000 http://www.flax.co.uk/?p=3167 We’re very happy to announce our partnership with Confluent, which was founded by the creators of Apache Kafka, a stream data platform and the central component of their Confluent Platform. Flax has been aware of Kafka since its inception at … More

The post Flax announces partnership with Apache Kafka creators Confluent appeared first on Flax.

]]>
We’re very happy to announce our partnership with Confluent, which was founded by the creators of Apache Kafka, a stream data platform and the central component of their Confluent Platform. Flax has been aware of Kafka since its inception at LinkedIn, where it is used as the messaging backbone for a wide array of technical and business data, like click stream events, ad impressions, social network change events, systems monitoring, messaging, analytics and logging applications.

Kafka has been described as ‘TiVo for data’ – you can put pretty much any streaming data into Kafka, store it in a distributed and resilient way and then play it out again from any point. It’s highly scalable and integrates well with other Big Data tools such as Apache Hadoop. We’ve used Kafka and its sister project Apache Samza to develop prototype high-performance media monitoring systems and we’re also using it along with Elasticsearch, Logstash and Kibana (the ELK stack) to develop log monitoring and analysis systems. We’re hearing about many other potential uses of Kafka in the Big Data and Internet of Things ecosystems.

Our partnership with Confluent will allow us to work more closely together to provide a foundation for delivering better solutions faster for our customers based on Kafka and Confluent Platform, a complete and fully supported streaming data system based on Kafka and Hadoop.

“Kafka is creating a new paradigm for organizations and allowing businesses across industries to make informed, timely decisions from their data in real time” said Jabari Norton, VP Business Development at Confluent. “We are excited to include Flax among the ranks of a growing landscape of diverse partners and systems integrators committed to unlocking the potential of streaming data for their customers.”

We’ll be talking at the London Kafka meetup on April 13th if you’d like to find out more or discuss a potential Kafka project – if you can’t make it do get in touch.

The post Flax announces partnership with Apache Kafka creators Confluent appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/04/07/flax-announces-partnership-apache-kafka-creators-confluent/feed/ 0
Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem http://www.flax.co.uk/blog/2016/03/03/working-hadoop-kafka-samza-wider-big-data-ecosystem/ http://www.flax.co.uk/blog/2016/03/03/working-hadoop-kafka-samza-wider-big-data-ecosystem/#comments Thu, 03 Mar 2016 10:01:00 +0000 http://www.flax.co.uk/?p=3055 We’ve been working on a number of projects recently involving open source software often quoted as ‘Big Data’ solutions – here’s a quick overview of them. The grandfather of them all of course is Apache Hadoop, now not so much … More

The post Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem appeared first on Flax.

]]>
We’ve been working on a number of projects recently involving open source software often quoted as ‘Big Data’ solutions – here’s a quick overview of them.

The grandfather of them all of course is Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly Hadoop was originally created by Doug Cutting, who also wrote Lucene (the search library used by Apache Solr and Elasticsearch) and the Nutch web crawler. We’ve been helping clients distribute processing tasks using Hadoop’s MapReduce algorithm and also to speed up their indexing from Hadoop into Elasticsearch. Other projects we’ve used in the Hadoop ecosystem include Apache Zookeeper (used to coordinate lots of Solr servers into a distributed SolrCloud) and Apache Spark (for distributed processing).

We’re increasingly using Apache Kafka (a message broker) for handling large volumes of streaming data, for example log files. Kafka provides persistent storage of these streams, which might be ingested and pre-processed using Logstash and then indexed with Elasticsearch and visualised with Kibana to build high-performance monitoring systems. Throughput of thousands of items a second is not uncommon and these open source systems can easily match the performance of proprietary monitoring engines such as Splunk at a far lower cost. Apache Samza, a stream processing framework, is based on Kafka and we’ve built a powerful full-text search for streams system using it. Note that Elasticsearch has a similar ‘stored search’ feature called Percolator, but this is quite a lot slower (as others have confirmed).

Most of the above systems are written in Java, and if not run on the Java Virtual Machine (JVM), so our experience building large, performant and resilient systems on this platform has been invaluable. We’ll be writing in more detail about these projects soon. I’ve always said that search experts have been dealing with Big Data since well before it gained popularity as a concept – so if you’re serious about Big Data, ask us how we could help!

The post Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/03/03/working-hadoop-kafka-samza-wider-big-data-ecosystem/feed/ 2
Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/ http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/#comments Thu, 18 Feb 2016 11:42:07 +0000 http://www.flax.co.uk/?p=3026 Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at … More

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

]]>
Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture: IMAG0507
Elasticsearch provides querying capabilities and visualisations using Kibana. Joseph’s team have focused on making the system (and tasks that run on it) easy to deploy and use, with this currently managed using Ansible and TeamCity although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With this short a timescale, Bruno took a pragmatic approach combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can either be done at ingestion time (which as you must store all the data you might need in separated chunks can end up taking up huge amounts of storage) or query time (which is far more flexible especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather uses an in-memory key-value store. For resiliency, this creates a log which is emitted as a Kafka stream. His approach seems to have huge performance implications – he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see if our own work combining our stored query library Luwak with Samza could be combined with Samsara.

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/02/18/unified-log-meetup-scaling-skyscanner-samza-samsara/feed/ 2
Elasticon London 2015 – more products, more scale, more users! http://www.flax.co.uk/blog/2015/11/09/elasticon-london-2015-more-products-more-scale-more-users/ http://www.flax.co.uk/blog/2015/11/09/elasticon-london-2015-more-products-more-scale-more-users/#respond Mon, 09 Nov 2015 11:49:58 +0000 http://www.flax.co.uk/?p=2782 Last week Elastic, the company behind Elasticsearch, landed in London for one of their current series of one-day events. The £50 entrance fee has been put to good use, raising £16750 for AbilityNet who work on accessible IT – a … More

The post Elasticon London 2015 – more products, more scale, more users! appeared first on Flax.

]]>
Last week Elastic, the company behind Elasticsearch, landed in London for one of their current series of one-day events. The £50 entrance fee has been put to good use, raising £16750 for AbilityNet who work on accessible IT – a very generous offer by Elastic.

Shay Banon, creator of Elasticsearch, kicked off with a brief history of the project which started when he built the Compass search engine, pretty much as a hobby project while his wife was training as a chef in London. Things have moved on somewhat: today there is a 35,000 strong community with over 35 million downloads of the Elasticsearch software and a number of high-profile users including NASA, Wikimedia and Verizon (who apparently have an impressive 500 billion items indexed).

Clinton Gormley led the next session, talking about new features in the recent 2.0 release. Resiliency, performance and analytics were major themes, with the latter leveraging Lucene’s DocValues as an off-heap column store to build various prediction and detection capabilities. Also mentioned was a new scriptable Ingest Node incorporating parts of the Logstash project. Steve Mayzak then told us about the new version 4 of the Kibana visualisation package, which has now grown in a general UI framework incorporating D3.js for charting and providing an extension API. Shay returned to tell us more about Logstash, which provides over 200 plugins for ingesting data into Elasticsearch. Next up was Uri Boness telling us about the various closed-source parts of the Elasticsearch ecosystem (including the Marvel performance monitor and Shield secuurity module) and we then heard from Morten Ingebrigtsen of Found (a hosted Elasticsearch solution, who Elastic acquired a while ago). For me the most interesting item here was news of an on-premise version of Found Premium – yes, like Lucidworks Fusion, you can now buy a packaged open source search engine from Elastic as a product. This isn’t something we generally recommend as it does remove one of the key advantages of open source, which is the lack of vendor lock-in, but it’s interesting to see Elastic plough such a familiar furrow.

The afternoon consisted of case studies including The Guardian (which I’ve written about previously), a good talk from Jay Chin on using Elasticsearch for Grid Computing for the financial services sector and a couple of use cases from Goldman Sachs. We also heard about the elasticsearch-hadoop connector – note that for high-performance indexing this may not be the best option. I missed a couple of the other talks due to a phone call but returned to hear Shay again, with a controversial statement that ‘the top 8 Lucene committers now work for Elastic’ – how exactly are you measuring that and have you told the other committers? He did however conclude reassuringly with ‘we’re not trying to force anyone to use commercial versions [of Elasticsearch]’ – good to hear!

By the way, if you want to hear how we helped a billion-pound UK IT supplier use Elasticsearch for their e-commerce website, we’ll be presenting with them at the Elasticsearch London Meetup later this month.

The post Elasticon London 2015 – more products, more scale, more users! appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2015/11/09/elasticon-london-2015-more-products-more-scale-more-users/feed/ 0
ElasticSearch London Meetup – a busy and interesting evening! http://www.flax.co.uk/blog/2014/02/26/elasticsearch-london-meetup-a-busy-and-interesting-evening/ http://www.flax.co.uk/blog/2014/02/26/elasticsearch-london-meetup-a-busy-and-interesting-evening/#respond Wed, 26 Feb 2014 13:44:43 +0000 http://www.flax.co.uk/blog/?p=1139 I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest … More

The post ElasticSearch London Meetup – a busy and interesting evening! appeared first on Flax.

]]>
I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion about how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can help to surface content from previously unconnected and inaccessible ‘data islands’ and can help promote re-use and repurposing of the data, and can lead clients to understand the value of committing to funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 1 core CPU instances the system they have built can store 1.2 billion log lines with a throughput up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes, indexed by Elasticsearch with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene‘s TF/IDF ranking model can produce some unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console allowing for on-the-fly experimentation. I believe this started as a Chrome plugin.

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later), an Aggregator for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples including bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE Mark has posted some code from his demo here.

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

The post ElasticSearch London Meetup – a busy and interesting evening! appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2014/02/26/elasticsearch-london-meetup-a-busy-and-interesting-evening/feed/ 0