logstash – Flax, The Open Source Search Specialists (http://www.flax.co.uk)

Better performance with the Logstash DNS filter
Thu, 17 Aug 2017
http://www.flax.co.uk/blog/2017/08/17/better-performance-logstash-dns-filter/
We’ve been working on a project for a customer which uses Logstash to read messages from Kafka and write them to Elasticsearch. It also parses the messages into fields and, depending on the content type, does DNS lookups (both forward and reverse).

While performance testing I noticed that adding caching to the Logstash DNS filter actually reduced performance, contrary to expectations. With four filter worker threads, and the following configuration:

dns { 
  resolve => [ "Source_IP" ] 
  action => "replace" 
  hit_cache_size => 8000 
  hit_cache_ttl => 300 
  failed_cache_size => 1000 
  failed_cache_ttl => 10
}

the maximum throughput was only 600 messages/s, as opposed to 1000 messages/s with no caching (4000/s with no DNS lookup at all).

This was very odd, so I looked at the source code. Here is the DNS lookup when a cache is configured:

address = @hitcache.getset(raw) { retriable_getaddress(raw) }

This executes retriable_getaddress(raw) inside the getset() cache method, which is synchronised. Therefore, concurrent DNS lookups are impossible when a cache is used.
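To make the contention concrete, here is a minimal, hypothetical model of a cache whose getset() holds a single lock while the supplied block runs (the class and variable names are illustrative, not the actual cache classes the filter uses). With four threads resolving different keys, at most one “lookup” is ever in flight:

```ruby
# Minimal model of a cache whose getset() holds one lock while the
# block (the DNS lookup) runs. Names are illustrative -- this is not
# the actual cache implementation used by the filter.
class SynchronisedCache
  def initialize
    @store = {}
    @lock  = Mutex.new
  end

  # The whole lookup runs inside the lock, so only one thread can
  # resolve at a time, even when the keys are different.
  def getset(key)
    @lock.synchronize do
      @store.fetch(key) { @store[key] = yield }
    end
  end
end

cache        = SynchronisedCache.new
active       = 0
max_active   = 0
counter_lock = Mutex.new

threads = 4.times.map do |i|
  Thread.new do
    cache.getset("host-#{i}") do
      counter_lock.synchronize { active += 1; max_active = [max_active, active].max }
      sleep 0.05                              # stand-in for a slow DNS lookup
      counter_lock.synchronize { active -= 1 }
      "10.0.0.#{i}"                           # pretend resolved address
    end
  end
end
threads.each(&:join)

puts max_active   # prints 1: lookups are fully serialised by the lock
```

Even though the four keys are distinct, the lock serialises the slow lookups, which is exactly the behaviour that capped throughput at 600 messages/s.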

To see if this was the problem, I created a fork of the dns filter which does not synchronise the retriable_getaddress() call.

 address = @hit_cache[raw]
 if address.nil?
   address = retriable_getaddress(raw)
   unless address.nil?
     @hit_cache[raw] = address
   end
 end

Tests on the same data revealed a throughput of nearly 2000 messages/s with four worker threads (and 2600 with eight threads), which is a significant improvement.

This approach has the disadvantage that it might redundantly look up the same address more than once, if the same domain name or IP address turns up in several worker threads simultaneously. However, the risk of this is probably low (depending on the input data) and in any case the race is harmless: every thread that performs the lookup caches the same result.
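As a sketch of why the unsynchronised version is safe in practice, here is a small, hypothetical Ruby example (names are illustrative): several threads race through the same check-then-set path, and some may resolve the same name redundantly, but they all end up with the identical address:

```ruby
# Hypothetical model of the forked filter's unsynchronised
# check-then-set: threads may duplicate a lookup, but the cached
# outcome is identical either way. Names are illustrative.
lookups = Queue.new     # records each time a "real" lookup happens
cache   = {}

resolve = lambda do |name|
  lookups << name
  sleep 0.01            # stand-in for retriable_getaddress(name)
  "192.0.2.1"           # every lookup of this name returns the same address
end

threads = 8.times.map do
  Thread.new do
    address = cache["example.com"]
    if address.nil?
      address = resolve.call("example.com")
      cache["example.com"] = address unless address.nil?
    end
    address
  end
end
results = threads.map(&:value)

puts results.uniq.inspect   # ["192.0.2.1"] -- all threads agree
puts lookups.size           # at least 1; may exceed 1 (the benign race)
```

The duplicated work is bounded by the number of worker threads, so the cost of the occasional redundant lookup is far outweighed by allowing lookups to proceed concurrently.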

I have released a gem of the plugin if you want to try it. Comments appreciated.

The post Better performance with the Logstash DNS filter appeared first on Flax.

Flax announces partnership with Apache Kafka creators Confluent
Thu, 07 Apr 2016
http://www.flax.co.uk/blog/2016/04/07/flax-announces-partnership-apache-kafka-creators-confluent/
We’re very happy to announce our partnership with Confluent, which was founded by the creators of Apache Kafka, a stream data platform and the central component of their Confluent Platform. Flax has been aware of Kafka since its inception at LinkedIn, where it is used as the messaging backbone for a wide array of technical and business data, like click stream events, ad impressions, social network change events, systems monitoring, messaging, analytics and logging applications.

Kafka has been described as ‘TiVo for data’ – you can put pretty much any streaming data into Kafka, store it in a distributed and resilient way and then play it out again from any point. It’s highly scalable and integrates well with other Big Data tools such as Apache Hadoop. We’ve used Kafka and its sister project Apache Samza to develop prototype high-performance media monitoring systems and we’re also using it along with Elasticsearch, Logstash and Kibana (the ELK stack) to develop log monitoring and analysis systems. We’re hearing about many other potential uses of Kafka in the Big Data and Internet of Things ecosystems.

Our partnership with Confluent will allow us to work more closely together to provide a foundation for delivering better solutions faster for our customers based on Kafka and Confluent Platform, a complete and fully supported streaming data system based on Kafka and Hadoop.

“Kafka is creating a new paradigm for organizations and allowing businesses across industries to make informed, timely decisions from their data in real time” said Jabari Norton, VP Business Development at Confluent. “We are excited to include Flax among the ranks of a growing landscape of diverse partners and systems integrators committed to unlocking the potential of streaming data for their customers.”

We’ll be talking at the London Kafka meetup on April 13th if you’d like to find out more or discuss a potential Kafka project – if you can’t make it do get in touch.

The post Flax announces partnership with Apache Kafka creators Confluent appeared first on Flax.

Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem
Thu, 03 Mar 2016
http://www.flax.co.uk/blog/2016/03/03/working-hadoop-kafka-samza-wider-big-data-ecosystem/
We’ve been working on a number of projects recently involving open source software often quoted as ‘Big Data’ solutions – here’s a quick overview of them.

The grandfather of them all of course is Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly Hadoop was originally created by Doug Cutting, who also wrote Lucene (the search library used by Apache Solr and Elasticsearch) and the Nutch web crawler. We’ve been helping clients distribute processing tasks using Hadoop’s MapReduce algorithm and also to speed up their indexing from Hadoop into Elasticsearch. Other projects we’ve used in the Hadoop ecosystem include Apache Zookeeper (used to coordinate lots of Solr servers into a distributed SolrCloud) and Apache Spark (for distributed processing).

We’re increasingly using Apache Kafka (a message broker) for handling large volumes of streaming data, for example log files. Kafka provides persistent storage of these streams, which might be ingested and pre-processed using Logstash and then indexed with Elasticsearch and visualised with Kibana to build high-performance monitoring systems. Throughput of thousands of items a second is not uncommon and these open source systems can easily match the performance of proprietary monitoring engines such as Splunk at a far lower cost. Apache Samza, a stream processing framework, is based on Kafka and we’ve built a powerful full-text search for streams system using it. Note that Elasticsearch has a similar ‘stored search’ feature called Percolator, but this is quite a lot slower (as others have confirmed).
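As an illustration, a minimal Logstash pipeline of the kind described above might look like the following; the topic, host and index names are purely illustrative:

```
# Hypothetical Logstash pipeline: read from a Kafka topic, parse
# each line, and index the result into Elasticsearch for Kibana.
input {
  kafka {
    bootstrap_servers => "localhost:9092"
    topics => ["app-logs"]
  }
}
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```

Because Kafka retains the stream, a pipeline like this can be stopped and replayed from an earlier offset without losing data.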

Most of the above systems are written in Java, and the rest run on the Java Virtual Machine (JVM), so our experience building large, performant and resilient systems on this platform has been invaluable. We’ll be writing in more detail about these projects soon. I’ve always said that search experts have been dealing with Big Data since well before it gained popularity as a concept – so if you’re serious about Big Data, ask us how we could help!

The post Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem appeared first on Flax.

Elasticon London 2015 – more products, more scale, more users!
Mon, 09 Nov 2015
http://www.flax.co.uk/blog/2015/11/09/elasticon-london-2015-more-products-more-scale-more-users/
Last week Elastic, the company behind Elasticsearch, landed in London for one of their current series of one-day events. The £50 entrance fee has been put to good use, raising £16,750 for AbilityNet, who work on accessible IT – a very generous offer by Elastic.

Shay Banon, creator of Elasticsearch, kicked off with a brief history of the project which started when he built the Compass search engine, pretty much as a hobby project while his wife was training as a chef in London. Things have moved on somewhat: today there is a 35,000 strong community with over 35 million downloads of the Elasticsearch software and a number of high-profile users including NASA, Wikimedia and Verizon (who apparently have an impressive 500 billion items indexed).

Clinton Gormley led the next session, talking about new features in the recent 2.0 release. Resiliency, performance and analytics were major themes, with the latter leveraging Lucene’s DocValues as an off-heap column store to build various prediction and detection capabilities. Also mentioned was a new scriptable Ingest Node incorporating parts of the Logstash project. Steve Mayzak then told us about the new version 4 of the Kibana visualisation package, which has now grown into a general UI framework incorporating D3.js for charting and providing an extension API. Shay returned to tell us more about Logstash, which provides over 200 plugins for ingesting data into Elasticsearch. Next up was Uri Boness telling us about the various closed-source parts of the Elasticsearch ecosystem (including the Marvel performance monitor and Shield security module) and we then heard from Morten Ingebrigtsen of Found (a hosted Elasticsearch solution, which Elastic acquired a while ago). For me the most interesting item here was news of an on-premise version of Found Premium – yes, like Lucidworks Fusion, you can now buy a packaged open source search engine from Elastic as a product. This isn’t something we generally recommend, as it removes one of the key advantages of open source, the lack of vendor lock-in, but it’s interesting to see Elastic plough such a familiar furrow.

The afternoon consisted of case studies including The Guardian (which I’ve written about previously), a good talk from Jay Chin on using Elasticsearch for Grid Computing for the financial services sector and a couple of use cases from Goldman Sachs. We also heard about the elasticsearch-hadoop connector – note that for high-performance indexing this may not be the best option. I missed a couple of the other talks due to a phone call but returned to hear Shay again, with a controversial statement that ‘the top 8 Lucene committers now work for Elastic’ – how exactly are you measuring that and have you told the other committers? He did however conclude reassuringly with ‘we’re not trying to force anyone to use commercial versions [of Elasticsearch]’ – good to hear!

By the way, if you want to hear how we helped a billion-pound UK IT supplier use Elasticsearch for their e-commerce website, we’ll be presenting with them at the Elasticsearch London Meetup later this month.

The post Elasticon London 2015 – more products, more scale, more users! appeared first on Flax.

Elastic London User Group Meetup – scaling with Kafka and Cassandra
Thu, 26 Mar 2015
http://www.flax.co.uk/blog/2015/03/26/elastic-london-user-group-meetup-scaling-with-kafka-and-cassandra/
The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to 6 datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split brain problem being one condition to be watched for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’) before settling on Cassandra which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .Net compatible clients was key and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be a Elastic(on) in Europe some time this year (if anyone from the Elastic team is reading this please try and avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.

The post Elastic London User Group Meetup – scaling with Kafka and Cassandra appeared first on Flax.

ElasticSearch London Meetup – a busy and interesting evening!
Wed, 26 Feb 2014
http://www.flax.co.uk/blog/2014/02/26/elasticsearch-london-meetup-a-busy-and-interesting-evening/
I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion of how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can surface content from previously unconnected and inaccessible ‘data islands’, promote re-use and repurposing of that data, and lead clients to see the value of funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 single-core CPU instances, the system they have built can store 1.2 billion log lines with a throughput of up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes and indexed by Elasticsearch, with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross-field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene’s TF/IDF ranking model can produce unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console which allows on-the-fly experimentation. I believe this started as a Chrome plugin.

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later), an Aggregator for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples including bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE Mark has posted some code from his demo here.
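For a flavour of the feature, a significant_terms aggregation (as released in Elasticsearch 1.1) looks something like the following – the index and field names here are hypothetical, not taken from Mark’s demo:

```
POST /crimes/_search
{
  "query": { "match": { "area": "Cambridge" } },
  "aggregations": {
    "unusual_crimes": {
      "significant_terms": { "field": "crime_type" }
    }
  }
}
```

Rather than returning the most frequent crime types overall, the aggregation returns those that are unusually frequent in the matched documents compared with the whole index – the ‘uncommon common’ occurrences Mark described.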

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

The post ElasticSearch London Meetup – a busy and interesting evening! appeared first on Flax.
