Flax – The Open Source Search Specialists (http://www.flax.co.uk)

Elasticsearch vs. Solr: performance improvements
http://www.flax.co.uk/blog/2015/12/18/elasticsearch-vs-solr-performance-improvements/
Fri, 18 Dec 2015

I had been planning not to continue with these posts, but after Matt Weber pointed out the GitHub pull requests he'd made to address some methodological flaws (which to my embarrassment I'd not even noticed), another attempt was the least I could do.

For Solr there was a slight reduction in mean search time, from 39ms (for my original, suboptimal query structure) to 34ms and median search time from 27ms to 25ms – see figure 1. Elasticsearch, on the other hand, showed a bigger improvement – see figure 2. Mean search time went down from 39ms to 27ms and median from 36ms to 24ms.

Comparing Solr with Elasticsearch using Matt’s changes, we get figure 3. The medians are close, at 25ms vs 24ms, but Elasticsearch has a significantly lower mean, at 27ms vs 34ms. The difference is even greater at the 99th percentile, at 57ms vs 126ms.

These results seem to confirm that Elasticsearch still has the edge over Solr. However, the QPS measurement (figure 4) now gives the advantage to Solr, at nearly 80 QPS, vs 60 QPS for Elasticsearch. The latter has actually decreased since making Matt’s changes. This last result is very unexpected, so I will be trying to reproduce both figures as soon as I have the chance (as well as Matt’s suggestion of trying the new-ish filter() operator in Solr).
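For anyone reproducing these figures, the summary statistics quoted above (mean, median, 99th percentile) are easy to derive from raw per-query latencies. This is a minimal, generic Python sketch, not the actual benchmark harness used for these tests:

```python
import statistics

def summarise_latencies(latencies_ms):
    """Summarise a list of per-query latencies (in milliseconds)."""
    ordered = sorted(latencies_ms)
    # Nearest-rank 99th percentile: the smallest value covering 99% of samples.
    p99_index = max(0, int(round(0.99 * len(ordered))) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }

# Example: 100 simulated latencies, mostly fast with a slow tail
sample = [20] * 98 + [120, 130]
stats = summarise_latencies(sample)
```

Note that the mean is pulled up by the slow tail while the median is not, which is why the Solr and Elasticsearch medians above can be close while their means and 99th percentiles differ substantially.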

Our sincere thanks to Matt for his valuable input.

Figure 1: Solr search times, no indexing, original vs Matt

Figure 2: ES search times, no indexing, original vs Matt

Figure 3: ES vs Solr search times, no indexing, Matt's changes

Figure 4: QPS, original vs Matt's changes

The post Elasticsearch vs. Solr: performance improvements appeared first on Flax.

Luwak 1.3.0 released
http://www.flax.co.uk/blog/2015/11/17/luwak-1-3-0-released/
Tue, 17 Nov 2015

The latest version of Luwak, our open-source streaming query engine, has been released on the Sonatype Nexus repository and will be making its way to Maven Central in the next few hours. Here's a summary of the new features and improvements we've made:

Batch processing

Inspired by a question raised during our talk at FOSDEM last February, you can now stream documents through the Luwak Monitor in batches, as well as one at a time. This will generally improve your throughput, at the cost of increased latency. For example, local benchmarking against a set of 10,000 queries showed throughput improve from 10 documents/second to 30 documents/second when the batch size was increased from 1 document to 30 documents; however, processing latency went from ~100ms for the single document to 10 seconds for the larger batch. You'll need to experiment with batch sizes to find the right balance for your own use.
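The batching API itself is Java, but the underlying idea is simple: collect incoming documents into fixed-size groups before matching, so per-batch overheads are amortised. Here's an illustrative Python sketch of a hypothetical `batched` helper, not Luwak's actual API:

```python
def batched(stream, batch_size):
    """Group an iterable of documents into lists of up to batch_size items."""
    batch = []
    for doc in stream:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch

docs = [f"doc{i}" for i in range(7)]
batches = list(batched(docs, 3))
```

Each batch would then be matched in one call; a document now waits for its batch to fill (or flush) before results appear, which is exactly the latency cost described above.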

Presearcher performance improvements

Luwak speeds up document matching by filtering out queries that we can detect won’t match a given document or batch, a process we call presearching. Profiling revealed that creating the presearcher query was a serious performance bottleneck, particularly for presearchers using the WildcardNGramPresearcherComponent, so this has been largely rewritten in 1.3.0. We’ve seen improvements of up to 400% in query build times after this rewrite.

Concurrent query loading

Luwak now ships with a ConcurrentQueryLoader helper class to help speed up Monitor startup. The loader uses multiple threads to add queries to the index, allowing you to make use of all your CPUs when parsing and analyzing queries. Note that this requires your MonitorQueryParser implementations to be thread-safe!
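The idea behind the loader can be sketched in Python using a thread pool: parsing (the expensive part) runs concurrently, while results are collected into the index. The function and parser below are illustrative stand-ins, not Luwak's ConcurrentQueryLoader API:

```python
from concurrent.futures import ThreadPoolExecutor

def parse_query(raw):
    """Stand-in for an expensive, thread-safe query parser."""
    return raw.strip().lower()

def load_queries(raw_queries, index, workers=4):
    """Parse queries on multiple threads, then add them to a dict-like index.

    pool.map preserves input order, so each raw query can be zipped
    back up with its parsed form.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for raw, parsed in zip(raw_queries, pool.map(parse_query, raw_queries)):
            index[raw] = parsed
    return index

index = load_queries(["Foo AND Bar", "  baz  "], {})
```

As with the real loader, this only works if the parser is safe to call from multiple threads at once.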

Easier configuration and state monitoring

In 1.2.0 and earlier, clients had to extend the Monitor itself in order to configure the internal query caches or get state update information. Configuration has now been extracted into a QueryIndexConfiguration class, passed to the Monitor at construction, and you can get notified about updates to the query index by registering QueryIndexUpdateListeners.

For more information, see the CHANGES for 1.3.0. We’ll also be re-running the comparison with Elasticsearch Percolator soon, as this has also been improved as part of Elasticsearch’s recent 2.0 release.

The post Luwak 1.3.0 released appeared first on Flax.

Enterprise Search Europe 2015: Fishing the big data streams – the future of search
http://www.flax.co.uk/blog/2015/10/28/enterprise-search-europe-2015-fishing-the-big-data-streams-the-future-of-search/
Wed, 28 Oct 2015

Slides from Charlie Hull.


The post Enterprise Search Europe 2015: Fishing the big data streams – the future of search appeared first on Flax.

Elasticsearch Percolator & Luwak: a performance comparison of streamed search implementations
http://www.flax.co.uk/blog/2015/07/27/a-performance-comparison-of-streamed-search-implementations/
Mon, 27 Jul 2015

Most search applications work by indexing a relatively stable collection of documents and then allowing users to perform ad-hoc searches to retrieve relevant documents. However, in some cases it is useful to turn this model on its head, and match individual documents against a collection of saved queries. I shall refer to this model as “streamed search”.

One example of streamed search is in media monitoring. The monitoring agency’s client’s interests are represented by stored queries. Incoming documents are matched against the stored queries, and hits are returned for further processing before being reported to the client. Another example is financial news monitoring, to predict share price movements.

In both these examples, queries may be extremely complex (in order to improve the accuracy of hits). There may be hundreds of thousands of stored queries, and documents to be matched may be incoming at a rate of hundreds or thousands per second. Not surprisingly, streamed search can be a demanding task, and the computing resources required to support it a significant expense. There is therefore a need for the software to be as performant and efficient as possible.

The two leading open source streamed search implementations are Elasticsearch Percolator, and Luwak. Both depend on the Lucene search engine. As the developers of Luwak, we have an interest in how its performance compares with Percolator. We therefore carried out some preliminary testing.

Ideally, we would have used real media monitoring queries and documents. However, these are typically protected by copyright, and the queries represent a fundamental asset of monitoring companies. In order to make the tests distributable, we chose to use freely downloadable documents from Wikipedia, and to generate random queries. These queries were much simpler in structure than the often deeply nested queries from real applications, but we believe that they still provide a useful comparison.

The tests were carried out on an Amazon EC2 r3.large VM running Ubuntu. We wrote a Python script to download, parse and store random Wikipedia articles, and another to generate random queries from the text. The query generator was designed to be somewhat “realistic”, in that each query should match more than zero documents. For Elasticsearch, we wrote scripts to index queries into the Percolator and then run documents through it. Since Luwak has a Java API (rather than Elasticsearch’s RESTful API), we wrote a minimal Java app to do the same.

10,000 documents were downloaded from Wikipedia, and 100,000 queries generated for each test. We generated four types of query:

  • Boolean with 10 required terms and 2 excluded terms
  • Boolean with 100 required terms and 20 excluded terms
  • 20 required wildcard terms, with a prefix of 4 characters
  • 2-term phrase query with slop of 5
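The "realistic" property of the generator (every query matches at least one document) can be guaranteed by drawing a query's required terms from a single randomly chosen document. This is a simplified Python sketch of that idea for the Boolean query types, not our actual test script:

```python
import random

def generate_boolean_query(documents, n_required, n_excluded, rng):
    """Build a random Boolean query guaranteed to match at least one document.

    Required terms are sampled from one randomly chosen document, so that
    document is always a hit; excluded terms are sampled from the rest of
    the collection's vocabulary, avoiding the chosen document's terms.
    """
    doc = rng.choice(documents)
    doc_terms = set(doc.split())
    required = rng.sample(sorted(doc_terms), min(n_required, len(doc_terms)))
    vocab = set().union(*(set(d.split()) for d in documents)) - doc_terms
    excluded = rng.sample(sorted(vocab), min(n_excluded, len(vocab)))
    return {"must": required, "must_not": excluded}

rng = random.Random(42)
docs = ["the quick brown fox", "lazy dogs sleep all day", "foxes eat berries"]
query = generate_boolean_query(docs, n_required=2, n_excluded=1, rng=rng)
```

By construction, the document the terms were drawn from contains every required term and none of the excluded ones.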

We ran the tests independently, giving Luwak and Elasticsearch a JVM heap size of 8GB, and doing an initial pre-run in order to warm the OS cache (this did not actually have a noticeable effect). For sanity, we checked that each document matched the same queries in both Luwak and Percolator.

The results are shown in the graphs below, where the y-axis represents average documents processed per second.

Figure: Results 1

Figure: Results 2

Luwak was consistently faster than Percolator, ranging from a factor of 6 (for the phrase query type) to 40 (for the large Boolean queries).

The reason for this is almost certainly due to Luwak’s presearcher. When a query is added to Luwak, the library generates terms to index the query. For each incoming document, a secondary query is constructed and run against the query index, which returns a subset of the entire query set. Each of these is then run against the document in order to generate the final results. The effect of this is to reduce the number of primary queries which have to be executed against the document, often by a considerable factor (at a relatively small cost of executing the secondary query). Percolator does not have this feature, and by default matches every primary query against every document (it would be possible, but not straightforward, for an application to implement a presearch phase in Percolator). Supporting this analysis, when the Luwak presearcher was disabled its performance dropped to about the same level as Percolator.
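The mechanism described above can be illustrated with a toy sketch, using plain term sets to stand in for Lucene queries and the query index (Luwak itself handles wildcards, phrases and nested queries far more cleverly than this):

```python
from collections import defaultdict

def build_query_index(queries):
    """Index each stored query id under the terms it requires.

    `queries` maps a query id to a set of required terms; a real
    presearcher extracts indexable terms from arbitrary Lucene queries.
    """
    index = defaultdict(set)
    for qid, terms in queries.items():
        for term in terms:
            index[term].add(qid)
    return index

def match_document(doc_terms, queries, query_index):
    # Presearch: only queries sharing at least one term with the
    # document become candidates -- usually a small subset.
    candidates = set()
    for term in doc_terms:
        candidates |= query_index.get(term, set())
    # Full evaluation: run each candidate query against the document.
    return sorted(q for q in candidates if queries[q] <= doc_terms)

queries = {"q1": {"fox", "brown"}, "q2": {"cat"}, "q3": {"fox", "lazy"}}
qindex = build_query_index(queries)
hits = match_document({"the", "quick", "brown", "fox"}, queries, qindex)
```

Here "q2" is never even considered, and "q3" is eliminated cheaply at the presearch stage; without the presearch step, every stored query would be evaluated in full against every document, which is essentially what Percolator does by default.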

These results must be treated with a degree of caution, for several reasons. As already explained, the queries used were randomly generated, and far simpler in structure than typical hand-crafted monitoring queries. Furthermore, the tests were single-threaded and single-sharded, whereas a multithreaded, multi-shard, distributed architecture would be typical for a real-world system. Finally, Elasticsearch Percolator is a service providing a high-level, RESTful API, while Luwak is much lower level, and would require significantly more application-level code to be implemented for a real installation.

However, since both Luwak and Percolator use the same underlying search technology, it is reasonable to conclude that the Luwak presearcher can give it a considerable performance advantage over Percolator.

If you are already using Percolator, should you change? If performance is not a problem now and is unlikely to become a problem, then the effort required is unlikely to be worth it. Luwak is not a drop-in replacement for Percolator. However, if you are planning a data-intensive streaming search system, it would be worth comparing the two. Luwak works well with existing high-performance distributed computing frameworks, which would enable applications using it to scale to very large query sets and document streams.

Our test scripts are available here. We would welcome any attempts to replicate, extend or contest our results.

The post Elasticsearch Percolator & Luwak: a performance comparison of streamed search implementations appeared first on Flax.

Innovations in Knowledge Organisation, Singapore: a review
http://www.flax.co.uk/blog/2015/06/12/innovations-in-knowledge-organisation-singapore-a-review/
Fri, 12 Jun 2015

I’m just back from Singapore: my first visit to this amazing, dynamic and everchanging city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. I think this was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search, a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch & Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing Powerpoint presentations which don’t necessarily combine well with jetlag.

After lunch, showcasing Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again) we moved on to the first set of case studies. Each presenter had 6 minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, for three 20-minute sessions. I had some great discussions including hearing about how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on 'The Future of Search' – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyze the huge volumes of data we might expect from the Internet of Things. I had some great feedback from the audience (although I'm pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world's first 'smart nation' – handling data will be absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore‘s OneSearch service, which allowed a federated search across tens of millions of items from many different repositories (e.g. books, newspaper articles, audio transcripts). The technologies used included Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system. In particular, Solr was credited for saving ‘millions of Singapore dollars’ in license fees compared to the previous closed source search system it replaced. Also of interest was Straits Knowledge‘s system for capturing the knowledge assets of an organisation with a system built on a graph database, and Haliza Jailani on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

The post Innovations in Knowledge Organisation, Singapore: a review appeared first on Flax.

Going international – open source search in London, Berlin & Singapore
http://www.flax.co.uk/blog/2015/05/29/going-international-open-source-search-in-london-berlin-singapore/
Fri, 29 May 2015

We’re travelling a bit over the next few weeks to visit and speak at various events. This weekend Alan Woodward is at Berlin Buzzwords, a hacker-focused conference with a programme full of search talks. He’s not speaking this year, but if you want to talk about Lucene, Solr or our own Luwak stored search library and the crazy things you can do with it, do buy him a beer!

Next week we’re hosting another London Lucene/Solr User Group Meetup with Doug Turnbull of Open Source Connections. Doug is the author of a forthcoming book on Relevant Search and the creator of Quepid, a tool for gathering relevance judgements for Solr-based search systems and then seeing how these scores change as you tune the Solr installation. Tuning relevance is a very common (and often difficult) task during search projects and can make a significant difference to the user experience (and in particular, for e-commerce can hugely affect your bottom line) – so we’re very much looking forward to Doug’s talk.

The week after I’m in Singapore visiting the Innovations in Knowledge Organisation conference – a new event focusing on knowledge management and search. I’ve been asked to talk about open source search and to keynote the second day of the event and speak on ‘The Future of Search’. Do let me know if you’re attending and would like to meet up.

The post Going international – open source search in London, Berlin & Singapore appeared first on Flax.

Elasticsearch London Meetup: Templates, easy log search & lead generation
http://www.flax.co.uk/blog/2015/01/30/elasticsearch-london-meetup-templates-easy-log-search-lead-generation/
Fri, 30 Jan 2015

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you’ll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly).

First up was Reuben Sutton of Artirix, describing how his team had moved away from the Elasticsearch Ruby libraries (which can be very slow, mainly due to the time taken to decode/encode data as JSON) towards the relatively new Mustache templating framework. This has allowed them to remove anything complex to do with search from their UI code, although they have had some trouble with Mustache’s support for partial templates. They found documentation was somewhat lacking, but they have contributed some improvements to this.

Next was David Laing of CityIndex describing Logsearch, a powerful way to spin up clusters of ELK (Elasticsearch+Logstash+Kibana) servers for log analysis. Based on the BOSH toolchain and open sourced, this allows CityIndex to create clusters in minutes for handling large amounts of data (they are currently processing 50GB of logs every day). David showed how the system is resilient to server failure and will automatically 'resurrect' failed nodes, and interestingly how this enables them to use Amazon spot pricing at around a tenth of the cost of the more stable AWS offerings. I asked how this powerful system might be used in the general case of Elasticsearch cluster management but David said it is targeted at log processing – but of course according to some everything will soon be a log anyway!

The last talk was by Alex Mitchell and Francois Bouet of Growth Intelligence who provide lead generation services. They explained how they have used Elasticsearch at several points in their data flow – as a data store for the web pages they crawl (storing these in both raw and processed form using multi-fields), for feature generation using the term vector API and to encode simple business rules for particular clients – as well as to power the search features of their website, of course.

A short Q&A with some of the Elasticsearch team followed: we heard that the new Shield security plugin has had some third-party testing (the details of which I suggested are published if possible) and a preview of what might appear in the 2.0 release – further improvements to the aggregations features including derivatives and anomaly detection sound very useful. A swift drink and natter about the world of search with Mark Harwood and it was time to get the train home. Thanks to all the speakers and of course Yann for organising as ever – see you next time!

The post Elasticsearch London Meetup: Templates, easy log search & lead generation appeared first on Flax.

Comparing Solr and Elasticsearch – here's the code we used
http://www.flax.co.uk/blog/2014/12/09/comparing-solr-and-elasticsearch-heres-the-code-we-used/
Tue, 09 Dec 2014

A couple of weeks ago we presented the initial results of a performance study between Apache Solr and Elasticsearch, carried out by my colleague Tom Mortimer. Over the last few years we’ve tested both engines for client projects and noticed some significant performance differences, which we thought deserved fuller investigation.

Although Flax is partnered with Solr-powered Lucidworks we remain completely independent and have no particular preference for either Solr or Elasticsearch – as Tom says in his slides they're 'both awesome'. We're also not interested in scoring points for or against either engine or the various commercial companies that support their development; we're actively using both in client projects with great success. As it turned out, the results of the study showed that performance was broadly comparable, although Solr performed slightly better in filtered searches and seemed to support a much higher maximum queries per second.

We’d like to continue this work, but client projects will be taking a higher priority, so in the hope that others get involved both to verify our results and take the comparison further we’re sharing the code we used as open source. It would also be rather nice if this led to further performance tuning of both engines.

If you’re interested in other comparisons between Solr and Elasticsearch, here are some further links to try.

Do let us know how you get on, what you discover and how we might do things better!

The post Comparing Solr and Elasticsearch – here's the code we used appeared first on Flax.

A new Meetup for Lucene & Solr
http://www.flax.co.uk/blog/2014/12/01/a-new-meetup-for-lucene-solr/
Mon, 01 Dec 2014

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we've recently created (there's a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg's huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren't expecting huge numbers, but among the 25 or so attending we were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin's employer Lucidworks' own repository. It's great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

The post A new Meetup for Lucene & Solr appeared first on Flax.

London Elasticsearch User Group – September Meetup
http://www.flax.co.uk/blog/2014/09/04/london-elasticsearch-user-group-september-meetup/
Thu, 04 Sep 2014

Last night I joined a good-sized crowd at a venue on Hoxton Square for some talks on Elasticsearch – this Meetup group is very popular and always attracts a good proportion of people new to the world of search, as well as some familiar faces. I started with a quick announcement of our own Elasticsearch hackday in a few weeks' time.

First of the speakers was Richard Pijnenburg with a surprisingly brief talk on Puppet and Elasticsearch – brief, because integrating the two is apparently very simple, requiring only a few lines of Puppet code. Some questions from the floor sparked a discussion of combining Puppet and Vagrant for setting up Elasticsearch instances: apparently very soon we’ll see a complete demo instance of Elasticsearch built using these technologies and including some example data, which will be very useful for those wanting to get started with the engine (here’s some more on this combination).

Next was Amit Talhan, ably assisted by Geza Kerekes, both from AlignAlytics, who have been using Elasticsearch as a data store, a reporting store and more recently for analysing data from a survey of all the retail outlets in Nigeria. Generating a wealth of data across up to 1000 fields, including geolocation data harvested every five seconds, this survey could have been difficult if not impossible to handle using a traditional SQL database, but many of their colleagues were very used to SQL syntax and methods for analyzing data. Amit and Geza explained how they have used Elasticsearch and in particular aggregations to provide functionality such as checking for bad reporting by surveyors and spotting unexpectedly high-density areas (such as markets, where there may be 200 retail outlets in a few square metres). One challenge seems to have been how to explain to colleagues from the data analysis community that Elasticsearch can provide some, but not all, of the functionality of a traditional database, but that alternative ways of indexing and querying data can be used to solve the same problems. Interestingly, performance testing by AlignAlytics proved that BigStep, a provider of 'bare metal' cloud hosting, could provide much better performance than their own dedicated servers.
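As an illustration of the kind of check described, a request body along the following lines would bucket survey records by surveyor and, within each surveyor, count records per geohash cell, making it easy to spot implausibly dense reporting. The field names here are hypothetical; the nested terms and geohash_grid aggregations are standard Elasticsearch DSL:

```python
import json

# Hypothetical aggregation request: bucket by surveyor id, then count
# records per geohash cell within each surveyor's bucket.
request_body = {
    "size": 0,  # we only want aggregation results, not matching hits
    "aggs": {
        "by_surveyor": {
            "terms": {"field": "surveyor_id", "size": 50},
            "aggs": {
                "by_cell": {
                    "geohash_grid": {"field": "location", "precision": 7}
                }
            },
        }
    },
}

payload = json.dumps(request_body)  # body for POST /survey/_search
```

A surveyor whose buckets concentrate hundreds of records in a single cell might be covering a genuine market, or might be filing reports without moving; either way, the aggregation surfaces the cases worth inspecting.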

Next was Mark Harwood with another of his fascinating investigations into how Elasticsearch can be used for analysis of user behaviour, showing how, after a bad personal experience buying a new battery that turned out to be second-hand, he identified Amazon.com vendors with suspiciously positive reviews. He also discussed how behaviour-based term suggesters might be built using Elasticsearch's significant_terms aggregation. His demonstration did remind me slightly of Xapian's relevance feedback feature. I heard several people later say that they wished they had time for some of the fun projects Mark seems to work on!

The event finished with some lively discussion and some free pizza courtesy of Elasticsearch (the company). Thanks to Yann Cluchey as ever for organising the event and I look forward to seeing a few of the attendees in Cambridge soon – we’re only an hour or so by train from Cambridge plus a ten minute walk to the venue, so it should be an easy trip!

The post London Elasticsearch User Group – September Meetup appeared first on Flax.
