Posts Tagged ‘elasticsearch’

Elastic London User Group Meetup – scaling with Kafka and Cassandra

The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to 6 datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split brain problem being one condition to be watched for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’) before settling on Cassandra which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .Net compatible clients was key and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be a Elastic(on) in Europe some time this year (if anyone from the Elastic team is reading this please try and avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.

Rebrands and changing times for Elasticsearch

I’ve always been careful to distinguish between Elasticsearch (the open source search server based on Lucene) and Elasticsearch (the company formed by the authors of the former) and it seems someone was listening, as the latter has now rebranded as simply Elastic. This was one of the big announcements during their first conference, the other being that after acquiring Norwegian company Found they are now offering a fully hosted Elasticsearch-as-a-service (congratulations to Alex and others at Found!). As Ben Kepes of Forbes writes, this may be something to do with ‘managing tensions within the ecosystem’ (I’ve written previously on how this ecosystem is expanding to include closed-source commercial products, which may make open source enthusiasts nervous) but it’s also an attempt to move away from ’search’ into a wider area encompassing the buzzwords-de-jour of Big Data Analytics.

In any case, it’s clear that Elastic (the company, and that’s hopefully the last time I’ll have to write this!) have a clear strategy for the future – to provide many different commercial options for Elasticsearch and its related projects for as many different use cases as possible. Of course, you can still take the open source route, which we’re helping several clients with at present – I hope to be able to present a case study on this very soon.

Meanwhile, Martin White has identified how a recent book on Elasticsearch describes literally hundreds of features and that ‘The skill lies in knowing which to implement given the nature of the content and the type of query that will be used’ – effective search, as ever, remains a difficult thing to get right, no matter what technology option you choose.

UPDATE: It seems that www.elasticsearch.org, the website for the open source project, is now redirecting to the commercial company website…there is now a new Github page for open source code at https://github.com/elastic

Tags: , , , ,

Posted in News

March 11th, 2015

1 Comment »

A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access

Stephen Arnold, whose blog I enjoy due to its unabashed cynicism about overenthusiastic marketing of search technology, was kind enough to send me a copy of his recent report on CyberOSINT & Next Generation Information Access (NGIA), the latter being a term he has recently coined. OSINT itself refers to intelligence gathered from open, publically available sources, not anything to do with software licenses – so yes, this is all about the NSA, CIA and others, who as you might expect are keen on anything that can filter out the interesting from the noise. Let’s leave the definition (and the moral questionability) of ‘publically available’ aside for now – even if you disagree with its motives, this is a use case which can inform anyone with search requirements of the state of the art and what the future holds.

The report starts off with a foreword by Robert David Steele, who has had a varied and interesting career and lately has become a cheerleader for the other kind of open source – software – as a foundation for intelligence gathering. His view is that the tools used by the intelligence agencies ‘are also not good enough’ and ‘We have a very long way to go’. Although he writes that ‘the systems described in this volume have something to offer’ he later concludes that ‘This monograph is a starting point for those who might wish to demand a “full spectrum” solution, one that is 100% open source, and thus affordable, interoperable, and scalable.’ So for those of us in the open source sector, we could consider Arnold’s report as a good indicator of what to shoot for, a snapshot of the state of the art in search.

Arnold then starts the report with some explanation of the NGIA concept. This is largely a list of the common failings of traditional search platforms (basic keyword search, oft-confusing syntax, separate silos of information, lack of multimedia features and personalization) and how they might be addressed (natural language search, automatic querying, federated search, analytics). I am unconvinced this is as big a step as Arnold suggests though: it seems rather to imply that all past search systems were badly set up and configured and somehow a NGIA system will magically pull everything together for you and tell you the answer to questions you hadn’t even asked yet.

Disappointingly the exemplar chosen in the next chapter is Autonomy IDOL: regular readers will not be surprised by my feelings about this technology. Arnold suggests the creation of the Autonomy software was influenced by cracking World War II codes, rock music and artificial intelligence, which is in my mind adding egg to an already very eggy pudding, and not in step with what I know about the background of Cambridge Neurodynamics (Autonomy’s progenitor, created very soon after – and across the corridor from – Muscat, another Cambridge Bayesian search technology firm where Flax’s founders cut their teeth on search). In particular, Autonomy’s Kenjin tool – which automatically suggested related documents – is identified as a NGIA feature, although at the time I remember it being reminiscent of features we had built a year earlier at Muscat – we even applied for a patent. Arnold does note that ‘[Autonomy founder, Mike] Lynch and his colleagues clamped down on information about the inner workings of its smart software.’ and ‘The Autonomy approach locks down the IDOL components.’ – this was a magic black box of course, with a magically increasing price tag as well. The price tag rose to ridiculous dimensions (even after an equally ridiculous writedown) when Hewlett Packard bought the company.

The report continues with analysis of various other potential NGIA contenders, including Google-funded timeline analysis specialists Recorded Future and BAE Detica – interestingly one of the search specialists from this British company has now gone on to work at Elasticsearch.

The report concludes with a look at the future, correctly identifying advanced analytics as one key future trend. However this conclusion also echoes the foreword, with ‘The cost of proprietary licensing, maintenance, and training is now killing the marketplace. Open source alternatives will emerge, and among these may be a 900 pound gorilla that is free, interoperable and scalable.’. Although I have my issues with some of the examples chosen, the report will be very useful I’m sure to those in the intelligence sector, who like many are still looking for search that works.

Elasticsearch London Meetup: Templates, easy log search & lead generation

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you’ll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly).

First up was Reuben Sutton of Artirix, describing how his team had moved away from the Elasticsearch Ruby libraries (which can be very slow, mainly due to the time taken to decode/encode data as JSON) towards the relatively new Mustache templating framework. This has allowed them to remove anything complex to do with search from their UI code, although they have had some trouble with Mustache’s support for partial templates. They found documentation was somewhat lacking, but they have contributed some improvements to this.

Next was David Laing of CityIndex describing Logsearch, a powerful way to spin up clusters of ELK (Elasticsearch+Logstash+Kibana) servers for log analysis. Based on the BOSH toolchain and open sourced, this allows CityIndex to create clusters in minutes for handling large amounts of data (they are currently processing 50GB of logs every day). David showed how the system is resilient to server failure and will automatically ‘resurrect’ failed nodes, and interestingly how this enables them to use Amazon spot pricing at around a tenth of the cost of the more stable AWS offerings. I asked how this powerful system might be used in the general case of Elasticsearch cluster management but David said it is targetted at log processing – but of course according to some everything will soon be a log anyway!

The last talk was by Alex Mitchell and Francois Bouet of Growth Intelligence who provide lead generation services. They explained how they have used Elasticsearch at several points in their data flow – as a data store for the web pages they crawl (storing these in both raw and processed form using multi-fields), for feature generation using the term vector API and to encode simple business rules for particular clients – as well as to power the search features of their website, of course.

A short Q&A with some of the Elasticsearch team followed: we heard that the new Shield security plugin has had some third-party testing (the details of which I suggested are published if possible) and a preview of what might appear in the 2.0 release – further improvements to the aggregrations features including derivatives and anomaly detection sound very useful. A swift drink and natter about the world of search with Mark Harwood and it was time to get the train home. Thanks to all the speakers and of course Yann for organising as ever – see you next time!

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades and even more simple configuration changes was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then to turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregrations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it but creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than used the proven Zookeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

Comparing Solr and Elasticsearch – here’s the code we used

A couple of weeks ago we presented the initial results of a performance study between Apache Solr and Elasticsearch, carried out by my colleague Tom Mortimer. Over the last few years we’ve tested both engines for client projects and noticed some significant performance differences, which we thought deserved fuller investigation.

Although Flax is partnered with Solr-powered Lucidworks we remain completely independent and have no particular preference for either Solr or Elasticsearch – as Tom says in his slides they’re ‘both awesome’. We’re also not interested in scoring points for or against either engine or the various commercial companies that are support their development; we’re actively using both in client projects with great success. As it turned out, the results of the study showed that performance was broadly comparable, although Solr performed slightly better in filtered searches and seemed to support a much higher maximum queries per second.

We’d like to continue this work, but client projects will be taking a higher priority, so in the hope that others get involved both to verify our results and take the comparison further we’re sharing the code we used as open source. It would also be rather nice if this led to further performance tuning of both engines.

If you’re interested in other comparisons between Solr and Elasticsearch, here are some further links to try.

Do let us know you get on, what you discover and how we might do things better!

Searching & monitoring the Unified Log

This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

Search Solutions 2015 – Is semantic search finally here?

Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual this is a great chance to see what’s new in the linked worlds of web, intranet and enterprise search and this year there was a focus on semantic search by several of the presenters.

Peter Mika of Yahoo! started us off with a brief history of semantic search including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years leading to shared standards for semantically marking of web content and some large collections of knowledge, which allows them to display content for certain queries, e.g. actor’s biographies shown on the right of the usual search results. He suggested the next step is to better understand queries as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space, Google’s Knowledge Graph containing 40 billion facts about 530 million entities, built in part by converting web pages directly (including how some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.

Next up was our own Tom Mortimer presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few differences Tom has found that Solr can support three times the query rate. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which although mainly focused on academic research does now contain some cutting edge features including the aforementioned Learning to Rank and near real-time search.

The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a similar talk to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly most of their metric data is drawn from Google Analytics, as a heavy use of caching means they have few useful query logs.

Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters who showed some of the R&D projects he has worked on for data including legal content and mining Twitter for trading signals.

After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!

A new Meetup for Lucene & Solr

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks own repository. It’s great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

More than an API – the real third wave of search technology

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative or market leading – it’s simply an API to a cloud hosted search engine and some associated services. Amazon Cloudsearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including Found.no and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard a HP representative even suggest that IDOL OnDemand is ‘free software’ which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores and folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items, thousands of separate indexes : the scale of some of these systems is incredible and only economically possible where license fees aren’t a factor. Across our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.