Elasticsearch London Meetup: Templates, easy log search & lead generation

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you’ll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly).

First up was Reuben Sutton of Artirix, describing how his team had moved away from the Elasticsearch Ruby libraries (which can be very slow, mainly due to the time taken to decode/encode data as JSON) towards the relatively new Mustache templating framework. This has allowed them to remove anything complex to do with search from their UI code, although they have had some trouble with Mustache’s support for partial templates. They found documentation was somewhat lacking, but they have contributed some improvements to this.

Next was David Laing of CityIndex describing Logsearch, a powerful way to spin up clusters of ELK (Elasticsearch+Logstash+Kibana) servers for log analysis. Based on the BOSH toolchain and open sourced, this allows CityIndex to create clusters in minutes for handling large amounts of data (they are currently processing 50GB of logs every day). David showed how the system is resilient to server failure and will automatically ‘resurrect’ failed nodes, and interestingly how this enables them to use Amazon spot pricing at around a tenth of the cost of the more stable AWS offerings. I asked how this powerful system might be used in the general case of Elasticsearch cluster management but David said it is targetted at log processing – but of course according to some everything will soon be a log anyway!

The last talk was by Alex Mitchell and Francois Bouet of Growth Intelligence who provide lead generation services. They explained how they have used Elasticsearch at several points in their data flow – as a data store for the web pages they crawl (storing these in both raw and processed form using multi-fields), for feature generation using the term vector API and to encode simple business rules for particular clients – as well as to power the search features of their website, of course.

A short Q&A with some of the Elasticsearch team followed: we heard that the new Shield security plugin has had some third-party testing (the details of which I suggested are published if possible) and a preview of what might appear in the 2.0 release – further improvements to the aggregrations features including derivatives and anomaly detection sound very useful. A swift drink and natter about the world of search with Mark Harwood and it was time to get the train home. Thanks to all the speakers and of course Yann for organising as ever – see you next time!

Out and about in January and February

We’re speaking at a couple of events soon: if you’re in London and interested in Apache Lucene/Solr we’re also planning another London User Group Meetup soon.

Firstly my colleague Alan Woodward is speaking with Martin Kleppman at FOSDEM in Brussels (31st January-1st February) on Searching over streams with Luwak and Apache Samza – about some fascinating work they’ve been doing to combine the powerful ‘reverse search’ facilities of our Luwak library with Apache Samza’s distributed, stream-based processing. We’re hoping this means we can scale Luwak beyond its current limits (although those limits are pretty accomodating, as we know of systems where a million or so stored searches are applied to a million incoming messages every day). If you’re interested in open source search the Devroom they’re speaking in has lots of other great talks planned.

Next I’m talking about the wider applications of this kind of reverse search in the area of media monitoring, and how open source software in general can help you turn your organisation’s infrastructure upside down, at the Intrateam conference event in Copenhagen from February 24th-26th. Scroll down to find my talk at 11.35 am on Thursday 26th.

If you’d like to meet us at either of these events do get in touch.

Solr Superclusters for improved federated search

As part of our BioSolr project, we’ve been discussing how best to create a federated search over several Apache Solr instances. In this case various research institutions across the world are annotating data objects representing proteins and it would be useful to search not just the original protein data, but what others have added to the body of knowledge. If an institution wants to use the annotations, the usual approach is to download the extra data regularly and add it into a local Solr index.

Luckily Solr is widely used in the bioinformatics community so we have commonality in the query API. The question is would it be possible to use some of the distributed querying capabilities of SolrCloud to search not just the shards of a single index, but a group of Solr/SolrCloud indices – a supercluster.

This is a bit like a standard federated search, where queries are farmed out to various disparate search engines and the results then combined and displayed in some fashion. However, since we are sharing a single technology, powerful features such as result grouping would be possible.

For this to work at all, there would need to be some agreed standard between the various Solr systems: a globally unique record identifier for example (possibly implemented with a prefix unique to each institution). Any data that was required for result grouping would have to share a schema across the entire supercluster – let’s call this the primary schema – but basic searching and faceting could still be carried out over data with a differing, secondary schema. Solr dynamic fields might be useful for this secondary schema.

Luckily, research institutions are used to working as part of a consortium, and one of the conditions for joining would be agreeing to some common standards. A single Solr query API would then be available to all members of the consortium, to search not just their own data but everything available from their partners, without the slow and error-prone process of copying the data for local indexing.

We’re currently evaluating the feasibility of this idea and would welcome input from others – let us know what you think in the comments!

Tags: , , ,

Posted in Technical

January 20th, 2015

1 Comment »

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades and even more simple configuration changes was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then to turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregrations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it but creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than used the proven Zookeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

Comparing Solr and Elasticsearch – here’s the code we used

A couple of weeks ago we presented the initial results of a performance study between Apache Solr and Elasticsearch, carried out by my colleague Tom Mortimer. Over the last few years we’ve tested both engines for client projects and noticed some significant performance differences, which we thought deserved fuller investigation.

Although Flax is partnered with Solr-powered Lucidworks we remain completely independent and have no particular preference for either Solr or Elasticsearch – as Tom says in his slides they’re ‘both awesome’. We’re also not interested in scoring points for or against either engine or the various commercial companies that are support their development; we’re actively using both in client projects with great success. As it turned out, the results of the study showed that performance was broadly comparable, although Solr performed slightly better in filtered searches and seemed to support a much higher maximum queries per second.

We’d like to continue this work, but client projects will be taking a higher priority, so in the hope that others get involved both to verify our results and take the comparison further we’re sharing the code we used as open source. It would also be rather nice if this led to further performance tuning of both engines.

If you’re interested in other comparisons between Solr and Elasticsearch, here are some further links to try.

Do let us know you get on, what you discover and how we might do things better!

Searching & monitoring the Unified Log

This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

Search Solutions 2015 – Is semantic search finally here?

Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual this is a great chance to see what’s new in the linked worlds of web, intranet and enterprise search and this year there was a focus on semantic search by several of the presenters.

Peter Mika of Yahoo! started us off with a brief history of semantic search including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years leading to shared standards for semantically marking of web content and some large collections of knowledge, which allows them to display content for certain queries, e.g. actor’s biographies shown on the right of the usual search results. He suggested the next step is to better understand queries as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space, Google’s Knowledge Graph containing 40 billion facts about 530 million entities, built in part by converting web pages directly (including how some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.

Next up was our own Tom Mortimer presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few differences Tom has found that Solr can support three times the query rate. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which although mainly focused on academic research does now contain some cutting edge features including the aforementioned Learning to Rank and near real-time search.

The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a similar talk to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly most of their metric data is drawn from Google Analytics, as a heavy use of caching means they have few useful query logs.

Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters who showed some of the R&D projects he has worked on for data including legal content and mining Twitter for trading signals.

After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!

A new Meetup for Lucene & Solr

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks own repository. It’s great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

More than an API – the real third wave of search technology

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative or market leading – it’s simply an API to a cloud hosted search engine and some associated services. Amazon Cloudsearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including Found.no and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard a HP representative even suggest that IDOL OnDemand is ‘free software’ which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores and folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items, thousands of separate indexes : the scale of some of these systems is incredible and only economically possible where license fees aren’t a factor. Across our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.

Enterprise Search & Discovery 2014, Washington DC

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ’search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners Lee’s seminal paper [PDF].

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.

Tags: , , , , , ,

Posted in events

November 12th, 2014

No Comments »