Posts Tagged ‘events’

Elastic London User Group Meetup – scaling with Kafka and Cassandra

The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to 6 datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split brain problem being one condition to be watched for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’) before settling on Cassandra which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .Net compatible clients was key and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be a Elastic(on) in Europe some time this year (if anyone from the Elastic team is reading this please try and avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.

IntraTeam 2015 – a brief visit

Last week I dropped in on the IntraTeam 2015 conference in Copenhagen, an event focused on intranets with some content on enterprise search. After a rather pleasant evening of Thai food and networking I attended the last day of the event. The keynote speaker was Dave Snowden, who has an amusing and rather curmudgeonly style of presentation, making sure to note the previous presenters he’d disagreed with for their over-reliance on simplistic concepts of knowledge and how the brain works. His talk was however very interesting and introduced the Cynevin framework (a Welsh word which apparently refers to homing sheep!). He also discussed how the rush to digitisation has had a cost in terms of human cognition, how the concept of an intranet will soon disappear (a brave assertion at an intranet conference) and how future systems should perhaps use storytelling metaphors – with some great examples of how collecting these micro-narratives from employees and others can produce extremely rapid feedback on the health of a business.

Andreas Hallgren of Chalmers University showed the evolution of their site-wide search facility, now based on Apache Solr. Unsurprisingly one of the main problems was determining who ‘owns’ search in their organisation: at least now they have a staff member who dedicates 25% of their time to improving search. He had some interesting points about the seasonality of academic searches and how analytics can be used to ‘measure more, guess less’. I was up next talking about Search Turned Upside Down, using a similar set of slides to this one: thanks to all who came and asked some great questions.

Next was Helen Lippell who I have heard speak before on how to get Enterprise Search right – Helen had some great anecdotes and guidance for an attentive audience. Ed Dale followed with five tips for great search: index the right content, optimise this content, measure search, make a great UI and listen to your users – I can only agree! He also characterised the different kinds of content including the worrying ‘content we think we have but we don’t’. The last presentation I attended was by Anders Quitzau of IBM on their fascinating Watson technology: sadly this was a rather marketing-heavy set of slides, with plenty of newly minted buzzwords such as Cognitive Computing and very little useful detail.

Thanks to Kurt Kragh Sorenson and Kristian Norling for inviting me to speak and attend the conference, next time I hope to see a little more of the event!

Lucene/Solr London User Group – Alfresco & Datastax

We had another London user group Meetup last week, hosted by Reed.co.uk who also provided some tasty pizza – eaten under the ‘Love Mondays’ sign from their adverts, which now lives in their boardroom! A few new faces this time and a couple of great talks from two companies who have incorporated Solr into their platforms.

First up was Andy Hind, a founding developer of document management company Alfresco, who told us all about how they originally based their search capability on Lucene 2.4, then moved to Solr 4.4 and most recently version 4.9.1. Using Solr they have implemented often complex security requirements (originally using a PostFilter as Erik Hatcher describes and more recently in the query itself), structured queries (using Phrase and SpanQueries) and their own domain specific query language (DSL) – they can support SQL-like, Lucene and Google-like queries by passing them through parsers based on ANTLR to be served either by the search engine or whatever relational database Alfresco is using. The move to a recent version of Solr has allowed the most recent release of Alfresco to support various modern search features (facets, spelling suggestions etc.) but Andy did mention that so far they are not using SolrCloud for scaling, preferring to manage this themselves.

Next up was Sergio Bossa of Datastax, talking about how their Datastax Enterprise (DSE) product incorporates Solr searching within an Apache Cassandra cluster. Sergio has previously spoken at our Cambridge search meetup on a very similar subject, so I won’t repeat myself here, but the key point is that Solr lives directly on top of the Cassandra cluster, so you don’t have to worry about it at all – search features are directly available from the Cassandra APIs. Like Alfresco, this is an alternative to SolrCloud (assuming you also need a NoSQL database of course!).

Thanks again to Alex Rice for hosting the Meetup, to both our speakers and to all who came – we’ll return soon! In the meantime you may want to check out a few events coming later this year: Berlin Buzzwords, ApacheCon Europe and Lucene/Solr Revolution.

Tags: , , , ,

Posted in Technical, events

February 16th, 2015

No Comments »

Elasticsearch London Meetup: Templates, easy log search & lead generation

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you’ll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly).

First up was Reuben Sutton of Artirix, describing how his team had moved away from the Elasticsearch Ruby libraries (which can be very slow, mainly due to the time taken to decode/encode data as JSON) towards the relatively new Mustache templating framework. This has allowed them to remove anything complex to do with search from their UI code, although they have had some trouble with Mustache’s support for partial templates. They found documentation was somewhat lacking, but they have contributed some improvements to this.

Next was David Laing of CityIndex describing Logsearch, a powerful way to spin up clusters of ELK (Elasticsearch+Logstash+Kibana) servers for log analysis. Based on the BOSH toolchain and open sourced, this allows CityIndex to create clusters in minutes for handling large amounts of data (they are currently processing 50GB of logs every day). David showed how the system is resilient to server failure and will automatically ‘resurrect’ failed nodes, and interestingly how this enables them to use Amazon spot pricing at around a tenth of the cost of the more stable AWS offerings. I asked how this powerful system might be used in the general case of Elasticsearch cluster management but David said it is targetted at log processing – but of course according to some everything will soon be a log anyway!

The last talk was by Alex Mitchell and Francois Bouet of Growth Intelligence who provide lead generation services. They explained how they have used Elasticsearch at several points in their data flow – as a data store for the web pages they crawl (storing these in both raw and processed form using multi-fields), for feature generation using the term vector API and to encode simple business rules for particular clients – as well as to power the search features of their website, of course.

A short Q&A with some of the Elasticsearch team followed: we heard that the new Shield security plugin has had some third-party testing (the details of which I suggested are published if possible) and a preview of what might appear in the 2.0 release – further improvements to the aggregrations features including derivatives and anomaly detection sound very useful. A swift drink and natter about the world of search with Mark Harwood and it was time to get the train home. Thanks to all the speakers and of course Yann for organising as ever – see you next time!

Out and about in January and February

We’re speaking at a couple of events soon: if you’re in London and interested in Apache Lucene/Solr we’re also planning another London User Group Meetup soon.

Firstly my colleague Alan Woodward is speaking with Martin Kleppman at FOSDEM in Brussels (31st January-1st February) on Searching over streams with Luwak and Apache Samza – about some fascinating work they’ve been doing to combine the powerful ‘reverse search’ facilities of our Luwak library with Apache Samza’s distributed, stream-based processing. We’re hoping this means we can scale Luwak beyond its current limits (although those limits are pretty accomodating, as we know of systems where a million or so stored searches are applied to a million incoming messages every day). If you’re interested in open source search the Devroom they’re speaking in has lots of other great talks planned.

Next I’m talking about the wider applications of this kind of reverse search in the area of media monitoring, and how open source software in general can help you turn your organisation’s infrastructure upside down, at the Intrateam conference event in Copenhagen from February 24th-26th. Scroll down to find my talk at 11.35 am on Thursday 26th.

If you’d like to meet us at either of these events do get in touch.

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades and even more simple configuration changes was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then to turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregrations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it but creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than used the proven Zookeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

Searching & monitoring the Unified Log

This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

Search Solutions 2015 – Is semantic search finally here?

Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual this is a great chance to see what’s new in the linked worlds of web, intranet and enterprise search and this year there was a focus on semantic search by several of the presenters.

Peter Mika of Yahoo! started us off with a brief history of semantic search including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years leading to shared standards for semantically marking of web content and some large collections of knowledge, which allows them to display content for certain queries, e.g. actor’s biographies shown on the right of the usual search results. He suggested the next step is to better understand queries as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space, Google’s Knowledge Graph containing 40 billion facts about 530 million entities, built in part by converting web pages directly (including how some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.

Next up was our own Tom Mortimer presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few differences Tom has found that Solr can support three times the query rate. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which although mainly focused on academic research does now contain some cutting edge features including the aforementioned Learning to Rank and near real-time search.

The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a similar talk to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly most of their metric data is drawn from Google Analytics, as a heavy use of caching means they have few useful query logs.

Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters who showed some of the R&D projects he has worked on for data including legal content and mining Twitter for trading signals.

After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!

A new Meetup for Lucene & Solr

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks own repository. It’s great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

Enterprise Search & Discovery 2014, Washington DC

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ’search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners Lee’s seminal paper [PDF].

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.

Tags: , , , , , ,

Posted in events

November 12th, 2014

No Comments »