Archive for the ‘events’ Category

Elasticsearch London user group – The Guardian & Orchestrate test the limits

Last week I popped into the Elasticsearch London meetup, hosted this time by The Guardian newspaper. Interestingly, the overall theme of this event was not just what the (very capable and flexible) Elasticsearch software is capable of, but also how things can go wrong and what to do about it.

Jenny Sivapalan and Mariot Chauvin from the Guardian’s technical team described how Elasticsearch powers the Content API, used not just for the newspaper’s own website but internally and by third party applications. Originally this was built on Apache Solr (I heard about this the last time I attended a search meetup at the Guardian) but this system was proving difficult to scale elastically, taking a few minutes before new content was available and around an hour to add a new server. Instead of upgrading to SolrCloud (which probably would have solved some of these issues) the team decided to move to Elasticsearch with targets of less than 5 seconds for new content to become live and generally a quicker response to traffic peaks. The team were honest about what had gone wrong during this process: oversharding led to problems caused by Java garbage collection, some of the characteristics of the Amazon cloud hosting used (in particular, unexpected server shutdowns for maintenance) required significant tweaking of the Elasticsearch startup process and they were keen to stress that scripting must be disabled unless you want your search servers to be an easy target for hackers. Although Elasticsearch promises that version upgrades can usually be done on a live cluster, the Guardian team found this unreliable in a majority of cases. Their eventual solution for version upgrades and even more simple configuration changes was to spin up an entirely new cluster of servers, switch over by changing DNS settings and then to turn off the old cluster. They have achieved their performance targets though, with around 375 requests/second supported and less than 15 minutes for a failed node to recover.

After a brief presentation from Colin Goodheart-Smithe of Elasticsearch (the company) on scripted aggregrations – a clever way to gather statistics, but possibly rather fiddly to debug – we moved on to Ian Plosker of Orchestrate.io, who provide a ‘database as a service’ backed by HBase, Elasticsearch and other technologies, and his presentation on Schemalessness Gone Wrong. Elasticsearch allows you submit data for indexing without pre-defining a schema – but Ian demonstrated how this feature isn’t very reliable in practice and how his team had worked around it but creating a ‘tuplewise transform’, restructuring data into pairs of ‘field name, field value’ before indexing with Elasticsearch. Ian was questioned on how this might affect term statistics and thus relevance metrics (which it will) but replied that this probably won’t matter – it won’t for most situations I expect, but it’s something to be aware of. There’s much more on this at Orchestrate’s own blog.

We finished up with the usual Q&A which this time featured some hard questions for the Elasticsearch team to answer – for example why they have rolled their own distributed configuration system rather than used the proven Zookeeper. I asked what’s going to happen to the easily embeddable Kibana 3 now Kibana 4 has its own web application (the answer being that it will probably not be developed further) and also about the licensing and availability of their upcoming Shield security plugin for Elasticsearch. Interestingly this won’t be something you can buy as a product, rather it will only be available to support customers on the Gold and Platinum support subscriptions. It’s clear that although Elasticsearch the search engine should remain open source, we’re increasingly going to see parts of its ecosystem that aren’t – users should be aware of this, and that the future of the platform will very much depend on the business direction of Elasticsearch the company, who also centrally control the content of the open source releases (in contrast to Solr which is managed by the Apache Foundation).

Elasticsearch meetups will be more frequent next year – thanks Yann Cluchey for organising and to all the speakers and the Elasticsearch team, see you again soon I hope.

Searching & monitoring the Unified Log

This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

Search Solutions 2015 – Is semantic search finally here?

Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual this is a great chance to see what’s new in the linked worlds of web, intranet and enterprise search and this year there was a focus on semantic search by several of the presenters.

Peter Mika of Yahoo! started us off with a brief history of semantic search including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years leading to shared standards for semantically marking of web content and some large collections of knowledge, which allows them to display content for certain queries, e.g. actor’s biographies shown on the right of the usual search results. He suggested the next step is to better understand queries as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space, Google’s Knowledge Graph containing 40 billion facts about 530 million entities, built in part by converting web pages directly (including how some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.

Next up was our own Tom Mortimer presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few differences Tom has found that Solr can support three times the query rate. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which although mainly focused on academic research does now contain some cutting edge features including the aforementioned Learning to Rank and near real-time search.

The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a similar talk to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly most of their metric data is drawn from Google Analytics, as a heavy use of caching means they have few useful query logs.

Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters who showed some of the R&D projects he has worked on for data including legal content and mining Twitter for trading signals.

After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!

A new Meetup for Lucene & Solr

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks own repository. It’s great to see effort being put into these kind of tests as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

Enterprise Search & Discovery 2014, Washington DC

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ’search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners Lee’s seminal paper [PDF].

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.

Tags: , , , , , ,

Posted in events

November 12th, 2014

No Comments »

Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup

It’s looking like a busy Autumn for search events – first, I’m presenting at Enterprise Search & Discovery 2014 in Washington DC on November 5th, talking about ‘Turning Search Upside Down with open source software’. I’ll be describing how we’ve replaced various underperforming, big name closed source search engines with faster & more scalable open source technology, including our own Luwak stored query engine. Do let me know if you’re in DC, I’d be very happy to meet up. The week after this is Lucene Revolution, which sadly we won’t be attending this year, but it is recommended if you’re interested in Lucene and Solr.

Towards the end of November there’s Search Solutions, a great day of presentations about all aspects of search held at the British Computer Society in Covent Garden. This year Tom Mortimer from Flax will be presenting some research we’ve done into performance comparisons between Lucene/Solr and Elasticsearch, and there are also presentations from Thomson Reuters, the British Library, Microsoft, Yahoo! and Google. I highly recommend this event, it’s always worth attending.

We’re also starting a new Meetup in London, a group for users of Apache Lucene/Solr (there’s an Elasticsearch London user group but strangely no equivalent for the other popular stack). Our first event is on November 28th, kindly hosted by Bloomberg (who are no strangers to Lucene/Solr themselves) and featuring Shalin Mangar, a Lucene/Solr committer from Lucidworks who is visiting Europe that week. We’re hoping that we can run these events every few months, but we need help from the community, so if you could talk, sponsor or host the Meetups do let us know.

In December we’ll be holding another Cambridge Search Meetup and will be talking about our work with the European Bioinformatics Institute on the BioSolr project – the date to be confirmed. Busy times!

Cambridge Search Meetup – Elasticsearch Hackday

Last Friday we hosted a hackday featuring Elasticsearch in Cambridge, following a similar event last year focused on Apache Lucene/Solr. Around 20 people attended from organisations working in sectors including analytics, digital music, bioinformatics and e-commerce, and all the Flax team were there as well.

We started with a brief presentation on Elasticsearch and asked around the room for any data collections we might be able to use. Lee from Elasticsearch (the company) had brought collections of UK crime data and the complete works of Shakespeare; we also had several million rows of digital music metadata, Wikipedia edit data for all UK MPs (to follow last year’s theme!) and several years of data describing Premier League football. Unlike our Solr hackday where each team worked on the same general task, this time we split into four different teams who worked on all of the above except the Wikipedia edits. We’d also been provided with a very high-performance Elasticsearch cluster by BigStep for our use, which meant it was very quick to index the above data and start working with it.

By lunchtime (the food was sponsored by Elasticsearch, who also provided stickers, plush ELKs and lollypops – thanks guys!) we had some very basic information about the various datasets – such as which scene in which Shakespeare play has the most characters on stage (the answer is 21 in Richard III), and which football teams seemed to gain the most advantage from playing at home. Note that we had already moved beyond basic search functionality to use Elasticsearch as an analytic platform, answering particular questions, using features such as aggregations.

We continued during the afternoon to develop the various applications and finished with a ’show and tell’. Some of the teams had managed to develop user interfaces for Elasticsearch, the most polished being a clickable Google Map that would show you which types of crime were significantly above and below the national average for the area you selected – unsurprisingly in Cambridge, stolen bicycles were very common! By the end of the day, everyone had gained experience of Elasticsearch, some for the first time. We finished the day, as is traditional, with a swift pint and further networking.

Thanks to Cambridge Business Lounge (a highly recommended co-working space) for the venue, BigStep for hosting and Elasticsearch for sponsoring lunch and providing the swag, and of course to all who attended. We’ll return with a further Cambridge Search Meetup soon!

BioSolr begins with a workshop day

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr. Sameer Velankar and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood but is a much easier and full featured system to work with).

After lunch and some networking, we gave a further short presentation on comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions‘ innovative method for indexing heirarchical data with Solr using XML. His talk mentioned how by encoding tree positions directly within the index, far fewer Solr documents need to be created, with an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large scale bioinformatics search platform backed both by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language, standard result schemas and also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.

Next were a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation, best practises, guidance on migration and scaling and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.

London Elasticsearch User Group – September Meetup

Last night I joined a good-sized crowd at a venue on Hoxton Square for some talks on Elasticsearch – this Meetup group is very popular and always attracts a good proportion of people new to the world of search, as well as some familiar faces. I started with a quick announcement of our own Elasticsearch hackday in a few weeks time.

First of the speakers was Richard Pijnenburg with a surprisingly brief talk on Puppet and Elasticsearch – brief, because integrating the two is apparently very simple, requiring only a few lines of Puppet code. Some questions from the floor sparked a discussion of combining Puppet and Vagrant for setting up Elasticsearch instances: apparently very soon we’ll see a complete demo instance of Elasticsearch built using these technologies and including some example data, which will be very useful for those wanting to get started with the engine (here’s some more on this combination).

Next was Amit Talhan, ably assisted by Geza Kerekes, both from AlignAlytics who have been using Elasticsearch both as a data store, reporting store and more recently for analysing data from a survey of all the retail outlets in Nigeria. Generating a wealth of data across up to 1000 fields, including geolocation data harvested every five seconds, this survey could have been difficult if not impossible to handle using a traditional SQL database, but many of their colleagues were very used to SQL syntax and methods for analyzing data. Amit and Geza explained how they have used Elasticsearch and in particular aggregations to provide functionality such as checking for bad reporting by surveyors and unexpectedly high density areas (such as markets, where there may be 200 retail outlets in a few square metres). One challenge seems to have been how to explain to colleagues from the data analysis community that Elasticsearch can provide some, but not all of the functionality of a traditional database, but that alternative ways of indexing and querying data can be used to solve the same problems. Interestingly, performance testing by AlignAlytics proved that BigStep, a provider of ‘bare metal’ cloud hosting, could provide much better performance than their own dedicated servers.

Next was Mark Harwood with another of his fascinating investigations into how Elasticsearch can be used for analysis of user behaviour, showing how after a bad personal experience buying a new battery that turned out to be second-hand, he identified Amazon.com vendors with suspiciously positive reviews. He also discussed how behaviour-based term suggesters might be built using Elasticsearch’s significant_terms aggregration. His demonstration did remind me slightly of Xapian’s relevance feedback feature. I heard several people later say that they wished they had time for some of the fun projects Mark seems to work on!

The event finished with some lively discussion and some free pizza courtesy of Elasticsearch (the company). Thanks to Yann Cluchey as ever for organising the event and I look forward to seeing a few of the attendees in Cambridge soon – we’re only an hour or so by train from Cambridge plus a ten minute walk to the venue, so it should be an easy trip!

Cambridge Search Meetup – Knowledge Discovery & Wayfinding

We were lucky enough to have two speakers from Cambridge text mining company Linguamatics at last night’s Meetup. Robin Newton kicked us off with an amusing and idiosyncratic view of the uses and mis-uses of search – starting with the problem that when you have text search software, every problem can look like search might solve it. He gave an example of his recent search for a new job: although matching his skills on paper with a potential employer’s needs is one thing, he also wants to be sure the employer ‘isn’t a crook’! With reference to Tyler Tate’s talks on Information Wayfinding, which in turn quotes urban planner Kevin Lynch, Robin told us how he felt that search ‘journeys’ weren’t always the most efficient way to discover an answer: his assertion was that finding a person who could tell you was more useful. Since even in the most efficient and well-run organisation not all information is held in documents one might agree that finding an ‘expert’ is the best way to get the answers one needs. He finished with a welcome message that informal networking in pubs and cafes (much like our Meetup) helps share a lot of very useful information – and this is how he eventually decided that Linguamatics was going to be a great place to work.

Next was CTO and co-founder of Linguamatics, Dr David Milward, who described his company’s capability in text mining, Natural Language Processing (NLP) and search. He described the challenges of extracting ‘concepts’ from text – how words and acronyms with multiple potential meanings are difficult to parse automatically without contextual knowledge. Linguamatics’ approach has been described as ‘Agile NLP’ and allows the quick development of new patterns for concept extraction. A powerful example he gave was how by specifying a relationship between two entities, in this case one company acquiring another, structured data can be extracted from unstructured text. Other examples focused on the medical and bioscience field (a particular interest of ours at present due to the upcoming BioSolr project) and showed how their software can cluster facts and find connections between disparate pieces of data (‘which X relates to Y via Z’). This process can also be used to generate new facets for searching from free text, including for numeric ranges, and these can even be tailored for different user groups. It’s clear that Linguamatics are experts in this area and David’s talk was of great interest to many in the room, including several from the European Bioinformatics Institute.

We finished with the usual chat, networking and drinks. Thanks to both our speakers – and do let me know if you have a suggestion for a presentation at a future event!