Open source search events roundup for late 2015

Although it’s still high summer here in the UK (which means it’s probably raining) we’re already looking forward to the autumn and the events across the world we’ll be attending. In early September we’re running another free-to-attend London Lucene/Solr Usergroup Meetup, sponsored this time by BlackRock, who will be talking about using Solr for websites. At the end of September there is another Elasticsearch London Meetup, which we will also attend (and may speak at this time).

October brings the biggest event in the Lucene/Solr calendar, Lucene Revolution in Austin, Texas, a 4-day event with training and a conference. We’re happy to announce that Alan Woodward and Matt Pearce from Flax will be presenting “Searching the Stuff of Life: BioSolr” about our work with the European Bioinformatics Institute where we’ve been developing Solr features for use by bioinformaticians (and any others who find them useful of course!), for example ontology indexing and external JOINs.

A week later we’ll be at Enterprise Search Europe, where I’ll be delivering the keynote on The Future of Search (you can see an earlier version of this talk from the IKO Singapore conference last month). We’re also running a Meetup on the evening of the 20th open to both conference attendees and others – an informal chance to chat with other search folks. During the conference itself I’m particularly looking forward to hearing from Ian Williams of NHS Wales on Powering the Single Patient Record in NHS Wales with Apache Solr – this is a very large scale and exciting project using Solr for healthcare data.

Looking further ahead, in November we have plans to attend (and possibly speak) at Search Solutions 2015, a great one-day event in London which I highly recommend, and we are planning another event in Singapore together with a partner. As ever, do let us know if you would like to meet up at an event and talk open source search!

Elasticsearch Percolator & Luwak: a performance comparison of streamed search implementations

Most search applications work by indexing a relatively stable collection of documents and then allowing users to perform ad-hoc searches to retrieve relevant documents. However, in some cases it is useful to turn this model on its head, and match individual documents against a collection of saved queries. I shall refer to this model as “streamed search”.

One example of streamed search is media monitoring. The monitoring agency’s clients’ interests are represented by stored queries. Incoming documents (e.g. from the Twitter firehose) are matched against the stored queries, and hits are returned for further processing before being reported to the client. Another example is financial news monitoring, used to predict share price movements.

In both these examples, queries may be extremely complex (in order to improve the accuracy of hits). There may be hundreds of thousands of stored queries, and documents to be matched may be incoming at a rate of hundreds or thousands per second. Not surprisingly, streamed search can be a demanding task, and the computing resources required to support it a significant expense. There is therefore a need for the software to be as performant and efficient as possible.
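To make that cost concrete, here is a toy sketch of the naive approach (our illustration, not code from either library), in which every stored query is evaluated against every incoming document; queries here are reduced to sets of required terms:

```python
# Toy sketch of naive streamed search: every stored query is evaluated
# against every incoming document, so cost per document is O(queries).

def tokenize(text):
    return set(text.lower().split())

def matches(query_terms, doc_terms):
    # A document matches when it contains all of the query's required terms.
    return query_terms <= doc_terms

def stream_search(stored_queries, document):
    """Return the ids of all stored queries matching a single document."""
    doc_terms = tokenize(document)
    return [qid for qid, terms in stored_queries.items()
            if matches(terms, doc_terms)]

stored_queries = {
    "q1": {"share", "price"},
    "q2": {"twitter", "firehose"},
}
hits = stream_search(stored_queries, "Share price falls on weak results")
```

With hundreds of thousands of stored queries and thousands of documents arriving per second, it is this per-document cost that makes streamed search so demanding.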

The two leading open source streamed search implementations are Elasticsearch Percolator, and Luwak. Both depend on the Lucene search engine. As the developers of Luwak, we have an interest in how its performance compares with Percolator. We therefore carried out some preliminary testing.

Ideally, we would have used real media monitoring queries and documents. However, these are typically protected by copyright, and the queries represent a fundamental asset of monitoring companies. In order to make the tests distributable, we chose to use freely downloadable documents from Wikipedia, and to generate random queries. These queries were much simpler in structure than the often deeply nested queries from real applications, but we believe that they still provide a useful comparison.

The tests were carried out on an Amazon EC2 r3.large VM running Ubuntu. We wrote a Python script to download, parse and store random Wikipedia articles, and another to generate random queries from the text. The query generator was designed to be somewhat “realistic”, in that each query should match more than zero documents. For Elasticsearch, we wrote scripts to index queries into the Percolator and then run documents through it. Since Luwak has a Java API (rather than Elasticsearch’s RESTful API), we wrote a minimal Java app to do the same.

10,000 documents were downloaded from Wikipedia, and 100,000 queries generated for each test. We generated four types of query:

  • Boolean with 10 required terms and 2 excluded terms
  • Boolean with 100 required terms and 20 excluded terms
  • 20 required wildcard terms, with a prefix of 4 characters
  • 2-term phrase query with slop of 5
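Our actual generator scripts are linked at the end of this post; purely as an illustration of the approach, a simplified generator for the first query type might look like this (the function name and parameters here are hypothetical, not taken from our scripts):

```python
import random

def generate_boolean_query(doc_text, n_required=10, n_excluded=2, seed=None):
    """Sketch of a 'realistic' random Boolean query generator: required
    terms are sampled from a real document's vocabulary, so each query
    is guaranteed to match at least that document."""
    rng = random.Random(seed)
    terms = list(set(doc_text.lower().split()))
    required = rng.sample(terms, n_required)
    # Excluded terms must not collide with the required set.
    remaining = [t for t in terms if t not in required]
    excluded = rng.sample(remaining, n_excluded)
    return {"must": required, "must_not": excluded}
```

The same idea extends to the other query types, e.g. truncating sampled terms to four characters for the wildcard queries, or picking adjacent terms for the phrase queries.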

We ran the tests independently, giving Luwak and Elasticsearch a JVM heap size of 8GB, and doing an initial pre-run in order to warm the OS cache (this did not actually have a noticeable effect). For sanity, we checked that each document matched the same queries in both Luwak and Percolator.

The results are shown in the graphs below, where the y-axis represents average documents processed per second.

[Results graph 1]

[Results graph 2]

Luwak was consistently faster than Percolator, ranging from a factor of 6 (for the phrase query type) to 40 (for the large Boolean queries).

The reason for this is almost certainly due to Luwak’s presearcher. When a query is added to Luwak, the library generates terms to index the query. For each incoming document, a secondary query is constructed and run against the query index, which returns a subset of the entire query set. Each of these is then run against the document in order to generate the final results. The effect of this is to reduce the number of primary queries which have to be executed against the document, often by a considerable factor (at a relatively small cost of executing the secondary query). Percolator does not have this feature, and by default matches every primary query against every document (it would be possible, but not straightforward, for an application to implement a presearch phase in Percolator). Supporting this analysis, when the Luwak presearcher was disabled its performance dropped to about the same level as Percolator.
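To illustrate the principle, here is a toy term-filtered presearcher in Python. This is our simplified sketch, not Luwak’s actual implementation (which handles Boolean structure, wildcards, phrases and much more), but it shows how a secondary lookup over indexed query terms cuts down the set of primary queries that need full evaluation:

```python
from collections import defaultdict

class Presearcher:
    """Toy term-filtered presearcher: each stored query is indexed by its
    terms, and a document's terms select a candidate subset of queries,
    so only that subset needs full (primary) evaluation."""

    def __init__(self):
        self.term_index = defaultdict(set)  # term -> query ids
        self.queries = {}                   # query id -> required-term set

    def add_query(self, qid, required_terms):
        self.queries[qid] = set(required_terms)
        for term in required_terms:
            self.term_index[term].add(qid)

    def match(self, doc_text):
        doc_terms = set(doc_text.lower().split())
        # Secondary query: any stored query sharing a term is a candidate.
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # Primary queries: full evaluation, but only over the candidates.
        return sorted(qid for qid in candidates
                      if self.queries[qid] <= doc_terms)
```

For a document sharing terms with only a handful of the stored queries, the primary evaluation step shrinks from the whole query set to that handful, which is exactly the effect described above.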

These results must be treated with a degree of caution, for several reasons. As already explained, the queries used were randomly generated, and far simpler in structure than typical hand-crafted monitoring queries. Furthermore, the tests were single-threaded and single-sharded, whereas a multithreaded, multi-shard, distributed architecture would be typical for a real-world system. Finally, Elasticsearch Percolator is a service providing a high-level, RESTful API, while Luwak is much lower level, and would require significantly more application-level code to be implemented for a real installation.

However, since both Luwak and Percolator use the same underlying search technology, it is reasonable to conclude that the Luwak presearcher can give it a considerable performance advantage over Percolator.

If you are already using Percolator, should you change? If performance is not a problem now and is unlikely to become a problem, then the effort required is unlikely to be worth it. Luwak is not a drop-in replacement for Percolator. However, if you are planning a data-intensive streaming search system, it would be worth comparing the two. Luwak works well with existing high-performance distributed computing frameworks, which would enable applications using it to scale to very large query sets and document streams.

Our test scripts are available here. We would welcome any attempts to replicate, extend or contest our results.


Posted in Technical

July 27th, 2015


BioSolr at BOSC 2015 – open source search for bioinformatics

Matt Pearce writes:

I spent most of last Friday at the Bioinformatics Open Source Conference (BOSC) Special Interest Group meeting in Dublin, as part of this year’s ISMB/ECCB conference. Tony Burdett from EMBL-EBI was giving a quick talk about the BioSolr project, and I went along to speak to people at the poster session afterwards about what we are doing, and how other teams could get involved.

Unfortunately, I missed the first half of Holly Bik’s keynote (registration seemed to take forever, hindered by dubious wifi and a printer that refused to cooperate), which used the vintage Oregon Trail game as a great analogy for biologists getting into bioinformatics – there are many, frequently intimidating, options when choosing how to analyse data, and picking the right one can be scary (this is something that definitely applies to the areas we work in as well).

There was a new approach to the traditional Q&A session afterwards as well, with questions submitted on cards around the room and via a Twitter hashtag. This worked pretty well, although Twitter latency slowed things down a couple of times and there were still a few shouted-out questions from the floor; it was certainly better than having volunteers with microphones trying to reach the questioner across rows of people.

The morning session was on Data Science, and while a number of the talks went over my head somewhat, it was interesting to see how tools like Hadoop are being used in bioinformatics. It was good to see the spirit of collaboration in action too, with Sebastian Schoenherr’s talk about CloudGene, a project that came about following an earlier BOSC and that implements a graphical front end for Hadoop. Tony’s talk about BioSolr went down well – a show of hands indicated that around 75% of those in the room were using Lucene, Solr and/or Elasticsearch in some form. This backs up our earlier experience at the EBI, where the first BioSolr workshop was attended by teams from all over the campus, using Lucene or Solr in various versions to store and search their data.

Crossing over with lunch was the poster session, where Tony and I spoke to people about BioSolr. The Jalview team seemed especially interested in potential crossover with their project, and there was plenty of interest generally in how the various extensions we have worked on (XJoin, hierarchical faceting) could be fitted into other projects.

The afternoon session was on the subject of Standards and Interoperability, starting with a great talk from Michael Crusoe about the Common Workflow Language, which started life at the BOSC 2014 codefest. There were several talks about Galaxy, a cloud-based platform for sharing data analyses, linking many other tools to allow workflows to be reproduced. Bruno Vieira’s talk about BioNode was also very interesting, and I made notes to check out oSwitch when time is available.

I had to leave before the afternoon’s panel took place, but all in all it was a very interesting day learning how open source software is being used outside of the areas I usually work in.

The four types of open source search project

As I’m currently writing content for our new Flax website (which is taking far longer than anticipated for various reasons I won’t bore you with) I’ve been thinking about the sort of projects we encounter at Flax. You might find this useful if you’re planning or starting a search project with Solr or Elasticsearch. Note that not everything we do fits cleanly into these four categories!

The search idea

So you’ve got this idea and you’re convinced that you need search as part of the puzzle, but you’re not sure where it fits, whether it will be performant or how to gather and transform your data so it’s ready for searching. Perhaps you’re from a startup, or maybe part of a skunkworks project in a larger organisation. What you need is someone who really understands search software and what can be done with it to sit with you for a day or two, validate your technical choices, help you understand how to shape your data, even play with some basic indexing.

The proof of concept

You’re a little further along – you know what technology you’ll be using and you have some data all ready for indexing. However, before your funders or boss will release more budget you need to build something they can see (and search) – you’ll need an indexer and a basic search application. You could do it yourself but time is limited and you’ve not built a search application before. You’re expecting to spend a week or two developing something to show others, that lets them search real data and see real results. You might also want to experiment with scale – see what happens to performance when you add a few million items to the index, even if the schema isn’t quite right yet.

The big one

You’re building the big one – indexing complex data or many millions of items, and/or for a huge user base. You need to be very sure your indexing pipeline is fast, scales well, copes with updates and can transform data from many sources. You need to develop the very best search schema. Your search architecture must be resilient, cope with heavy load, failover cleanly and give the correct results. You’re assembling a team to build it but you need specialist help from people who have built this kind of system at scale before.

The migration

Finally you’ve secured budget to move away from the slow and inaccurate search engine that everyone hates! Search really does suck, but you now have a chance to make it better. However, although you know how to keep the old engine running you don’t have much experience of open source search. Even though the old engine isn’t great, you’re doing a lot of business with it and you want to be confident that relevance is as good (and hopefully better) with the new engine – maybe you want to develop a testing framework?

We’re also increasingly delivering training (both for business users who want to know the capabilities of open source search and for technical users who want to improve their knowledge – we can tailor this to your requirements) and ongoing support – but everything starts with a search project of some kind!

Innovations in Knowledge Organisation, Singapore: a review

I’m just back from Singapore: my first visit to this amazing, dynamic and ever-changing city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. I think this was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search, a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch & Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing Powerpoint presentations which don’t necessarily combine well with jetlag.

After lunch, showcasing Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again) we moved on to the first set of case studies. Each presenter had 6 minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, for three 20-minute sessions. I had some great discussions including hearing about how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on ‘The Future of Search’ – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyze the huge volumes of data we might expect from the Internet of Things. I had some great feedback from the audience (although I’m pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world’s first ‘smart nation’ – handling data will be absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore’s OneSearch service, which allowed a federated search across tens of millions of items from many different repositories (e.g. books, newspaper articles, audio transcripts). The technologies used included Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system. In particular, Solr was credited for saving ‘millions of Singapore dollars’ in license fees compared to the previous closed source search system it replaced. Also of interest was Straits Knowledge’s system for capturing the knowledge assets of an organisation with a system built on a graph database, and Haliza Jailani on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

London Lucene/Solr Usergroup – Search Relevancy & Hacking Lucene with Doug Turnbull

Last week Doug Turnbull of US-based Open Source Connections visited the UK and spoke at our Meetup. His first talk was on Search Relevancy, an area that we often deal with at Flax: how to tune a search engine to give results that our clients deem relevant, without affecting the results for other queries. Using a client project as an example, Doug talked about how he created a tool to record relevance judgements for a set of queries (or a ‘case’). The underlying Solr search engine could then be adjusted and the tool used to re-run the queries, showing any change in the position of the scored results. Slides and video of the talk are available – thanks to our hosts SkillsMatter for these.

The tool, Quepid, is a great way to allow non-developers to score search results – in most cases we have seen, if this kind of testing is done at all it is recorded using spreadsheets. The tests then need to be re-run manually and scores updated, which can result in the tuning process taking far too long. This whole area is in need of some rigour and best practice, and to that end Doug is writing a book on Relevant Search which we’re very much looking forward to.

Doug’s second talk was on Hacking Lucene for custom search results, during which he dissected how Lucene queries actually work and how custom scoring algorithms can be used to change search ranking. Although highly technical in parts – and as Doug said, one of the hardest ways to write Lucene code to influence ranking and thus relevance – it was a great window on Lucene’s low level behaviour. Again, slides and video are available.

Thanks to all who came and especially Doug for coming so far to present his talks!


Posted in Technical, events

June 11th, 2015


Going international – open source search in London, Berlin & Singapore

We’re travelling a bit over the next few weeks to visit and speak at various events. This weekend Alan Woodward is at Berlin Buzzwords, a hacker-focused conference with a programme full of search talks. He’s not speaking this year, but if you want to talk about Lucene, Solr or our own Luwak stored search library and the crazy things you can do with it, do buy him a beer!

Next week we’re hosting another London Lucene/Solr User Group Meetup with Doug Turnbull of Open Source Connections. Doug is the author of a forthcoming book on Relevant Search and the creator of Quepid, a tool for gathering relevance judgements for Solr-based search systems and then seeing how these scores change as you tune the Solr installation. Tuning relevance is a very common (and often difficult) task during search projects and can make a significant difference to the user experience (and in particular, for e-commerce can hugely affect your bottom line) – so we’re very much looking forward to Doug’s talk.

The week after I’m in Singapore visiting the Innovations in Knowledge Organisation conference – a new event focusing on knowledge management and search. I’ve been asked to talk about open source search and to keynote the second day of the event and speak on ‘The Future of Search’. Do let me know if you’re attending and would like to meet up.


Posted in events

May 29th, 2015


When bad search hurts: finding that elusive ROI

One thing I’ve noticed from many years attending search conferences is that the return on investment (ROI) in search technology is hard to calculate: this is particularly difficult when considering intranet and/or enterprise search, as users can usually find another way to answer their question. In most cases, some slightly tired numbers are trotted out from years-old studies, based on how much employee time a better search engine might save. I’ve never really believed this to be a sensible metric; however if you’re trying to sell a search solution (especially an overpriced, closed source, magic black box that ‘understands your content’) it may be all you can rely on.

We’ve been working recently with a couple of clients who sell significant volumes of products via their websites: in both cases the current search solution is underperforming. In this situation it’s far easier to justify an investment in better search: if customers can’t find products they simply won’t be buying them. Simple changes to the search algorithms can have huge impacts, costing or saving the company millions in revenue. However, installing a new search engine is just the beginning – it’s vital to consider a solid test strategy. To start with, the new engine should be at least as good as the old one in terms of relevance. The search logs will show what terms your users have typed into the search box, and these can be used to construct a set of queries that can be tested against both engines, with the results being scored by content experts, beta testers and even small groups of real customers. The results of this scoring can be used to inform how the new engine can be tuned – and of course, this should be an ongoing process, as search should never be built as a ‘fire and forget’ project.
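As a hypothetical sketch of the scoring step described above (our illustration, not a tool from either client project), suppose testers have recorded binary relevance judgements for query/document pairs; a simple metric such as precision@k can then be computed for both engines over the same set of logged queries:

```python
def precision_at_k(results, judgements, k=10):
    """Fraction of the top-k results judged relevant for one query.
    `results` is an ordered list of document ids; `judgements` maps
    document id -> True/False as scored by content experts or testers."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if judgements.get(doc, False)) / len(top)

def compare_engines(queries, old_engine, new_engine, judgements, k=10):
    """Average precision@k for two engines over the same logged queries.
    `old_engine`/`new_engine` are callables: query -> ranked doc ids."""
    old = sum(precision_at_k(old_engine(q), judgements[q], k) for q in queries)
    new = sum(precision_at_k(new_engine(q), judgements[q], k) for q in queries)
    n = len(queries)
    return old / n, new / n
```

Running such a comparison regularly, as the new engine is tuned, is what turns relevance testing from a one-off exercise into the ongoing process it needs to be.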

There’s sadly little available on the subject of real-world relevance testing, although there’s a forthcoming book by Doug Turnbull and John Berryman. Based on our work with the clients above, we hope later this year to be able to talk further about relevance testing and tuning – and how to do it right for e-commerce, avoiding significant financial risk.

UPDATE: Doug has been kind enough to give our readers a discount code for his book – “39turnbull” – not sure how long this will last but it gives you 39% off the price so worth a try!


Posted in Uncategorized

April 29th, 2015


Lucene/Solr London Meetup – BioSolr and Query Deep Dive

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Databank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.

Thanks all our speakers, to Barclays for providing the venue and for some very tasty food and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

Elastic London User Group Meetup – scaling with Kafka and Cassandra

The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to 6 datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split brain problem being one condition to be watched for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’) before settling on Cassandra which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .Net compatible clients was key and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be an Elastic(on) event in Europe some time this year (if anyone from the Elastic team is reading this please try and avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.