elasticsearch – Flax

Defining relevance engineering part 4: tools

Charlie Hull — Thu, 15 Nov 2018 14:30:51 +0000

Relevance Engineering is a relatively new concept but companies such as Flax and our partners Open Source Connections have been carrying out relevance engineering for many years. So what is a relevance engineer and what do they do? In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

In my previous installment of this guide I promised to write next about how to deliver the results of a relevance assessment, but I’ve since decided that this blog should instead cover the tools a relevance engineer can use to measure and tune search performance. Of course, some of these might be used to show results to a client as well, so it’s not an entirely different direction!

It’s also important to note that this is a rapidly evolving field and therefore cannot be a definitive list – and I welcome comments with further suggestions.

1. Gathering judgements

There are various ways to measure relevance, and one is to gather judgement data – either explicit (literally asking users to manually rate how relevant a result is) or implicit (using click data as a proxy, assuming that clicking on a result means it is relevant – which isn’t always true, unfortunately). One can build a user interface that lets users rate results (e.g. from Agnes Van Belle’s talk at Haystack Europe, see page 7) which may be available to everyone or just a select group, or one can use a specialised tool like Quepid that provides an alternative UI on top of your search engine. Even Excel or another spreadsheet can be used to record judgements (although this can become unwieldy at scale). For implicit ratings, there are JavaScript libraries such as SearchHub’s search-collector or more complete analytics platforms such as Snowplow which will let you record the events happening on your search pages.
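To make the implicit approach a little more concrete, here is a minimal sketch of turning click events into a rough relevance signal. The event format (query, document id, clicked or not), the function name and the minimum-impressions threshold are all assumptions for illustration, not the output of any particular collector library.

```python
from collections import defaultdict

def implicit_judgements(events, min_impressions=2):
    """Turn (query, doc_id, clicked) events into a click-through rate
    per (query, doc_id) pair. Pairs shown fewer than min_impressions
    times are dropped, as one impression tells us very little."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in events:
        key = (query, doc_id)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    return {
        key: clicked[key] / count
        for key, count in shown.items()
        if count >= min_impressions
    }

# Hypothetical events, as they might be recorded on a search results page
events = [
    ("red shoes", "doc1", True),
    ("red shoes", "doc1", True),
    ("red shoes", "doc2", False),
    ("red shoes", "doc2", True),
    ("red shoes", "doc3", False),
]
ctr = implicit_judgements(events)
```

Click-through rate is only a proxy: results shown higher up the page attract more clicks regardless of relevance, so a figure like this should be treated with the same caution as the click data itself.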

2. Understanding the query landscape

To find out what users are actually searching for and how successful their search journeys are, you will need to look at the log files of the search engine and the hosting platform it runs within. Open source engines such as Solr can provide detailed logs of every query, which will need to be processed into an overall picture. Google Analytics will tell you which Google queries brought users to your site. Some sophisticated analytics & query dashboards are also available – Luigi’s Box is a particularly powerful example for site search. Even a spreadsheet can be useful to graph the distribution of queries by volume, so you can see both the popular queries and those rare queries in the ‘long tail’. On Elasticsearch it’s even possible to submit this log data back into a search index and to display it using a Kibana visualisation.
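As a rough illustration of the ‘distribution of queries by volume’ idea, the sketch below counts queries and splits them into a popular ‘head’ and a ‘long tail’. It assumes the queries have already been extracted from the logs into a list of strings; the function names and the 50% cut-off are hypothetical choices.

```python
from collections import Counter

def query_distribution(queries):
    """Count queries after simple normalisation (trim, lower-case),
    returning (query, count) pairs ordered from most to least frequent."""
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return counts.most_common()

def head_and_tail(ranked, head_share=0.5):
    """Split ranked queries into the 'head' (the most popular queries that
    together account for head_share of all traffic) and the 'long tail'."""
    total = sum(count for _, count in ranked)
    head, running = [], 0
    for query, count in ranked:
        if running >= head_share * total:
            break
        head.append(query)
        running += count
    tail = [query for query, _ in ranked[len(head):]]
    return head, tail

# Hypothetical queries pulled from a search log
log = ["iPhone", "iphone ", "laptop", "iphone", "usb cable",
       "laptop", "hdmi adaptor", "iphone"]
ranked = query_distribution(log)
head, tail = head_and_tail(ranked)
```

In a real log the tail is usually far longer than the head, which is why it is worth looking at both rather than only tuning for the top few queries.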

3. Measurement and metrics

Once you have your data it’s usually necessary to calculate some metrics – overall measurements of how ‘good’ or ‘bad’ relevance is. There’s a long list of metrics commonly used by the Information Retrieval community such as NDCG which show the usefulness, or gain, of a search result based on its position in a list. Tools such as Rated Ranking Evaluator (RRE) can calculate these metrics from supplied judgement lists (RRE can also run a whole test environment, spinning up Solr or Elasticsearch, performing a list of queries and recording and displaying the results).
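For readers who haven’t met NDCG before, here is a minimal sketch of the calculation, assuming each result in a list has already been given a numeric grade (say 0 to 3). Two simplifications are assumptions of this sketch: the gain is the grade itself (some variants use 2^grade - 1), and the ‘ideal’ ordering is built only from the grades in the returned list rather than from every judged document for the query.

```python
import math

def dcg(grades):
    """Discounted cumulative gain: each grade is divided by
    log2(rank + 1), so results lower down the list count for less."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(grades, start=1))

def ndcg(grades, k=None):
    """DCG of the list as returned, divided by the DCG of the same
    grades sorted into the best possible order (1.0 = ideal order)."""
    k = k or len(grades)
    ideal = sorted(grades, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(grades[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg([3, 2, 1, 0])   # best results first
reversed_ = ndcg([0, 1, 2, 3])  # best results last
```

Tools such as RRE calculate this kind of metric for you across many queries; the point of the sketch is only to show why position in the list matters.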

4. Tuning the engine

Next you’ll need a way to adjust the configuration of the engine and/or figure out just why particular results are appearing (or not). These tools are usually specific to the search engine being used: Quepid, for example, works with Solr and Elasticsearch and allows you to change query parameters and observe the effect on relevance scores; with RRE you can control the whole configuration of the Solr or Elasticsearch engine that it can then spin up for you. Commercial search engines will have their own tools for adjusting configuration or you may have to work within an overall content management (e.g. Drupal) or e-commerce system (e.g. Hybris). Some of these latter systems may only give you limited control of the search engine, but could also let you adjust how content is processed and ingested or how synonyms are generated.

For Solr, tools such as the Google Chrome extension Solr Query Debugger can be used and the Solr Admin UI itself allows full control of Solr’s configuration. Solr’s debug query shows hugely detailed information as to why a query returned a result, but tools such as Splainer and Solr Explain are useful to make sense of this.
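As a small example of asking Solr for this debug information, the sketch below builds a select URL with the debugQuery parameter switched on. The host, port and the ‘products’ collection name are assumptions for illustration, and nothing is actually sent over the network here.

```python
from urllib.parse import urlencode

def solr_debug_url(base_url, collection, query, rows=10):
    """Build a Solr select URL that asks for scoring explanations.
    debugQuery=true makes Solr add a debug section to the response."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id,score",       # return only the id and the score
        "debugQuery": "true",
        "wt": "json",
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"

# Hypothetical local Solr instance and collection name
url = solr_debug_url("http://localhost:8983/solr", "products", "title:laptop")
```

The explanation for each document in the response is deeply nested, which is exactly where tools like Splainer help.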

For Elasticsearch, the Kopf plugin was a useful tool, but has now been replaced by Cerebro. Elastic, the commercial company behind Elasticsearch, offer their own tool Marvel on a 30-day free trial, after which you’ll need an Elastic subscription to use it. Marvel is built on the open source Kibana which also includes various developer tools.

If you need to dig (much) deeper into the Lucene indexes underneath Solr and Elasticsearch, the Lucene Index Toolbox (Luke) is available, or Flax’s own Marple index inspector.

As I said at the beginning this is by no means a definitive list – what are your favourite relevance tuning tools? Let me know in the comments!

In the next post I’ll cover how a relevance engineer can develop more powerful and ‘intelligent’ ways to tune search. In the meantime you can read the free Search Insights 2018 report by the Search Network. Of course, feel free to contact us if you need help with relevance engineering.

The post Defining relevance engineering part 4: tools appeared first on Flax.

Lucene Hackdays in London & Montreal

Charlie Hull — Tue, 23 Oct 2018 09:35:13 +0000

We ran a couple of Lucene Hackdays over the last couple of weeks: a chance to get together with other people working on open source search, learn from each other and to try and improve both Lucene and associated software.

Our first Hackday was in London, hosted by Mimecast at their offices near Moorgate. Despite a fire alarm practice (during which we ended up under some flats at the Barbican, whose residents may have been a little surprised at quite how many people ended up milling around under their balconies) we had a busy day – we split into three groups to look at tools for inspecting Lucene indexes, various outstanding bugs and issues with Lucene and Solr and to review a well-known issue where different Solr replicas can provide slightly different result ordering. By 5.30 p.m. when we were scheduled to finish we were still frantically hacking on some last-minute JavaScript to add a feature to our Marple index inspector – luckily a few minutes later to a collective sigh of relief we had it working and we repaired to a local pub for food and drink (kindly sponsored by Elastic).

The next week a number of us were in Montreal for the Activate conference (previously known as Lucene/Solr Revolution but now sprinkled with cutting-edge AI fairy dust!). Our second Hackday was hosted by Netgovern and we worked on various Lucene/Solr issues, some improvements to our Harahachibu proxy (which attempts to block Solr updates when disk space is low) and discussed in depth how to improve the Solr onboarding experience. Pizza (sponsored by One More Cloud) and coffee fuelled the hacking and we also added some new features including a Query Parser for MinHash queries. Many Lucene/Solr committers attended and afterwards we met up for a drink & food nearby (thanks to SearchStax for sponsoring this!) where we were joined by a few others – including Yonik Seeley, creator of Solr.

Next it was time for Activate – of which more later! Thanks to everyone who attended – you can see some notes and links about what we worked on here. Work will be continuing on these issues I’m sure.

The post Lucene Hackdays in London & Montreal appeared first on Flax.

Three weeks of search events this October from Flax

Charlie Hull — Tue, 04 Sep 2018 10:11:56 +0000

Flax has always been very active at conferences and events – we enjoy meeting people to talk about search! With much of our consultancy work being carried out remotely these days, attending events is a great way to catch up in person with our clients, colleagues and peers and to learn from others about what works (and what doesn’t) when building cutting-edge search solutions. I’m thus very glad to announce that we’re running three search events this coming October.

Earlier in the year I attended Haystack in Charlottesville, one of my favourite search conferences ever – and almost immediately began to think about whether we could run a similar event here in Europe. Although we’ve only had a few months I’m very happy to say we’ve managed to pull together a high-quality programme of talks for our first Haystack Europe event, to be held in London on October 2nd. The event is focused on search relevance from both a business and a technical perspective and we have speakers from global retailers as well as specialist consultants and authors. Tickets are already selling well and we have limited space, so I would encourage you to register as soon as you can (Haystack USA sold out even after the capacity was increased). We’re running the event in partnership with Open Source Connections.

The next week we’re running a Lucene Hackday on October 9th as part of our London Lucene/Solr Meetup programme. Building on previous successful events, this is a day of hacking on the Apache Lucene search engine and associated software such as Apache Solr and Elasticsearch. You can read up on what we achieved at our last event a couple of years ago – again, space is limited, so sign up soon to this free event (huge thanks to Mimecast for providing the venue and to Elastic for sponsoring drinks and food for an evening get-together afterwards). Bring a laptop and your ideas (and do comment on the event page if you have any suggestions for what we should work on).

We’ll be flying to Montreal soon afterwards to attend the Activate conference (run by our partners Lucidworks) and while we’re there we’ll host another free Lucene Hackday on October 15th. Again, this would not be possible without sponsorship and so thanks must go to Netgovern, SearchStax and One More Cloud. Remember to tell us your ideas in the comments.

So that’s three weeks of excellent search events – see you there!

The post Three weeks of search events this October from Flax appeared first on Flax.

Lucene Solr London: Search Quality Testing and Search Procurement

Charlie Hull — Fri, 29 Jun 2018 11:09:34 +0000

Mimecast were our kind hosts for the latest London Lucene/Solr Meetup (and even provided goodie bags). It’s worth repeating that we couldn’t run these events without the help of sponsors and hosts and we’re always very grateful (and keep those offers coming!).

First up was Andrea Gazzarini presenting a brand new framework for search quality testing. Designed for offline measurement, Rated Ranking Evaluator is an open source Java library (although it can be used from other languages). It uses a hierarchical model to arrange queries into query groups (all queries in a query group should be producing the same results). Each test can run across a number of search engine configuration versions and outputs results in JSON format – but these can also be translated into Excel spreadsheets, PDFs or sent to a server that provides a live console showing how search quality is affected by a search engine configuration change. Although aimed at Elasticsearch and Solr, the platform is extensible to any underlying search engine. This is a very useful tool for search developers and joins Quepid and SearchHub’s recently released search analytics acquisition library in the ‘toolbox’ for relevance engineers. You can see Andrea’s slides here.

Martin White spoke next on how open source search solutions fare in corporate procurements for enterprise search. This was an engaging talk from Martin, showing the scale of the opportunities for open source platforms with budgets of several million pounds being common for enterprise search projects. However, as he mentioned, it can be very difficult for procurement departments to get information from vendors and ‘the last thing you’ll know about a piece of enterprise software is how much it will cost’. He detailed how open source solutions often compare badly against closed source commercial offerings due to it being hard to see the ‘edges’ – e.g. what custom development will be necessary to fulfil enterprise requirements. Although the opportunities are clear, it seems open source based solutions still have a way to go to compete. You can read more from Martin on this subject in the recent free Search Insights report.

Thanks to Mimecast and both speakers – we’ll be back after the summer with another Meetup!

The post Lucene Solr London: Search Quality Testing and Search Procurement appeared first on Flax.

Defining relevance engineering, part 1: the background

Charlie Hull — Mon, 25 Jun 2018 10:40:12 +0000

Relevance Engineering is a relatively new concept but companies such as Flax and our partners Open Source Connections have been carrying out relevance engineering for many years. So what is a relevance engineer and what do they do? In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

Let’s start by turning the clock back a few years. Ten or fifteen years ago search engines were usually closed source, mysterious black boxes, costing five or six-figure sums for even relatively modest installations (let’s say a couple of million documents – small by today’s standards). Huge amounts of custom code were necessary to integrate them with other systems and projects would take many months to demonstrate even basic search functionality. The trick was to get search working at all, even if the eventual results weren’t very relevant. Sadly even this was sometimes difficult to achieve.

Nowadays, search technology has become highly commoditized and many developers can build a functioning index of several million documents in a couple of days with off-the-shelf, open source, freely available software. Even the commercial search firms are using open source cores – after all, what’s the point of developing them from scratch? Relevance is often ‘good enough’ out of the box for non business-critical applications.

A relevance engineer is required when things get a little more complicated and/or when good search is absolutely critical to your business. If you’re trading online, search can be a major driver of revenue and getting it wrong could cost you millions. If you’re worried about complying with the GDPR, MiFID or other regulations then ‘good enough’ simply isn’t if you want to prevent legal issues. If you’re serious about saving the time and money your employees waste looking for information or improving your business’ ability to thrive in a changing world then you need to do search right.

So what search engine should you choose before you find a relevance engineer to help with it? I’m going to go out on a limb here and say it doesn’t actually matter that much. At Flax we’re proponents of open source engines such as Apache Lucene/Solr and Elasticsearch (which have much to recommend them) but the plain fact is that most search engines are the same under the hood. They all use the same basic principles of information retrieval; they all build indexes of some kind; they all have to analyze the source data and user queries in much the same way (ignore ‘cognitive search’ and other ‘AI’ buzzwords for now, most of this is marketing rather than actual substance). If you’re using Microsoft Sharepoint across your business we’re not going to waste your time trying to convince you to move wholesale to a Linux-based open source alternative.

Any modern search engine should allow you the flexibility to adjust how data is ingested, how it is indexed, how queries are processed and how ranking is done. These are the technical tools that the relevance engineer can use to improve search quality. However, relevance engineering is never simply a technical task – in fact, without a business justification, adjusting these levers may make things worse rather than better.

In the next post I’ll cover how a relevance engineer can engage with a business to discover the why of relevance tuning. In the meantime you can read Doug Turnbull’s chapter in the free Search Insights 2018 report by the Search Network (the rest of the report is also very useful) and you might also be interested in the ‘Think like a relevance engineer’ training he is running soon in the USA. Of course, feel free to contact us for details of similar UK or EU-based training or if you need help with relevance engineering.

The post Defining relevance engineering, part 1: the background appeared first on Flax.

Catching MICES – a focus on e-commerce search

Charlie Hull — Tue, 19 Jun 2018 14:15:55 +0000

The second event I attended in Berlin last week was the Mix Camp on e-commerce search (MICES), a small and focused event now in its second year and kindly hosted by Mytoys at their offices. Slides for the talks are available here and I hope videos will appear soon.

The first talk was given by Karen Renshaw of Grainger, who Flax worked with at RS Components (she also wrote a great series of blog posts for us on improving relevancy). Karen’s talk drew on her long experience of managing search teams from a business standpoint – this wasn’t about technology but about combining processes, targets and objectives to improve search quality. She showed how to get started by examining customer feedback, known issues, competitors and benchmarks; how to understand and categorise query types; how to create a test plan within a cross-functional team and how to plan for incremental change. Testing was covered including how to score search quality and how to examine the impact of search changes, with the message that “all aspects of search should work together to help customers through their journey”. She concluded with the clear point that there are no silver bullets, and that expectations must be managed during an ongoing, iterative process of improvement. This was a talk to set the scene for the day, containing lessons for every search manager (and a good few search technologists who often ignore the business factors!).

Next up were Christine Bellstedt & Jens Kürsten from Otto, Germany’s second biggest online retailer with over 850,000 search queries a day. Their talk focused on bringing together the user and business perspectives to create a search quality testing cycle. They quoted Peter Freis’ graphic from his excellent talk at Haystack to illustrate how they created an offline system for experimentation with new ranking methods based on linear combinations of relevance scores from Solr, business performance indicators and product availability. They described how they learnt how hard it can be to select ranking features, create test query sets with suitable coverage and select appropriate metrics to measure. They also talked about how the experimentation cycle can be used to select ‘challengers’ to the current ‘champion’ ranking method, which can then be A/B tested online.
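To show what a ‘linear combination’ ranking might look like in its simplest form, here is a sketch that is not Otto’s actual system: the feature names, weights and product data are all made up for illustration, and each feature is assumed to be normalised to the 0 to 1 range already.

```python
def combined_score(product, weights):
    """Weighted sum of a product's features (a linear combination)."""
    return sum(weights[name] * product[name] for name in weights)

def rank(products, weights):
    """Order products by combined score, best first."""
    return sorted(products,
                  key=lambda p: combined_score(p, weights),
                  reverse=True)

# Hypothetical products with normalised features
products = [
    {"id": "A", "relevance": 0.9, "margin": 0.2, "in_stock": 0.0},
    {"id": "B", "relevance": 0.7, "margin": 0.5, "in_stock": 1.0},
    {"id": "C", "relevance": 0.8, "margin": 0.1, "in_stock": 1.0},
]

# The 'champion' ranks on text relevance alone; the 'challenger'
# also rewards business performance and availability.
champion = {"relevance": 1.0, "margin": 0.0, "in_stock": 0.0}
challenger = {"relevance": 0.6, "margin": 0.2, "in_stock": 0.2}

champion_order = [p["id"] for p in rank(products, champion)]
challenger_order = [p["id"] for p in rank(products, challenger)]
```

With these invented numbers the out-of-stock product A drops from first to last under the challenger weights. Offline, the two orderings would be compared using judgement data and metrics before the challenger is considered for an online A/B test.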

Pavel Penchev of SearchHub was next and presented their new search event collector library – a JavaScript SDK which can be used to collect all kinds of metrics around user behaviour and submit them directly to a storage or analytics system (which could even be a search engine itself – e.g. Elasticsearch/Kibana). This is a very welcome development – only a couple of months ago at Haystack I heard several people bemoaning the lack of open source tools for collecting search analytics. We’ll certainly be trying out this open source library.

Andreas Brückner of e-commerce search vendor Fredhopper talked about the best way to optimise search quality in a business context. His ten headings included “build a dedicated search team” (although 14% of Fredhopper’s own customers have no dedicated search staff), “build a measurement framework” (how else can you see how revenue might be improved?) and “start with user needs, not features”. Much to agree with in this talk from someone with long experience of the sector from a vendor viewpoint.

Johannes Peter of MediaMarktSaturn described an implementation of a ‘semantic’ search platform which attempts to understand queries such as ‘MyMobile 7 without contract’, recognising this is a combination of a product name, a Boolean operator and an attribute. He described how an ontology (perhaps showing a family of available products and their variants) can be used in combination with various rules to create a more focused query e.g. ‘title:("MyMobile7") AND NOT (flag:contract)’. He also mentioned machine learning and term co-occurrence as useful methods but stressed that these experimental techniques should be treated with caution and one should ‘fail early’ if they are not producing useful results.
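The rewriting step can be sketched in a few lines. This is a toy illustration and not the platform described in the talk: the three lookup tables stand in for an ontology and rule set, and their contents are hypothetical.

```python
# Hypothetical stand-ins for an ontology and rule set
PRODUCTS = {"mymobile 7": "MyMobile7"}      # surface form -> canonical name
NEGATIONS = {"without": "AND NOT"}          # word -> Boolean operator
ATTRIBUTES = {"contract": "flag:contract"}  # word -> field:value clause

def rewrite(query):
    """Rewrite '<product> <negation> <attribute>' into a fielded query.
    Anything that doesn't match the known patterns is returned unchanged."""
    text = query.lower().strip()
    for surface, canonical in PRODUCTS.items():
        if not text.startswith(surface):
            continue
        title_clause = f'title:("{canonical}")'
        rest = text[len(surface):].split()
        if not rest:
            return title_clause
        if len(rest) == 2 and rest[0] in NEGATIONS and rest[1] in ATTRIBUTES:
            return f"{title_clause} {NEGATIONS[rest[0]]} ({ATTRIBUTES[rest[1]]})"
    return query

focused = rewrite("MyMobile 7 without contract")
untouched = rewrite("cheap headphones")
```

Falling back to the original query when nothing matches is a deliberate choice here: a rewrite that guesses wrongly can hurt relevance more than no rewrite at all.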

Ashraf Aaref & Felipe Besson described their journey using Learning to Rank to improve search at GetYourGuide, a marketplace for activities (e.g. tours and holidays). Using Elasticsearch and the LtR plugin recently released by our partners Open Source Connections they tried to improve the results for their ‘location pages’ (e.g. for Paris) but their first iteration actually gave worse results than the current system and was thus rejected by their QA process. They hope to repeat the process using what they have learned about how difficult it is to create good judgement data. This isn’t the first talk I’ve seen that honestly admits that ML approaches to improving search aren’t a magic silver bullet and the work itself is difficult and requires significant investment.

Duncan Blythe of Zalando gave what was the most forward-looking talk of the event, showing a pure Deep Learning approach to matching search queries to results – no query parsing, language analysis, ranking or anything, just a system that tries to learn what queries match which results for a product search. This reminded me of Doug & Tommaso’s talk at Buzzwords a couple of days before, using neural networks to learn the journey between query and document. Duncan did admit that this technique is computationally expensive and in no way ready for production, but it was exciting to hear about such cutting-edge (and well funded) research.

Doug Turnbull was the last speaker with a call to arms for more open source tooling, datasets and relevance judgements to be made available so we can all build better search technology. He gave a similar talk to keynote the Haystack event two months ago and you won’t be surprised to hear that I completely agree with his viewpoint – we all benefit from sharing information.

Unfortunately I had to leave MICES at this point and missed the more informal ‘bar camp’ event to follow, but I would like to thank all the hosts and organisers especially René Kriegler for such an interesting day. There seems to be a great community forming around e-commerce search which is highly encouraging – after all, this is one of the few sectors where one can draw a clear line between improving relevance and delivering more revenue.

The post Catching MICES – a focus on e-commerce search appeared first on Flax.

When even the commercial vendors are using it, has open source search won?

Charlie Hull — Thu, 15 Mar 2018 12:03:32 +0000

There have been some interesting announcements recently which may point to an increasing realisation amongst commercial search firms that an open source model is an essential advantage in today’s search market. Coveo have announced that their enterprise search engine can run on an Elasticsearch core, an interesting move for a previously decidedly closed source company. BA Insight, who have previously provided extensions and enhancements for Microsoft’s decidedly closed-source Sharepoint search facility, have been offering Elasticsearch as a core search engine for quite a while. It is also an open secret that some other commercial search firms (such as Attivio) use Apache Lucene as a core technology.

The commercial search firms will have noticed that Lucidworks (who employ a large proportion of Lucene/Solr committers) have announced Lucidworks Fusion 4, which can be used for site and enterprise search. Elastic, the company behind Elasticsearch, recently acquired Swiftype and have repositioned it as a packaged site search engine (with an enterprise search version in beta and rumoured to appear later this year). Both Lucidworks and Elastic are thus attempting to capture a larger segment of the search market, using their dominance and expertise in the open source world. Note however that all these products are ‘open core’ rather than ‘open source’ (despite Elastic’s attempts to pretend otherwise) – which is not very different from Coveo or BA Insight’s approach – so the distance between the traditionally separate ‘open source’ and ‘closed source’ search vendors is now closing.

The question for any search vendor should be whether there is any point developing and maintaining a closed source search engine core, when Lucene derivatives such as Solr and Elasticsearch are so well established. The race between closed and open source is perhaps over.

Here at Flax we’ve been building open source search engines since 2001 and we’re independent of any vendor – so if you need help with your search project, do let us know.

Note: Enterprise Search is usually defined as a search engine working behind a corporate firewall, indexing different content sources such as flat files, databases and intranets. Site Search is usually visible to non-employees and only indexes websites. However, when site search includes an intranet the boundary becomes a little fuzzy – is this lightweight enterprise search? In most cases this doesn’t hugely matter – the underlying search engine core will be the same, it’s simply a difference in where source data comes from and how it is presented to users. However, these two options are often presented as different products by vendors.

UPDATE: A few days after I posted this blog, commercial vendor Attivio released SUIT, an open source user interface library that can run on their own engine, Elasticsearch or Solr. It seems the trend continues.

The post When even the commercial vendors are using it, has open source search won? appeared first on Flax.

No, Elastic X-Pack is not going to be open source – according to Elastic themselves

Charlie Hull — Fri, 02 Mar 2018 14:47:49 +0000

Elastic are the company founded by the creator of Elasticsearch, Shay Banon. At this time of year they have their annual Elasticon conference in San Francisco and as you might expect a lot of announcements are made during the week of the conference. The major ones to appear this time are that Swiftype, which Elastic acquired last year, has reappeared as Elastic Site Search and that Elastic are opening the code for their commercial X-Pack features.

Shay Banon is always keen to relate how Elasticsearch started as open source and will remain true to that heritage, which is always encouraging to hear. However it’s unfortunate to note that the announcement has been reported by many as ‘X-Pack is now open source’ – and the truth is a little more complicated than that.

Firstly, let’s look at the Elasticsearch core code itself. Yes, this is open source under the Apache 2 license, so you can download it, modify it, fork it, even incorporate it into your own products if you like. However most people would like to keep up with the latest and greatest developments so they’ll want to stick with the ‘official’ stream of updates, and what goes into this is entirely up to Elastic employees as they are the only ones allowed to commit to the codebase. Some measure of control of an open source project is essential of course, but this is certainly not ‘open development’ even though it is ‘open source’. Compare this to Apache Lucene/Solr, where those that are allowed to commit code to the official releases are from a wide variety of organisations (and elected as committers by merit, by a group of other longstanding committers). This distinction is important but makes little difference to most adopters.

Elastic have also for some years produced commercial, closed-source software in addition to Elasticsearch – which they call the X-Pack. To use this code you have to license it, although for some of the features the license is free. The announcement this week is that the source code for the X-Pack will be open and available to read under an Elastic license (which hasn’t yet been made available). As Doug Turnbull of our partner company Open Source Connections writes, “Be careful: The ‘open source’ Elastic XPack is very different than what most think of as ‘open source’”. To use in production some of the features you have the source code for, you will still need to pay Elastic for a license. If you spot a problem in the source code and submit a patch, you still may end up paying Elastic for the privilege of running it. This is an ‘open core’ model, where the further you move away from the core, the less open and free things become – and as Shay writes this is a key part of their business model.

The final word on this comes from Elastic’s own FAQ on the X-Pack: “Open Source licensing maintains a strict definition from the Open Source Initiative (OSI). As of 6.3, the X-Pack code will be opened under an Elastic EULA. However, it will not be ‘Open Source’ as it will not be covered by an OSI approved license.” It’s a shame that this hasn’t been accurately reported.

If you are considering open source search software for your project, contact us for independent and honest advice. We’ve been building open source search applications since 2001.

The post No, Elastic X-Pack is not going to be open source – according to Elastic themselves appeared first on Flax.

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch

Charlie Hull — Thu, 01 Feb 2018 10:13:56 +0000

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch from Charlie Hull

The post Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch appeared first on Flax.

Inspiring students to work in Open Source Search

Charlie Hull — Wed, 31 Jan 2018 13:54:16 +0000

I’ve recently been asked to join the Industrial Advisory Board for the School of Computer Science and Electronic Engineering at the University of Essex and will be talking to students there on Monday 5th February, repeating a similar talk I did last year. The subject is ‘Working in Open Source Search’ and I’ll describe how we founded Flax back in 2001, how we’ve built, tuned and implemented open source search engines and some of the client projects we’ve worked on. It’s been a fascinating journey.

My main motivation for talking at Essex (and at City University a couple of weeks later) will be to inspire students to consider working in the world of open source software and more specifically the commercial applications of what academics call information retrieval – search engines. It’s an interesting field to work in – we have clients in a huge variety of sectors including e-commerce, law, publishing and government; we deal with both small startups and multinational businesses and help build systems indexing a few thousand to several billion items. It’s constantly changing as new requirements, ideas and innovations appear. It’s taken our staff around the world (Singapore, Malaysia, the USA, Denmark as a small sample from the last couple of years) and led to us gaining a global reputation and becoming part of a select group of independent search specialists. From being somewhat of a curiosity when we started, open source search engines have now gained huge acceptance and have changed the search market beyond recognition – no longer can vendors charge six or seven figures for mysterious black boxes (and more to make them actually do something useful).

However our sector needs more people – not just developers, but business-focused search managers who understand how to build search engines that truly deliver value to employees and customers. As I’ll say to the students next week there’s a skill shortage, plainly illustrated by the plaintive slide that ends nearly all search conference and Meetup presentations – “We’re Hiring!”. Time to learn to code, download Lucene/Solr or Elasticsearch, try out the examples, read our book and look forward to a great career in search!

The post Inspiring students to work in Open Source Search appeared first on Flax.