Technical – Flax | The Open Source Search Specialists | http://www.flax.co.uk

Defining relevance engineering part 4: tools (15 Nov 2018)
http://www.flax.co.uk/blog/2018/11/15/defining-relevance-engineering-part-4-tools/

Relevance Engineering is a relatively new concept but companies such as Flax and our partners Open Source Connections have been carrying out relevance engineering for many years. So what is a relevance engineer and what do they do? In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

In my previous installment of this guide I promised to write next about how to deliver the results of a relevance assessment, but I’ve since decided that this blog should instead cover the tools a relevance engineer can use to measure and tune search performance. Of course, some of these might be used to show results to a client as well, so it’s not an entirely different direction!

It’s also important to note that this is a rapidly evolving field, so this cannot be a definitive list – and I welcome comments with further suggestions.

1. Gathering judgements

There are various ways to measure relevance, and one is to gather judgement data – either explicit (literally asking users to manually rate how relevant a result is) or implicit (using click data as a proxy, assuming that clicking on a result means it is relevant – which isn’t always true, unfortunately). One can build a user interface that lets users rate results (e.g. from Agnes Van Belle’s talk at Haystack Europe, see page 7), which may be available to everyone or just a select group, or one can use a specialised tool like Quepid that provides an alternative UI on top of your search engine. Even Excel or another spreadsheet can be used to record judgements (although this can become unwieldy at scale). For implicit ratings, there are Javascript libraries such as SearchHub’s search-collector, or more complete analytics platforms such as Snowplow, which will let you record the events happening on your search pages.
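For implicit judgements, the raw click events must eventually be distilled into graded judgements. As a rough illustration (not taken from any of the tools above), click-through rate per query/document pair can be scaled to an integer grade – ignoring, for simplicity, position bias and accidental clicks:

```python
from collections import defaultdict

def judgments_from_clicks(events, max_grade=3):
    """Derive graded relevance judgements from (query, doc_id, clicked) events.

    Click-through rate is used as a crude proxy for relevance: a document
    clicked most of the times it was shown gets a high grade. This is a
    simplification -- position bias and 'clicked but useless' results
    are ignored.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc_id, was_clicked in events:
        key = (query, doc_id)
        shown[key] += 1
        if was_clicked:
            clicked[key] += 1
    # Scale CTR (0..1) to an integer grade (0..max_grade)
    return {key: round(max_grade * clicked[key] / shown[key]) for key in shown}

events = [
    ("cheap laptop", "doc1", True),
    ("cheap laptop", "doc1", True),
    ("cheap laptop", "doc2", False),
    ("cheap laptop", "doc2", True),
]
print(judgments_from_clicks(events))
```

Real click models get considerably more sophisticated than this, but the principle – turning noisy behavioural data into judgement lists – is the same.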

2. Understanding the query landscape

To find out what users are actually searching for and how successful their search journeys are, you will need to look at the log files of the search engine and the hosting platform it runs within. Open source engines such as Solr can provide detailed logs of every query, which will need to be processed into an overall picture. Google Analytics will tell you which Google queries brought users to your site. Some sophisticated analytics & query dashboards are also available – Luigi’s Box is a particularly powerful example for site search. Even a spreadsheet can be used to graph the distribution of queries by volume, so you can see both the popular queries and the rare queries in the ‘long tail’. With Elasticsearch it’s even possible to submit this log data back into a search index and display it using a Kibana visualisation.
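To illustrate the kind of analysis meant here, a few lines of Python over a query log are enough to see the head and the long tail (the log lines below are invented for the example):

```python
from collections import Counter

# Hypothetical query log: one query per line, e.g. extracted from Solr logs.
log_lines = [
    "red shoes", "red shoes", "red shoes", "iphone case",
    "iphone case", "garden gnome", "left-handed mug",
]

counts = Counter(log_lines)
total = sum(counts.values())
for rank, (query, n) in enumerate(counts.most_common(), start=1):
    print(f"{rank:>3}  {n:>5}  {n/total:6.1%}  {query}")

# Queries seen only once -- the 'long tail'
tail = [q for q, n in counts.items() if n == 1]
print(f"{len(tail)} of {len(counts)} distinct queries occur only once")
```

On a real site the tail is usually far longer than the head, which is exactly why eyeballing only the top queries gives a misleading picture of relevance.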

3. Measurement and metrics

Once you have your data it’s usually necessary to calculate some metrics – overall measurements of how ‘good’ or ‘bad’ relevance is. There’s a long list of metrics commonly used by the Information Retrieval community, such as NDCG, which show the usefulness, or gain, of a search result based on its position in a list. Tools such as Rated Ranking Evaluator (RRE) can calculate these metrics from supplied judgement lists (RRE can also run a whole test environment, spinning up Solr or Elasticsearch, performing a list of queries and recording and displaying the results).
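As a concrete example, NDCG can be computed in a few lines. Note that several variants of the gain formula exist (some use 2^grade − 1); this sketch uses the simple linear-gain form, discounting each grade by the log2 of its position:

```python
import math

def dcg(grades):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades, k=None):
    """DCG normalised by the ideal (sorted) ordering, so 1.0 is perfect."""
    grades = grades[:k] if k else grades
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal else 0.0

# Graded judgements (3 = highly relevant) in the order the engine returned them
print(ndcg([1, 3, 2, 0]))   # imperfect ordering -> below 1.0
print(ndcg([3, 2, 1, 0]))   # ideal ordering -> exactly 1.0
```

Tools like RRE implement these calculations for you, but it is worth understanding what the numbers mean before trusting a dashboard.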

4. Tuning the engine

Next you’ll need a way to adjust the configuration of the engine and/or figure out just why particular results are appearing (or not). These tools are usually specific to the search engine being used: Quepid, for example, works with Solr and Elasticsearch and allows you to change query parameters and observe the effect on relevance scores; with RRE you can control the whole configuration of the Solr or Elasticsearch engine that it then spins up for you. Commercial search engines will have their own tools for adjusting configuration, or you may have to work within an overall content management system (e.g. Drupal) or e-commerce system (e.g. Hybris). Some of these latter systems may only give you limited control of the search engine, but could also let you adjust how content is processed and ingested or how synonyms are generated.

For Solr, tools such as the Google Chrome extension Solr Query Debugger can be used and the Solr Admin UI itself allows full control of Solr’s configuration. Solr’s debug query shows hugely detailed information as to why a query returned a result, but tools such as Splainer and Solr Explain are useful to make sense of this.
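Solr’s debug output is requested with the standard debugQuery parameter; a minimal sketch of constructing such a request (assuming a local Solr instance with a hypothetical ‘products’ core):

```python
from urllib.parse import urlencode

# Hypothetical local Solr core; adjust the host and core name for your setup.
base = "http://localhost:8983/solr/products/select"
params = {
    "q": "title:laptop",
    "debugQuery": "true",   # ask Solr to explain the scoring of each hit
    "wt": "json",
}
url = f"{base}?{urlencode(params)}"
print(url)
# Fetching this URL (e.g. with urllib.request.urlopen) returns a 'debug'
# section whose 'explain' entries break each document's score into its
# components -- the raw material that tools like Splainer make readable.
```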

For Elasticsearch, the Kopf plugin was a useful tool, but has now been replaced by Cerebro. Elastic, the commercial company behind Elasticsearch, offers its own tool Marvel on a 30-day free trial, after which you’ll need an Elastic subscription to use it. Marvel is built on the open source Kibana, which also includes various developer tools.

If you need to dig (much) deeper into the Lucene indexes underneath Solr and Elasticsearch, the Lucene Index Toolbox (Luke) is available, or Flax’s own Marple index inspector.

 

As I said at the beginning this is by no means a definitive list – what are your favourite relevance tuning tools? Let me know in the comments!

In the next post I’ll cover how a relevance engineer can develop more powerful and ‘intelligent’ ways to tune search. In the meantime you can read the free Search Insights 2018 report by the Search Network. Of course, feel free to contact us if you need help with relevance engineering.

Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018 (18 Jun 2018)
http://www.flax.co.uk/blog/2018/06/18/highlights-of-search-store-scale-stream-berlin-buzzwords-2018/

I spent last week in a sunny Berlin for the Berlin Buzzwords event (and subsequently MICES 2018, of which more later). This was my first visit to Buzzwords which was held in an arts & culture complex in an old brewery north of the city centre. The event was larger than I was expecting at around 550 people with three main tracks of talks. Although due to some external meetings I didn’t attend as many talks as I would have liked, here are a few highlights. Many of the talks have slides provided and some are now also available on the Buzzwords Youtube channel.

Giovanni Fernandez-Kincade talked about query understanding to improve both recall and precision for searches. He made the point that users and documents often speak very different languages, which can lead to a lack of confidence in the search engine. Various techniques are available to attempt to translate the user’s intention into a suitable query, and these can be placed on a spectrum from human-powered (e.g. creating an exception list to prevent stemming of proper nouns) through some degree of automation (e.g. harvesting data to build lists of synonyms) to full automation (machine learning of how queries map to documents). Obviously these also fit on other scales, from labour-intensive to hands-off and easy to hard in terms of the technology skills required. This talk gave a solid base understanding of the techniques available.

I dropped in on Suneel Marthi’s talk on detecting tulip fields from satellite images, which was fascinating although outside my usual area of search engine technology. I then heard Nick Burch describe the many ways that text extraction powered by Apache Tika can crash your JVM or even your entire cluster (potentially expensive in an elastically-scaling situation as more resources are automatically allocated!). He recommended that one should expect failure and plan accordingly, ship logs somewhere central for analysis, and never run Tika inside your Solr instance itself in a production system (a recommendation that has finally made it to the Solr Wiki). Doug Turnbull and Tommaso Teofili then spoke on The Neural Search Frontier, a wide-ranging and in places somewhat speculative discussion of techniques to improve ranking using word embeddings described by multidimensional vectors. This approach combined traditional IR techniques with neural models to learn whether a document is relevant to a query. One fascinating idea was the use of recurrent neural networks, much used in translation applications, to ‘translate’ a document to a predicted query. As with most of Doug’s talks this gave us a lot to think about, but he finished with a plea for better native vector support in Lucene-based search engines.

The next talk I heard was from Varun Thacker on Solr autoscaling, which I know is a particular concern of some of our clients as their data volumes grow. These new features in Solr version 7 allow policies and preferences to be set up to govern autoscaling behaviour, where shards may be moved and new cores created automatically based on metrics such as disk space or queries-per-second. One interesting line of questioning from the audience was how to prevent replicas from ‘ping-ponging’ between hosts – e.g. moving from a node with low disk space to one with more disk space, but then causing a reduction in disk space on the target node, leading to another move. Usefully the autoscaling system can be set to compute a list of operations but leave execution to a human operator, which may help prevent this problem.

The next day I attended Tomás Fernández Löbbe’s talk on new replica types in Solr 7, which talked about the advantages of the ‘Master/Slave’ model for search cluster design as opposed to the standard SolrCloud ‘every node does everything’ model. The new replica types PULL and TLOG allow one to build a master/slave setup in SolrCloud, separating responsibility for indexing and searching and even choosing which type of replica to use in queries. I also heard Houston Putman talk about data analytics with Solr, describing how built-in Solr functions can carry out the type of analytics previously only possible with Apache Spark or Hadoop and avoiding the extra cost of shipping data out of Solr. Unfortunately that was the end of my conference due to some other commitments but it was great to catch up with various search people from Europe and further abroad and to enjoy what was a well-organised and interesting event.

London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco (8 Feb 2018)
http://www.flax.co.uk/blog/2018/02/08/london-lucene-solr-meetup-java-9-1-beeelion-documents-alfresco/

This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you!

Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how previous Java 7 releases had broken Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and Lucene PMC worked closely together to improve both communications and testing – with regular builds of Java 8 (using Jenkins) being released to test with Lucene. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Java’s MMapDirectory feature should be used with Lucene on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 that he mentioned were lower garbage collection times (with the new G1GC collector), more security features and in some cases better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.

Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obvious Austin Powers references, of course! He described the architecture Alfresco use for search (a recent blog also shows this – interestingly, although Solr is used, Zookeeper is not – Alfresco uses its own method to handle many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure – this allowed queries to be built to hit particular areas of the data. 5,000 users were set up, with around 500 concurrent users assumed. The test system managed to index the content in around 5 days at a speed of around 1,000 documents a second, which is impressive.

Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch (1 Feb 2018)
http://www.flax.co.uk/blog/2018/02/01/finding-bad-actor-custom-scoring-forensic-name-matching-elasticsearch/

Slides: Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch, from Charlie Hull.

A search-based suggester for Elasticsearch with security filters (16 Nov 2017)
http://www.flax.co.uk/blog/2017/11/16/search-based-suggester-elasticsearch-security-filters/

Both Solr and Elasticsearch include suggester components, which can be used to provide search engine users with suggested completions of queries as they type:

Query autocomplete has become an expected part of the search experience. Its benefits to the user include less typing, speed, spelling correction, and cognitive assistance.

A challenge we have encountered with a few customers is autocomplete for search applications which include user-based access control (i.e. certain documents or classes of document are hidden from certain users or classes of user). In general, it is desirable not to suggest query completions to users which only match documents they do not have access to. For one thing, if the system suggests a query which then returns no results, it confounds the user’s expectation and makes it look like the system is in error. For another, suggestions may “leak” information from the system that the administrators would rather remain hidden (e.g. an intranet user could type “dev” into a search box and get “developer redundancies” as a suggestion.)

Access control logic is often implemented as a Boolean filter query. Although both the Solr and Elasticsearch suggesters have simple “context” filtering, they do not allow arbitrary Boolean filters. This is because the suggesters are not implemented as search components, for reasons of performance.

To be useful, suggesters must be fast, they must provide suggestions which make intuitive sense to the user and which, if followed, lead to search results, and they must be reasonably comprehensive (they should take account of all the content which the user potentially has access to.) For these reasons, it is impractical in most cases to obtain suggestions directly from the main index using a search-based method.

However, an alternative is to create an auxiliary index consisting of suggestion phrases, and retrieve suggestions using normal queries. The source of the suggestion index can be anything you like: hand-curated suggestions and logged user queries are two possibilities.

To demonstrate this I have written a small proof-of-concept system for a search-based suggester where the suggestions are generated directly from the main documents. Since any access control metadata is also available from the documents, we can use it to exclude suggestions based on the current user. A document in the suggester index looks something like this:

suggestion: "secret report"
freq: 16
meta:
  - include_groups: [ "directors" ]
    exclude_people: [ "Bob", "Lauren" ]
  - include_groups: [ "financial", "IT" ]
    exclude_people: [ "Max" ]

In this case, the phrase “secret report” has been extracted from one or more documents which are visible to the group “directors” (excluding Bob and Lauren) and one or more documents visible to groups “financial” and “IT” (excluding Max.) Thus, “secret report” can be suggested only to those people who have access to the source documents (if filtering is included in the suggestion query).
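A suggestion query against such an index might look like the following sketch – a hypothetical Elasticsearch request body (not the actual proof-of-concept code), assuming meta is mapped as a nested field, which matches a typed prefix and filters on the current user’s identity and groups:

```python
def suggestion_query(prefix, user, groups):
    """Build an Elasticsearch query returning only suggestions the user
    is allowed to see. Field names (suggestion, freq, meta.include_groups,
    meta.exclude_people) follow the example document above.
    """
    return {
        "size": 10,
        "sort": [{"freq": "desc"}],  # most frequent phrases first
        "query": {
            "bool": {
                "must": {
                    "match_phrase_prefix": {"suggestion": prefix}
                },
                "filter": {
                    "nested": {
                        "path": "meta",
                        "query": {
                            "bool": {
                                "must": [
                                    {"terms": {"meta.include_groups": groups}}
                                ],
                                "must_not": [
                                    {"term": {"meta.exclude_people": user}}
                                ],
                            }
                        },
                    }
                },
            }
        },
    }

q = suggestion_query("secr", user="Bob", groups=["directors"])
```

Because this is an ordinary Boolean query, any access-control logic expressible as a filter can be applied – exactly what the built-in suggesters cannot do.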

The proof of concept uses Elasticsearch, and includes Python code to create the main and the suggestion indexes, and a script to demonstrate filtered suggesting. The repository is here.

If you would like Flax to help build suggesters for your search application, do get in touch!

Worth the wait – Apache Kafka hits 1.0 release (2 Nov 2017)
http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/

We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple but powerful concepts – everything it deals with is a stream of data (like a messaging system), streams can be combined for processing and stored reliably in a highly fault-tolerant way. It’s also massively scalable.

For search applications, Kafka is a great choice for the ‘wiring’ between source data (databases, crawlers, flat files, feeds) and the search index and other parts of the system. We’ve used other message passing systems (like RabbitMQ) in projects before, but none have the simplicity and power of Kafka. Combine the search index with analysis and visualisation tools such as Kibana and you can build scalable, real-time systems for ingesting, storing, searching and analysing huge volumes of data – for example, we’ve already done this for clients in the financial sector wanting to monitor log data using open-source technology, rather than commercial tools such as Splunk.
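That ‘wiring’ role usually boils down to draining a consumer into a bulk-indexing call in batches. A minimal, engine-agnostic sketch – the consumer and the bulk call here are deliberately stand-ins (any iterable and any callable), so the example stays runnable without a Kafka cluster:

```python
def pipe(messages, index_batch, batch_size=500):
    """Drain an iterable of messages (e.g. a Kafka consumer) into a search
    index in batches. 'index_batch' is whatever bulk-indexing call your
    engine provides (e.g. the Elasticsearch bulk helper); batching keeps
    indexing throughput high. Returns the number of messages indexed.
    """
    batch = []
    indexed = 0
    for msg in messages:
        batch.append(msg)
        if len(batch) >= batch_size:
            index_batch(batch)
            indexed += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        index_batch(batch)
        indexed += len(batch)
    return indexed

# Stand-ins for a real consumer and a real bulk-index call:
received = []
n = pipe((f"doc-{i}" for i in range(1203)), received.append, batch_size=500)
print(n, len(received))
```

In a real deployment the stand-ins would be replaced by a Kafka consumer and your engine’s bulk API, and Kafka’s offset tracking gives you replay if indexing fails part-way.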

The development of Kafka has been masterminded by our partners Confluent, and it’s a testament to this careful management that the milestone 1.0 version has only just appeared. This doesn’t mean that previous versions weren’t production ready – far from it – but it’s a sign that Kafka has now matured to be a truly enterprise-scale project. Congratulations to all the Kafka team for this great achievement.

We look forward to working more with this great software – and if you need help with your Kafka project do get in touch!

Better performance with the Logstash DNS filter (17 Aug 2017)
http://www.flax.co.uk/blog/2017/08/17/better-performance-logstash-dns-filter/

We’ve been working on a project for a customer which uses Logstash to read messages from Kafka and write them to Elasticsearch. It also parses the messages into fields, and depending on the content type does DNS lookups (both forward and reverse.)

While performance testing I noticed that adding caching to the Logstash DNS filter actually reduced performance, contrary to expectations. With four filter worker threads, and the following configuration:

dns { 
  resolve => [ "Source_IP" ] 
  action => "replace" 
  hit_cache_size => 8000 
  hit_cache_ttl => 300 
  failed_cache_size => 1000 
  failed_cache_ttl => 10
}

the maximum throughput was only 600 messages/s, as opposed to 1000 messages/s with no caching (4000/s with no DNS lookup at all).

This was very odd, so I looked at the source code. Here is the DNS lookup when a cache is configured:

address = @hitcache.getset(raw) { retriable_getaddress(raw) }

This executes retriable_getaddress(raw) inside the getset() cache method, which is synchronised. Therefore, concurrent DNS lookups are impossible when a cache is used.

To see if this was the problem, I created a fork of the dns filter which does not synchronise the retriable_getaddress() call.

 address = @hit_cache[raw]
 if address.nil?
   address = retriable_getaddress(raw)
   unless address.nil?
     @hit_cache[raw] = address
   end
 end

Tests on the same data revealed a throughput of nearly 2000 messages/s with four worker threads (and 2600 with eight threads), which is a significant improvement.

This filter has the disadvantage that it might redundantly look up the same address multiple times, if the same domain name/IP address turns up in several worker threads simultaneously (but the risk of this is probably pretty low, depending on the input data, and in any case it’s harmless.)

I have released a gem of the plugin if you want to try it. Comments appreciated.

Elasticsearch, Kibana and duplicate keys in JSON (3 Aug 2017)
http://www.flax.co.uk/blog/2017/08/03/inconsistent-json-semantics-headache/

JSON has been the lingua franca of data exchange for many years. It’s human-readable, lightweight and widely supported. However, the JSON spec does not define what parsers should do when they encounter a duplicate key in an object, e.g.:

{
  "foo": "spam",
  "foo": "eggs",
  ...
}

Implementations are free to interpret this how they like. When different systems have different interpretations this can cause problems.
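Python’s own json module is one example of this freedom: the last duplicate key silently wins, and no warning is raised. It does, however, let you intercept the raw key/value pairs and reject duplicates yourself:

```python
import json

raw = '{"draft": false, "draft": true}'

# Like many parsers, json.loads silently keeps only the LAST value:
print(json.loads(raw))  # {'draft': True}

# object_pairs_hook exposes every key/value pair, so duplicates can be caught:
def reject_duplicates(pairs):
    keys = [k for k, _ in pairs]
    if len(keys) != len(set(keys)):
        dupes = sorted({k for k in keys if keys.count(k) > 1})
        raise ValueError(f"duplicate keys: {dupes}")
    return dict(pairs)

try:
    json.loads(raw, object_pairs_hook=reject_duplicates)
except ValueError as e:
    print(e)
```

A hook like this in an indexing pipeline would have surfaced the bug described below much earlier.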

We recently encountered this in an Elasticsearch project. The customer reported unusual search behaviour around a boolean field called draft. In particular, documents which were thought to contain a true value for draft were being excluded by the query clause

{
  "query": {
    "bool": {
      "must_not": {
        "term": { "draft": false }
      },
      ...

The version of Elasticsearch was 2.4.5 and we examined the index with Sense on Kibana 4.6.3. The documents in question did indeed appear to have the value

{
  "draft": true,
  ...
}

and therefore should not have been excluded by the must_not query clause.

To get to the bottom of it, we used Marple to examine the terms in the index. Under the bonnet, the boolean type is indexed as the term “T” for true and “F” for false. The documents which were behaving oddly had both “T” and “F” terms for the draft field, and were therefore being excluded by the must_not clause. But how did the extra “F” term get in there?

After some more experimentation we tracked it down to a bug in our indexer application, which under certain conditions was creating documents with duplicate draft keys:

{
  "draft": false,
  "draft": true
  ...
}

So why was this not appearing in the Sense output? It turns out that Elasticsearch and Sense/Kibana interpret duplicate keys in different ways. When we used curl instead of Sense we could see both draft items in the _source field. Elasticsearch was behaving consistently, storing and indexing both draft fields. However, Sense/Kibana was quietly dropping the first instance of the field and displaying only the second, true, value.

I’ve not looked at the Sense/Kibana source code, but I imagine this is just a consequence of being implemented in Javascript. I tested this in Chrome (59.0.3071.115 on macOS) with the following script:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <script>
      var o = {
        s: "this is some text",
        b: true,
        b: false
      };

      console.log("value of o.b", o.b);
      console.log("value of o", JSON.stringify(o, "", 2));
    </script>
  </body>
</html>

which output (with no warnings)

value of o.b true
test.html:13 value of o {
 "s": "this is some text",
 "b": true
}

(in fact it turns out that order of b doesn’t matter, true always overrides false.)

Ultimately this wasn’t caused by any bugs in Elasticsearch, Kibana, Sense or Javascript, but the different way that duplicate JSON keys were being handled made finding the ultimate source of the problem harder than it needed to be. If you are using the Kibana console (or Sense with older versions) for Elasticsearch development then this might be a useful thing to be aware of.

I haven’t tested Solr’s handling of duplicate JSON keys yet but that would probably be an interesting exercise.

London Lucene/Solr Meetup: Query Pre-processing & SQL with Solr (2 Jun 2017)
http://www.flax.co.uk/blog/2017/06/02/london-lucenesolr-meetup-query-pre-processing-sql-solr/

Bloomberg kindly hosted the London Lucene/Solr Meetup last night and we were lucky enough to have two excellent speakers for the thirty or so attendees. René Kriegler kicked off with a talk about the Querqy library he has developed to provide a pre-processing layer for Solr (and soon, Elasticsearch) queries. This library was originally developed during a project for Germany’s largest department store Galeria Kaufhof and allows users to add a series of simple rules in a text file to raise or lower results containing certain words, filter out certain results, add synonyms and decompound words (particularly important for German!). We’ve seen similar rules-based systems in use at many of our e-commerce clients, but few of these work well with Solr (Hybris in particular has a poor integration with Solr and can produce some very strange Solr queries). In contrast, Querqy is open source and designed by someone with expert Solr knowledge. With the addition of a simple UI or an integration with a relevancy-testing framework such as Quepid, this could be a fantastic tool for day-to-day tuning of search relevance – without the need for Solr expertise. You can find Querqy on Github.

Michael Suzuki of Alfresco talked next about the importance of being bilingual (actually he speaks 4 languages!) and how new features in Solr version 6 allow one to use either Solr syntax, SQL expressions or a combination of both. This helps hide Solr’s complexity and also allows easy integration with database administration and reporting tools, while allowing use of Solr by the huge number of developers and database administrators familiar with SQL syntax. Using a test set from the IMDB movie archive he demonstrated how SQL expressions can be used directly on a Solr index to answer questions such as ‘what are the highest grossing film actors’. He then used visualisation tool Apache Zeppelin to produce various graphs based on these queries and also showed dbVisualizer, a commonly used database administration tool, connecting directly to Solr via JDBC and showing the index contents as if they were just another set of SQL tables. He finished by talking briefly about the new statistical programming features in Solr 6.6 – a powerful new development with features similar to the R language.

We continued with a brief Q&A session. Thanks to both our speakers – we’ll be back again soon!

Release 1.0 of Marple, a Lucene index detective (24 Feb 2017)
http://www.flax.co.uk/blog/2017/02/24/release-1-0-marple-lucene-index-detective/

Back in October at our London Lucene Hackday Flax’s Alan Woodward started to write Marple, a new open source tool for inspecting Lucene indexes. Since then we have made nearly 240 commits to the Marple GitHub repository, and are now happy to announce its first release.

Marple was envisaged as an alternative to Luke, a GUI tool for introspecting Lucene indexes. Luke is a powerful tool but its Java GUI has not aged well, and development is not as active as it once was. Whereas Luke uses native Java widgets, Marple achieves platform independence by using the browser as its UI platform. It has been developed as two loosely-coupled components: a Java web service (built with Dropwizard) exposing a REST/JSON API, and a UI implemented in React.js. This approach should make development simpler and faster, especially as there are (arguably) many more React experts around these days than native Java UI developers, and will also allow Marple’s index inspection functionality to be easily added to other applications.

Marple is, of course, named in honour of the famous fictional detective created by Agatha Christie.

What is Marple for? We have two broad use cases in mind: the first is as an aid for solving problems with Lucene indexes. With Marple, you can quickly examine fields, terms, doc values, etc. and check whether the index is being created as you expect, and that your search signals are valid. The other main area of use we imagine is as an educational tool. We have made an effort to make the API and UI designs reflect the underlying Lucene APIs and data structures as far as is practical. I have certainly learned a lot more about Lucene from developing Marple, and we hope that other people will benefit similarly.

The current release of Marple is not complete. It omits points entirely, and has only a simple UI for viewing documents (stored fields). However, there is a reasonably complete handling of terms and doc values. We’ll continue to develop Marple but of course any contributions are welcome.

You can download this first release of Marple here together with a small Lucene index of Project Gutenberg to inspect. Details of how to run Marple (you’ll need Java) are available in the README. Do let us know what you think – bug reports or feature requests can be submitted via Github. We’ll also be demonstrating Marple in London on March 23rd 2017 at the next London Lucene/Solr Meetup.
