As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr – thus the name – we have also been developing some features for Elasticsearch). These are now largely working, with some quirks which will be mentioned below (they may not even be quirks, but they seem non-intuitive to me, so deserve a mention). It’s been a slightly painful process, as you may infer from the use of italics below, and we hope this post will illustrate some of the differences between writing plugins for Solr and Elasticsearch.
It’s probably worth noting that at least some of this write-up is speculative. I’m not privy to the internals of Elasticsearch, and have been building the plugin through a combination of looking at the Elasticsearch source code (as advised by the documentation) and running the same integration test over and over again for each of the various versions, and checking what was returned in the search response. There is very little in the way of documentation, and the 1.x versions of Elasticsearch have almost no comments or Javadoc in the code. It has been interesting and fun, and not at all exasperating or frustrating.
The plugin code can be broken down into three broad sections:
- A core module, containing code shared between the Elasticsearch and Solr versions of the plugin. Anything in this module should be search engine agnostic, and is dedicated to accessing and pulling data from ontologies, either via the OLS service (provided by the European Bioinformatics Institute, our partners in the BioSolr project) or more generally OWL files, and returning a structure which can be used by the plugins.
- The es-ontology-annotator-core module, which is shared between all versions of the plugin, and contains Elasticsearch-specific code to build the helper classes required to access the ontology data.
- The es-ontology-annotator-esx.x modules, which are specific to the various versions of Elasticsearch. So far, there are five of these (one of the more challenging aspects of this work has been that the Elasticsearch mapper structure has been evolving through the versions, as has some of the internal infrastructure supporting them):
- 1.3 – for ES 1.3
- 1.4 – for ES 1.4
- 1.5 – for ES 1.5 – 1.7
- 2.0 – for ES 2.0
- 2.1 – for ES 2.1.1
- 2.2 – for ES 2.2
I haven’t tried the plugin with any versions of ES earlier than 1.3. There was a change to the internal mapping classes between 1.4 and 1.5 (UpdateInPlaceHashMap was removed and replaced with CopyOnWriteHashMap), presumably for a Very Good Reason. Versions since 1.5 seem to be forward compatible with later 1.x versions.
All of the versions of the plugin work in the same way. You specify in your mapping that a particular field has the type “ontology”. There are various additional properties that can be set, depending on whether you’re using an OWL file or OLS as your ontology data source (specified in the README). When the data is indexed, any information in that field is assumed to be an IRI referring to an ontology record, and will be used to fetch as much data as required/possible for that ontology record. The data will then be added as sub-fields to the ontology fields.
The new data is not added to the _source field, which is the easy way of seeing what data is in a stored record. In order to retrieve the new data, you have two options:
- Grab the mapping for your index, and look through it for the sub-fields of your annotation field. Use as many of these as you need to populate the fields property in your search request, making sure you name them fully (ie. annotation.uri, annotation.label, annotation.child_uris).
- Add all of the fields to the fields property in your search request (ie.
"fields": [ "*" ]).
What you cannot do is add “annotation.*” to your search request to get all of the annotation subfields. At this stage, this doesn’t work. I’m still working out whether this is possible or not.
How it works
All of the versions work in a broadly similar fashion: the OntologyMapper class extends AbstractFieldMapper (Elasticsearch 1.x) or FieldMapper (Elasticsearch 2.x). The Mapper classes all have two internal classes:
- a TypeParser, which reads the mapper’s configuration from the mapping details (as initially specified by the user, and as also returned from the Mapper.toXContent method), and returns…
- a Builder, which constructs the mappers for the known sub-fields and ultimately builds the Mapper class. The sub-field mappers are all for string fields, with mappers for URI fields having tokenisation disabled, while the other fields have it enabled. All are both indexed and stored.
The Mapper parses the content of the initial field (the IRI for the ontology record), and adds the sub-fields to the record, as part of the Mapper.parse method call (this is the most significant part of the Mapper code). There are at least two ways of doing this, and the Elasticsearch source code has both depending on which Mapper class you look at. There is no indication in the source why you would use one method over the other. This helps with clarity, especially when things aren’t working as they should.
What makes life more interesting for the OntologyMapper class is that not all of the sub-fields are known at start time. If the user wishes to index additional relationships between nodes (“participates in”, “has disease location”, etc.), these are generated on the fly, and the sub-fields need to be added to the mapping. Figuring out how to do this, and also how to make sure those fields are returned when the use requests the mapping for the index, has been a particular challenge.
The TypeParser is called more than once during the indexing process. My initial assumption was that once the mapping details had been read from the user’s specification, the parser was “fixed,” and so you had to keep track of the sub-field mappers yourself. This is not the case. As noted above, the TypeParser can also be fed from the Mapper’s toXContent method (which generates the mapping seen when you call the _mapping endpoint). Elasticsearch versions 1.x didn’t seem to care particularly what toXContent returned, so long as it could be parsed without throwing a NullPointerException, but Elasticsearch versions 2.x actually check that all of the mapping configuration has been dealt with. This actually makes life easier internally – after the mapper has processed a record, at least some of the dynamic field mappings are known, so you can build the sub-field mappers in the Builder rather than having to build them on the fly during the Mapper.parse process.
The other non-trivial Mapper methods are:
- toXContent, as mentioned several times already. This generates the mapping output (ie. the definition of the field as seen when you look via the _mapping endpoint).
- merge, which seems to do a compatibility check between an incoming instance of the mapper and the current instance. I’ve added some checks to this, but no significant code. Several of the implementations of this method in the Elasticsearch source code simply contain comments to the effect of “will return to this later”, so it seems I’m not the only person who doesn’t understand how merge works, or why it is called.
- traverse (Elasticsearch 1.x) and iterator (Elasticsearch 2.x), which seem to do similar things – namely providing a means to iterate through the sub-field mappers. In Elasticsearch 1.x, the traverse method is explicitly called as part of the process to add the new (dynamic) mappers to the mapping, but this isn’t a requirement for Elasticsearch 2.x. Elasticsearch 1.x distinguished between ObjectMappers and FieldMappers, which doesn’t seem to be a distinction in Elasticsearch 2.x.
Comparisons with the Solr plugin
The Solr plugin works somewhat differently to the Elasticsearch one. The Solr plugin is implemented as an UpdateRequestProcessor, and adds new fields directly to the incoming record (it doesn’t add sub-fields). This makes the returned data less tidy, but also easier to handle, since all of the new fields have the same prefix and can therefore be handled directly. You don’t need to explicitly tell Solr to return the new fields – because they are all stored, they are all returned by default.
On the other hand, you still have to jump through some hoops to work out which fields are dynamically generated, if you need to do that (i.e. to add checkboxes to a form to search “has disease location” or other relationships) – you need to call Solr to retrieve the schema, and use that as the basis for working out which are the new fields. For Elasticsearch, you have to request the mapping for your index, and use that in a similar way.
Configuration in Solr requires modifying the solrconfig.xml, once the plugin JAR file is in place, but doesn’t require any changes to the schema. All of the Elasticsearch configuration happens in the mapping definition. This reflects the different ways of implementing the plugin for Solr. I don’t have a particular feeling for whether it would have been better to implement the Solr plugin as a new field type – I did investigate, and it seemed much harder to do this, but it might be worth re-visiting if there is time available.
The Solr plugin was much easier to write, simply because the documentation is better. The Solr wiki has a very useful base page for writing a new UpdateRequestProcessor, and the source code has plenty of comments and Javadoc (although it’s not perfect in this respect – SolrCoreAware has no documentation at all, has been present since Solr 1.3, and was a requirement for keeping track of the Ontology helper threads).
I will most likely update this post as I become aware of things I have done which are wrong, or any misinformation it contains. We’ll also be talking further about the BioSolr project at a workshop event on February 3rd/4th 2016. We welcome feedback and comments, of course – especially from the wider Elasticsearch developer community.