ontology – Flax
http://www.flax.co.uk – The Open Source Search Specialists

The fun and frustration of writing a plugin for Elasticsearch for ontology indexing
http://www.flax.co.uk/blog/2016/01/27/fun-frustration-writing-plugin-elasticsearch-ontology-indexing/ – Wed, 27 Jan 2016

As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr – thus the name – we have also been developing some features for Elasticsearch). These are now largely working, with some quirks which will be mentioned below (they may not even be quirks, but they seem non-intuitive to me, so deserve a mention). It’s been a slightly painful process, as you may infer from the use of italics below, and we hope this post will illustrate some of the differences between writing plugins for Solr and Elasticsearch.

It’s probably worth noting that at least some of this write-up is speculative. I’m not privy to the internals of Elasticsearch, and have been building the plugin through a combination of looking at the Elasticsearch source code (as advised by the documentation) and running the same integration test over and over again for each of the various versions, and checking what was returned in the search response. There is very little in the way of documentation, and the 1.x versions of Elasticsearch have almost no comments or Javadoc in the code. It has been interesting and fun, and not at all exasperating or frustrating.

The code

The plugin code can be broken down into three broad sections:

  • A core module, containing code shared between the Elasticsearch and Solr versions of the plugin. Anything in this module should be search engine agnostic, and is dedicated to accessing and pulling data from ontologies, either via the OLS service (provided by the European Bioinformatics Institute, our partners in the BioSolr project) or more generally OWL files, and returning a structure which can be used by the plugins.
  • The es-ontology-annotator-core module, which is shared between all versions of the plugin, and contains Elasticsearch-specific code to build the helper classes required to access the ontology data.
  • The es-ontology-annotator-esx.x modules, which are specific to the various versions of Elasticsearch. So far, there are six of these (one of the more challenging aspects of this work has been that the Elasticsearch mapper structure has been evolving through the versions, as has some of the internal infrastructure supporting them):
    • 1.3 – for ES 1.3
    • 1.4 – for ES 1.4
    • 1.5 – for ES 1.5 – 1.7
    • 2.0 – for ES 2.0
    • 2.1 – for ES 2.1.1
    • 2.2 – for ES 2.2

I haven’t tried the plugin with any versions of ES earlier than 1.3. There was a change to the internal mapping classes between 1.4 and 1.5 (UpdateInPlaceHashMap was removed and replaced with CopyOnWriteHashMap), presumably for a Very Good Reason. Versions since 1.5 seem to be forward compatible with later 1.x versions.

The quirks

All of the versions of the plugin work in the same way. You specify in your mapping that a particular field has the type “ontology”. There are various additional properties that can be set, depending on whether you’re using an OWL file or OLS as your ontology data source (specified in the README). When the data is indexed, any information in that field is assumed to be an IRI referring to an ontology record, and will be used to fetch as much data as required/possible for that ontology record. The data will then be added as sub-fields to the ontology fields.
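As a concrete sketch, a mapping using the plugin might look something like this – the "ontology" field type is the plugin's, but the ontologyURI property name and value here are illustrative; the exact options for OWL and OLS data sources are in the README:

```json
{
  "mappings": {
    "document": {
      "properties": {
        "annotation": {
          "type": "ontology",
          "ontologyURI": "http://www.ebi.ac.uk/efo/efo.owl"
        }
      }
    }
  }
}
```

With a mapping like this in place, indexing a document whose annotation field contains an IRI triggers the ontology lookup, and sub-fields such as annotation.label are populated automatically.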

The new data is not added to the _source field, which is the easy way of seeing what data is in a stored record. In order to retrieve the new data, you have two options:

  • Grab the mapping for your index, and look through it for the sub-fields of your annotation field. Use as many of these as you need to populate the fields property in your search request, making sure you name them fully (ie. annotation.uri, annotation.label, annotation.child_uris).
  • Add all of the fields to the fields property in your search request (ie. "fields": [ "*" ]).

What you cannot do, at this stage, is add “annotation.*” to your search request to retrieve all of the annotation sub-fields. I’m still working out whether it is possible to support this.
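To illustrate the first of those options, a search request naming the sub-fields in full (using the sub-field names mentioned above) might look like this:

```json
{
  "query": { "match_all": {} },
  "fields": [ "annotation.uri", "annotation.label", "annotation.child_uris" ]
}
```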

How it works

All of the versions work in a broadly similar fashion: the OntologyMapper class extends AbstractFieldMapper (Elasticsearch 1.x) or FieldMapper (Elasticsearch 2.x). The Mapper classes all have two internal classes:

  • a TypeParser, which reads the mapper’s configuration from the mapping details (as initially specified by the user, and as also returned from the Mapper.toXContent method), and returns…
  • a Builder, which constructs the mappers for the known sub-fields and ultimately builds the Mapper class. The sub-field mappers are all for string fields, with mappers for URI fields having tokenisation disabled, while the other fields have it enabled. All are both indexed and stored.

The Mapper parses the content of the initial field (the IRI for the ontology record), and adds the sub-fields to the record, as part of the Mapper.parse method call (this is the most significant part of the Mapper code). There are at least two ways of doing this, and the Elasticsearch source code has both depending on which Mapper class you look at. There is no indication in the source why you would use one method over the other. This helps with clarity, especially when things aren’t working as they should.

What makes life more interesting for the OntologyMapper class is that not all of the sub-fields are known at start time. If the user wishes to index additional relationships between nodes (“participates in”, “has disease location”, etc.), these are generated on the fly, and the sub-fields need to be added to the mapping. Figuring out how to do this, and also how to make sure those fields are returned when the user requests the mapping for the index, has been a particular challenge.
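To illustrate, after indexing a record with such relationships, the mapping for the field might come back with extra sub-fields alongside the fixed ones. The relationship field names below are hypothetical; the sub-field settings follow the rules described above (URI fields not tokenised, everything indexed and stored):

```json
{
  "annotation": {
    "type": "ontology",
    "fields": {
      "uri":        { "type": "string", "index": "not_analyzed", "store": true },
      "label":      { "type": "string", "store": true },
      "child_uris": { "type": "string", "index": "not_analyzed", "store": true },
      "participates_in_rel_uris":   { "type": "string", "index": "not_analyzed", "store": true },
      "participates_in_rel_labels": { "type": "string", "store": true }
    }
  }
}
```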

The TypeParser is called more than once during the indexing process. My initial assumption was that once the mapping details had been read from the user’s specification, the parser was “fixed,” and so you had to keep track of the sub-field mappers yourself. This is not the case. As noted above, the TypeParser can also be fed from the Mapper’s toXContent method (which generates the mapping seen when you call the _mapping endpoint). Elasticsearch versions 1.x didn’t seem to care particularly what toXContent returned, so long as it could be parsed without throwing a NullPointerException, but Elasticsearch versions 2.x actually check that all of the mapping configuration has been dealt with. This actually makes life easier internally – after the mapper has processed a record, at least some of the dynamic field mappings are known, so you can build the sub-field mappers in the Builder rather than having to build them on the fly during the Mapper.parse process.

The other non-trivial Mapper methods are:

  • toXContent, as mentioned several times already. This generates the mapping output (ie. the definition of the field as seen when you look via the _mapping endpoint).
  • merge, which seems to do a compatibility check between an incoming instance of the mapper and the current instance. I’ve added some checks to this, but no significant code. Several of the implementations of this method in the Elasticsearch source code simply contain comments to the effect of “will return to this later”, so it seems I’m not the only person who doesn’t understand how merge works, or why it is called.
  • traverse (Elasticsearch 1.x) and iterator (Elasticsearch 2.x), which seem to do similar things – namely providing a means to iterate through the sub-field mappers. In Elasticsearch 1.x, the traverse method is explicitly called as part of the process to add the new (dynamic) mappers to the mapping, but this isn’t a requirement for Elasticsearch 2.x. Elasticsearch 1.x distinguished between ObjectMappers and FieldMappers, which doesn’t seem to be a distinction in Elasticsearch 2.x.

Comparisons with the Solr plugin

The Solr plugin works somewhat differently to the Elasticsearch one. The Solr plugin is implemented as an UpdateRequestProcessor, and adds new fields directly to the incoming record (it doesn’t add sub-fields). This makes the returned data less tidy, but also easier to handle, since all of the new fields have the same prefix and can therefore be handled directly. You don’t need to explicitly tell Solr to return the new fields – because they are all stored, they are all returned by default.

On the other hand, you still have to jump through some hoops to work out which fields are dynamically generated, if you need to do that (i.e. to add checkboxes to a form to search “has disease location” or other relationships) – you need to call Solr to retrieve the schema, and use that as the basis for working out which are the new fields. For Elasticsearch, you have to request the mapping for your index, and use that in a similar way.
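As a sketch of that discovery step on the Elasticsearch side, here is one way to pick the dynamically-generated sub-fields out of a _mapping response, assuming you know the fixed sub-field names. The mapping fragment and the relationship field name are hypothetical stand-ins for a live _mapping call:

```python
# Sub-fields every ontology field always has; anything else in the
# mapping must have been generated dynamically from a relationship.
FIXED_SUBFIELDS = {"uri", "label", "child_uris"}

def dynamic_subfields(mapping, field="annotation"):
    """Return the names of dynamically-generated sub-fields of `field`."""
    subfields = mapping.get(field, {}).get("fields", {})
    return sorted(name for name in subfields if name not in FIXED_SUBFIELDS)

# A hypothetical fragment of a _mapping response for the annotation field:
mapping = {
    "annotation": {
        "type": "ontology",
        "fields": {
            "uri": {"type": "string"},
            "label": {"type": "string"},
            "child_uris": {"type": "string"},
            "participates_in_rel_uris": {"type": "string"},
        },
    }
}

print(dynamic_subfields(mapping))  # ['participates_in_rel_uris']
```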

Configuration in Solr requires modifying the solrconfig.xml, once the plugin JAR file is in place, but doesn’t require any changes to the schema. All of the Elasticsearch configuration happens in the mapping definition. This reflects the different ways of implementing the plugin for Solr. I don’t have a particular feeling for whether it would have been better to implement the Solr plugin as a new field type – I did investigate, and it seemed much harder to do this, but it might be worth re-visiting if there is time available.
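For comparison, wiring an UpdateRequestProcessor into Solr follows the standard solrconfig.xml pattern below. The factory class name and parameters here are illustrative placeholders, not necessarily the plugin's exact ones:

```xml
<updateRequestProcessorChain name="ontology">
  <!-- Hypothetical factory name and params: see the plugin README for the real ones -->
  <processor class="uk.co.flax.biosolr.ontology.OntologyUpdateProcessorFactory">
    <str name="annotationField">annotation_uri</str>
    <str name="olsBaseURL">http://www.ebi.ac.uk/ols/beta/api</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is then referenced from the update handler (for example via the update.chain request parameter) so that it runs on incoming documents.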

The Solr plugin was much easier to write, simply because the documentation is better. The Solr wiki has a very useful base page for writing a new UpdateRequestProcessor, and the source code has plenty of comments and Javadoc (although it’s not perfect in this respect – SolrCoreAware has no documentation at all, has been present since Solr 1.3, and was a requirement for keeping track of the Ontology helper threads).

I will most likely update this post as I become aware of things I have done which are wrong, or any misinformation it contains. We’ll also be talking further about the BioSolr project at a workshop event on February 3rd/4th 2016. We welcome feedback and comments, of course – especially from the wider Elasticsearch developer community.

Lucene/Solr Revolution 2015: BioSolr – Searching the stuff of life
http://www.flax.co.uk/blog/2015/10/16/lucenesolr-revolution-2015-biosolr-searching-the-stuff-of-life/ – Fri, 16 Oct 2015

Slides: BioSolr – Searching the stuff of life – Lucene/Solr Revolution 2015, from Charlie Hull

ISKO UK – Taming the News Beast
http://www.flax.co.uk/blog/2014/04/02/isko-uk-taming-the-news-beast/ – Wed, 02 Apr 2014

I spent yesterday afternoon at UCL for ISKO UK’s event on Taming the News Beast – I’m not sure if we found out how to tame it, but we certainly heard how to festoon it with metadata and lock it up in a nice secure ontology. There were around 90 people attending from news, content, technology and academic organisations, including quite a few young journalism students visiting London from Missouri.

The first talk was by Matt Shearer of BBC News Labs, who described how they are working on automatically extracting entities from video/audio content (including verbatim transcripts, contributors identified using face/voice recognition, objects using audio/image recognition, topics, actions and non-verbal events such as clapping). Their prototype ‘Juicer’ extractor currently works with around 680,000 source items and applies 5.7 million tags – which represents around 9 man-years of work for a manual tagger. They are using Stanford NLP and DBpedia heavily, as well as an internal BBC project, ‘Mango’ – I hope that some of the software they are developing is eventually open sourced, as after all this is a publicly-funded broadcaster.

His colleague Jeremy Tarling was next, and described a News Storyline concept they had been working on as a new basis for the BBC News website (which apparently hasn’t changed much in 17 years, and still depends on a lot of manual tagging by journalists). The central concept of a storyline (e.g. ‘US spy scandal’) can form a knowledge graph, linked to events (‘Snowden leaves airport’), videos, ‘explainer’ stories, background items etc. Topics can be used to link storylines together. This was a fascinating idea, well explained, and something other news organisations should certainly take note of.

Next was Rob Corrao of LAC Group describing how they had helped ABC News revolutionize their existing video library which contains over 2 million assets. They streamlined the digitization process, moved little-used analogue assets out of expensive physical storage, re-organised teams and shift patterns and created a portal application to ease access to the new ‘video library as a service’. There was a focus on deep reviews of existing behaviour and a pragmatic approach to what did and didn’t need to be digitized. This was a talk more about process and management rather than technology but the numbers were impressive: at the end of the project they were handling twice the volume with half the people.

Ian Roberts from the University of Sheffield then described AnnoMarket, a cloud-based market platform for text analytics, which wraps the rather over-complex open source GATE project in an API with easy scalability. As they have focused on precision over recall, AnnoMarket beats other cloud-based NLP services such as OpenCalais and TextRazor in terms of accuracy, and can process impressive volumes of documents (10 million in a few hours was quoted). They have developed custom pipelines for news, biomedical and Twitter content with the former linked into the Press Association‘s ontology (PA is a partner in AnnoMarket). For those wanting to carry out entity extraction and similar processes on large volumes of content at low cost AnnoMarket certainly looks attractive.

Next was Pete Sowerbutts of PA on the prototype interface he had helped develop for tagging all of PA’s 3,000 daily news stories with entity information. I hadn’t known how influential PA is in the UK news sector – apparently 30% of all UK news is a direct copy of a PA feed, and they estimate 70% is influenced by PA’s content. The UI showed how entities that have been automatically extracted can be easily confirmed by PA’s staff, allowing them to check that the right entity is being used (the example being Chris Evans, a name which could refer to a UK MP, a television personality or an American actor). One would assume the extractor produces some kind of confidence measure, which raises the question of whether every single entity must be manually confirmed – but then again, PA must retain their reputation for high quality.

The event finished with a brief open discussion featuring some of the speakers on an informal panel, followed by networking over drinks and snacks. Thanks to all at ISKO especially Helen Lippell for organising what proved to be a very interesting day.

Cambridge Search Meetup – six degrees of ontology and Elasticsearching products
http://www.flax.co.uk/blog/2014/03/07/cambridge-search-meetup-six-degrees-of-ontology-and-elasticsearching-products/ – Fri, 07 Mar 2014

Last Wednesday evening the Cambridge Search Meetup was held, with two very different talks – we started with Zoë Rose, an information architect who has lent her expertise to Proquest, the BBC and now the UK Government. She gave an engaging talk on ontologies, showing how they can be useful for describing things that don’t easily fit into traditional taxonomies, and how they can even be used for connecting Emperor Hirohito of Japan to Kevin Bacon in less than six steps. Along the way we learnt about sea creatures that lose their spines, Zoë’s very Australian dislike of jellyfish and other stinging sea dwellers, and her own self-cleaning fish tank at home.

As search developers, we’re often asked to work with both taxonomies and ontologies, and the challenge is how to represent them in a flat, document-focused index – perhaps ontologies are better represented by linked data stores such as those provided by Apache Marmotta.

Next was Jurgen Van Gael of Rangespan, a company that provides an easy way for retailers to expand their online inventory beyond what is available in brick-and-mortar stores (customers include Tesco, Argos and Staples). Jurgen described how product data is gathered into MongoDB and MySQL databases, processed and cleaned up on an Apache Hadoop cluster, and finally indexed using Elasticsearch to provide a search application for Rangespan’s customers. Although indexing of 50 million items takes only 75 minutes, most of the source data is only updated daily. Jurgen described how hierarchical facets are available, and also how users may create ‘shortlists’ of products they are interested in – which are stored directly in Elasticsearch itself, acting as a simple NoSQL database.

For me, one of the interesting points from his talk was why Elasticsearch was chosen as a technology – it was tried during a hack day, found to be very easy to scale and to get started with, and then quickly became a significant part of their technology stack. Some years ago, implementing this kind of product search over tens of millions of items would have been a significant challenge (not to mention very expensive) – with Elasticsearch and other open source software this is simply no longer the case.

Networking and socialising continued into the evening, with live music in the pub downstairs. Thanks to everyone who came and in particular our two speakers. We’ll be back with another Meetup soon!

Search Solutions 2011 review
http://www.flax.co.uk/blog/2011/11/17/search-solutions-2011-review/ – Thu, 17 Nov 2011

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search where different units are concerned – for example, a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown, including a postcode-based browser. The system is based on Apache Solr, and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need, as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently.) Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV programme recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked, amongst other things, what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.
