Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others
Mon, 15 Feb 2016

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other. Read about the first day here.

The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up, especially in analytics and visualisation. Eric foresees Elasticsearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tummarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets, and I was impressed with the range of potential applications, including in the life sciences.

Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Although both Solr and CLucene had been considered, they eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.

Next up was Nikos Marinos from the EBI Literature Services team talking about their recent migration from Lucene to Solr. As he explained most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-use of software where possible – key strengths of open source of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.

We then concluded with another hands-on session, this time aimed more at Elasticsearch. As you can probably tell, we had been shown a huge variety of different search needs and solutions using a range of technologies over the two days, and it was clear to me that the BioSolr project is only a small first step towards improving the software available – we have applied for further funding and we hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.

Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), the EBI for kindly hosting the event and in particular to Dr Sameer Velankar who has been the driving force behind this project.

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr
Wed, 10 Feb 2016

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other.

The day started with a quick recap of the project from myself and Dr Sameer Velankar of the EBI. Eric Pugh, founder of Flax’s US partners Open Source Connections, followed with his Unofficial State of Solr, detailing the history of the project, recent innovations and what might happen in the future, including some very interesting new features allowing for parallel SQL queries. We then heard from Flax team members Tom Winch and Matt Pearce on how they have built faceting improvements, a new XJoin between Solr and external systems, researched federated search and developed ontology indexers (note that all of the software they’ve built is available as open source, and Tom has recently written extensively about XJoin).

After lunch we heard from Peter Meric of the NCBI (the US equivalent of the EBI) on a Solr-based system for searching gene data, to supplement the NCBI’s homegrown Entrez system. This is very much a filtered search rather than a text search and indexes around 330m records. He also talked about a High Availability prototype of a replacement for the very high traffic PubMed service built on Amazon Web Services. Each Solr, MongoDB or Zookeeper node ‘announces’ itself using a monitor service and then replicates data from a master node. Although it is not yet available as open source I think this project may be of great interest to the wider Solr community and I hope we hear more of it soon.

Next up was a brief talk by Dan Bolser of the EBI on an ‘old school’ scheme for sharding plant phenotype data – I’d seen part of this presentation before and it’s linked to our own ideas on federating search across bioinformatics data. Dan was followed by Lewis Geer of NCBI talking about the SEQR protein similarity search engine built on Solr. Although somewhat complex for us non-biologists to understand, this very clever system relies on experimental results to suggest which of the possible variants of a protein system are likely, and adds these to the Solr index – it reminded me of a similar approach we’ve used to store possible OCR errors when working with scanned newsprint. His team’s code is available. Dan Stainer of the Ensembl project was next, discussing how his team are indexing tens of thousands of genomes from thousands of species, currently on a MySQL backend with a REST API and a lot of Perl. He discussed how they have been experimenting with Elasticsearch to index around 3.2bn items, creating a 782GB index which builds in around 5-6 hours, to provide new capabilities such as structured queries for their genome browser tools.

We then held an interactive hands-on session, covering subjects such as ‘getting started with Solr’ and exploring some of the code we’ve built such as XJoin, followed by a conference dinner in Hinxton Hall. It was clear that there is a huge range of use cases for search technology in the life sciences community and almost as many different ways to address them, and the after-dinner conversation was lively and highly interesting!

Most of the presentations are now available for download and we’ve also written about the second day of the event, where we shifted focus onto Elasticsearch and other technologies.

XJoin for Solr, part 2: a click-through example
Fri, 29 Jan 2016

In my last blog post, I demonstrated how to set up and configure Solr to use the new XJoin search components we’ve developed for the BioSolr project, using an example from an e-commerce setting. This time, I’ll show how to use XJoin to make use of user click-through data to influence the score of products in searches.

I’ll step through things a bit more quickly this time around, and I’ll be using code from the last post, so reading that first is highly recommended. I’ll assume that the prerequisites from last time have been installed and set up in the same directories.

The design

Suppose we have a web page for searching a collection of products, and when a user clicks on a product listing in the result set (or perhaps, when they subsequently go on to buy that product – or both) we insert a record in an SQL database, storing the product id, the query terms they used, and an arbitrary weight value (which will depend on whether they merely clicked on a result, or if they went on to buy it, or some other behaviour such as mouse pointer tracking). We then want to use the click-through data stored in that database to boost products in searches that use those query terms again.

We could use the sum of the weights of all occurrences of a product id/query term combination as the product score boost, but then we might start to worry about a feedback process occurring. Alternatively, we might take the maximum or average weight across the occurrences. In the code below, we’ll use the maximum.
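
In SQLite terms (using the click table and full-text MATCH query from the example API below), the choice comes down to the aggregate function:

-- sum of weights: every click adds to the boost, so already-boosted
-- products attract more clicks and the boost keeps growing (feedback)
SELECT id, SUM(weight) FROM click WHERE q MATCH ? GROUP BY id;

-- maximum weight: the option used below - a product's boost is capped
-- by its single strongest interaction for those query terms
SELECT id, MAX(weight) FROM click WHERE q MATCH ? GROUP BY id;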

The advantage of this design over storing the click-through information in Solr is that you don’t have to update the Solr index every time there is user activity, which could become costly. An SQL database is much more suited to this task.

The external click-through API

Again, we’ll be using Python 3 (using the flask and sqlite3 modules) to implement the external API. I’ll be using this API to update the click-through database (by hand, for this example) as well as having Solr query it using XJoin. Here’s the code (partly based on code taken from here for caching the database connection in the Flask application context, and see here if you’re interested in more details about sqlite3’s support for full text search). Again, all the code written for this example is also available in the BioSolr GitHub repository:

from flask import Flask, request, g
import json
import sqlite3 as sql

# flask application context attribute for caching database connection
DB_APP_KEY = '_database'

# default weight for storing against queries
DEFAULT_WEIGHT = 1.0

app = Flask(__name__)

def get_db():
  """ Obtain a (cached) DB connection and return a cursor for it.
  """
  db = getattr(g, DB_APP_KEY, None)
  if db is None:
    db = sql.connect('click.db')
    setattr(g, DB_APP_KEY, db)
    c = db.cursor()
    c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS click USING fts4 ("
                "id VARCHAR(256),"
                "q VARCHAR(256),"
                "weight FLOAT"
              ")")
    c.close()
  return db

@app.teardown_appcontext
def teardown_db(exception):
  db = getattr(g, DB_APP_KEY, None)
  if db is not None:
    db.close()

@app.route('/')
def main():
  return 'click-through API'

@app.route('/click/<path:id>', methods=["PUT"])
def click(id):
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  try:
    w = float(request.args.get('weight', DEFAULT_WEIGHT))
  except ValueError:
    return 'Could not parse weight', 400

  # do the DB update (create the cursor outside the try block so it is
  # always bound when the finally clause closes it)
  db = get_db()
  c = db.cursor()
  try:
    c.execute("INSERT INTO click (id, q, weight) VALUES (?, ?, ?)", (id, q, w))
    db.commit()
    return 'OK'
  finally:
    c.close()

@app.route('/ids')
def ids():
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  
  # do the DB lookup
  c = get_db().cursor()
  try:
    c.execute("SELECT id, MAX(weight) FROM click WHERE q MATCH ? GROUP BY id", (q, ))
    return json.dumps([{ 'id': id, 'weight': w } for id, w in c])
  finally:
    c.close()

if __name__ == "__main__":
  app.run(port=8001, debug=True)

This web API exposes two end-points. First we have PUT /click/[id], which is used when we want to update the SQL database after a user click. For the purposes of this demonstration, we’ll be hitting this end-point by hand using curl, to avoid having to write a web UI. The other end-point, GET /ids?q=[query terms], is used by our XJoin component and returns a JSON-formatted array of id/weight objects for products whose stored query terms match those given in the query string.

Java glue code

Now we just need the Java glue code that sits between the XJoin component and our external API. Here’s an implementation of XJoinResultsFactory that does what we need:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class ClickXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
  }

  /**
   * Use 'click' REST API to fetch current click data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    String q = URLEncoder.encode(params.get("q"), "UTF-8");
    String apiUrl = url + "?q=" + q;
    try (HttpConnection http = new HttpConnection(apiUrl)) {
      JsonArray products = (JsonArray)http.getJson();
      return new ClickResults(products);
    }
  }
    
  public class ClickResults implements XJoinResults {
    private Map<String, Click> clickMap;
    
    public ClickResults(JsonArray products) {
      clickMap = new HashMap<>();
      for (JsonValue product : products) {
        JsonObject object = (JsonObject)product;
        String id = object.getString("id");
        double weight = object.getJsonNumber("weight").doubleValue();
        clickMap.put(id, new Click(id, weight));
      }
    }
    
    public int getCount() {
      return clickMap.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      return clickMap.keySet();
    }

    @Override
    public Object getResult(String id) {
      return clickMap.get(id);
    }      
  }
  
  public class Click {
    
    private String id;
    private double weight;
    
    public Click(String id, double weight) {
      this.id = id;
      this.weight = weight;
    }
    
    public String getId() {
      return id;
    }
    
    public double getWeight() {
      return weight;
    } 
  }
}

Unlike the previous example, this time getResults() does depend on the SolrParams argument, so that the user’s query, q, is passed to the external API. Store this Java source in blog/src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java and compile it into a JAR (again, we also need the HttpConnection class from the last blog post as well as javax.json-1.0.4.jar):

blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java
blog$ jar cvf click.jar -C bin .

Solr configuration

Starting with a fresh version of solrconfig.xml, insert these lines near the start to import the XJoin and user JARs (substitute /XXX with the full path to the parent of the blog directory):

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/click.jar" />

Then add the XJoin query parser, value source parser, search component and request handler configuration:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="weight" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">weight</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_click" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.ClickXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8001/ids</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_click">false</bool>
    <str name="x_click.results">count</str>
    <str name="x_click.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_click</str>
  </arr>
  <arr name="last-components">
    <str>x_click</str>
  </arr>
</requestHandler>

Reload the Solr core (products) to get the new config in place.
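
If you prefer the command line to the admin UI, the core admin API can do the reload:

blog$ curl 'http://localhost:8983/solr/admin/cores?action=RELOAD&core=products'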

Putting the pieces together

The following query will verify our Solr setup (remembering to escape curly brackets):

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&fl=id,name,score&rows=4' | jq .

I’ve used Solr parameter substitution with the q/qq parameters, which will simplify later queries (parameter substitution has been available since Solr 5.1). This query returns:

{
  "responseHeader": {
    "status": 0,
    "QTime": 25
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 2.9939778,
    "docs": [
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.8712361
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 1.8712361
      }
    ]
  }
}

There are some repeated products in the data, but so far, so good. Next, get the click-through API running:

blog$ python3 click.py

And check it’s working (this should return [] for any query, because the click-through database is empty):

curl localhost:8001/ids?q=software | jq .

Now, let’s populate the click-through database by simulating user activity. Suppose, given the above product results, the user goes on to click through to the fourth product (or even buy it). Then, the UI would update the click web API to indicate this has happened. Let’s do this by hand, specifying the product id, the user’s query, and a weight score (here, I’ll use the value 3, supposing the user bought the product in the end):

curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/9200068133591804002?q=excel&weight=3'

Now, we can check the output that XJoin will see when using the click-through API:

blog$ curl localhost:8001/ids?q=excel | jq .

giving:

[
  {
    "weight": 3,
    "id": "http://www.google.com/base/feeds/snippets/9200068133591804002"
  }
]

Using the bf edismax parameter and the weight function set up in solrconfig.xml to extract the weight value from the external results stored in the x_click XJoin search component, we can boost product scores when they appear in the click-through database for the user’s query:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

which gives:

{
  "responseHeader": {
    "status": 0,
    "QTime": 13
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 3.2224145,
    "docs": [
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 3.2224145
      },
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.5559989
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
          "weight": 3
        }
      }
    ]
  }
}

Lo and behold, the product the user clicked on now appears at the top of the Solr results for that query. Have a play with the API, generate some more user activity and see how this affects subsequent queries. It will cope fine with multiple-word queries; for example, suppose a user searches for ‘games software’:

curl 'http://localhost:8983/solr/products/xjoin?qq=games+software&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

There being no relevant queries in the click-through database, this has the same results as for a query without the XJoin, and as we can see, the value of response.x_click.count is 0:

{
  "responseHeader": {
    "status": 0,
    "QTime": 15
  },
  "response": {
    "numFound": 1158,
    "start": 0,
    "maxScore": 0.91356516,
    "docs": [
      {
        "name": "encore software 10568 - encore hoyle puzzle & board games 2005 - complete product - puzzle game - 1 user - complete product - standard - pc",
        "id": "http://www.google.com/base/feeds/snippets/4998847858583359731",
        "score": 0.91356516
      },
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 0.8699497
      },
      {
        "name": "encore software 10027 - hoyle board games (win 98 me 2000 xp)",
        "id": "http://www.google.com/base/feeds/snippets/8664755713112971171",
        "score": 0.85982025
      },
      {
        "name": "encore software 11253 - brain food games: cranium collection 2006 sb cs by encore",
        "id": "http://www.google.com/base/feeds/snippets/15401280256033043239",
        "score": 0.78744644
      }
    ]
  },
  "x_click": {
    "count": 0,
    "external": []
  }
}

Now let’s simulate the same user clicking on the second product (with default weight):

blog$ curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/826668451451666270?q=games+software'

Next, suppose another user then searches for just ‘games’:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=games&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

In the results, we see the ‘wild games’ product boosted to the top:

{
  "responseHeader": {
    "status": 0,
    "QTime": 60
  },
  "response": {
    "numFound": 212,
    "start": 0,
    "maxScore": 1.3652229,
    "docs": [
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 1.3652229
      },
      {
        "name": "xbox 360: ddr universe",
        "id": "http://www.google.com/base/feeds/snippets/16659259513615352372",
        "score": 0.95894843
      },
      {
        "name": "south park chef's luv shack",
        "id": "http://www.google.com/base/feeds/snippets/11648097795915093399",
        "score": 0.95894843
      },
      {
        "name": "egames. inc casual games pack",
        "id": "http://www.google.com/base/feeds/snippets/16700933768709687512",
        "score": 0.89483213
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
          "weight": 1
        }
      }
    ]
  }
}

Extensions

Of course, this approach can be extended to add in more sophisticated weighting and boosting strategies, or include more data about the user activity than just a simple weight score, which could be used to augment the display of the product in the UI (for example, “ten customers in the UK bought this product in the last month”).
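
As a rough sketch of that last idea, here is an extra end-point that could be added to click.py – note that the /stats route is a hypothetical addition of mine, not part of the example code above, and a real version might also record and filter on country and date:

# Hypothetical addition to click.py: a per-product activity summary which
# a UI could use for messages like "N customers chose this product".
@app.route('/stats/<path:id>')
def stats(id):
  c = get_db().cursor()
  try:
    # TOTAL() is SQLite's NULL-safe sum, returning 0.0 when there are no rows
    c.execute("SELECT COUNT(*), TOTAL(weight) FROM click WHERE id = ?", (id,))
    count, total = c.fetchone()
    return json.dumps({'id': id, 'clicks': count, 'total_weight': total})
  finally:
    c.close()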

The XJoin patch was developed as part of the BioSolr project, but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

The fun and frustration of writing a plugin for Elasticsearch for ontology indexing
Wed, 27 Jan 2016

As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr – thus the name – we have also been developing some features for Elasticsearch). These are now largely working, with some quirks which will be mentioned below (they may not even be quirks, but they seem non-intuitive to me, so deserve a mention). It’s been a slightly painful process, as you may infer from the use of italics below, and we hope this post will illustrate some of the differences between writing plugins for Solr and Elasticsearch.

It’s probably worth noting that at least some of this write-up is speculative. I’m not privy to the internals of Elasticsearch, and have been building the plugin through a combination of looking at the Elasticsearch source code (as advised by the documentation) and running the same integration test over and over again for each of the various versions, and checking what was returned in the search response. There is very little in the way of documentation, and the 1.x versions of Elasticsearch have almost no comments or Javadoc in the code. It has been interesting and fun, and not at all exasperating or frustrating.

The code

The plugin code can be broken down into three broad sections:

  • A core module, containing code shared between the Elasticsearch and Solr versions of the plugin. Anything in this module should be search engine agnostic, and is dedicated to accessing and pulling data from ontologies, either via the OLS service (provided by the European Bioinformatics Institute, our partners in the BioSolr project) or more generally OWL files, and returning a structure which can be used by the plugins.
  • The es-ontology-annotator-core module, which is shared between all versions of the plugin, and contains Elasticsearch-specific code to build the helper classes required to access the ontology data.
  • The es-ontology-annotator-esx.x modules, which are specific to the various versions of Elasticsearch. So far, there are six of these (one of the more challenging aspects of this work has been that the Elasticsearch mapper structure has been evolving through the versions, as has some of the internal infrastructure supporting them):
    • 1.3 – for ES 1.3
    • 1.4 – for ES 1.4
    • 1.5 – for ES 1.5 – 1.7
    • 2.0 – for ES 2.0
    • 2.1 – for ES 2.1.1
    • 2.2 – for ES 2.2

I haven’t tried the plugin with any versions of ES earlier than 1.3. There was a change to the internal mapping classes between 1.4 and 1.5 (UpdateInPlaceHashMap was removed and replaced with CopyOnWriteHashMap), presumably for a Very Good Reason. Versions since 1.5 seem to be forward compatible with later 1.x versions.

The quirks

All of the versions of the plugin work in the same way. You specify in your mapping that a particular field has the type “ontology”. There are various additional properties that can be set, depending on whether you’re using an OWL file or OLS as your ontology data source (specified in the README). When the data is indexed, any information in that field is assumed to be an IRI referring to an ontology record, and will be used to fetch as much data as required/possible for that ontology record. The data will then be added as sub-fields to the ontology fields.
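
For illustration, the mapping for an annotation field might look something like this – a sketch only: the “ontology” field type is as described above, but the ontologyURI property name here is my assumption, and the actual property names are specified in the README:

{
  "properties": {
    "annotation": {
      "type": "ontology",
      "ontology": {
        "ontologyURI": "file:///path/to/ontology.owl"
      }
    }
  }
}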

The new data is not added to the _source field, which is the easy way of seeing what data is in a stored record. In order to retrieve the new data, you have two options:

  • Grab the mapping for your index, and look through it for the sub-fields of your annotation field. Use as many of these as you need to populate the fields property in your search request, making sure you name them fully (ie. annotation.uri, annotation.label, annotation.child_uris).
  • Add all of the fields to the fields property in your search request (ie. "fields": [ "*" ]).

What you cannot do is add “annotation.*” to your search request to get all of the annotation subfields. At this stage, this doesn’t work. I’m still working out whether this is possible or not.
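
So a search request naming the sub-fields explicitly looks something like this sketch, using the sub-field names mentioned above:

{
  "query": { "match_all": {} },
  "fields": [ "annotation.uri", "annotation.label", "annotation.child_uris" ]
}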

How it works

All of the versions work in a broadly similar fashion: the OntologyMapper class extends AbstractFieldMapper (Elasticsearch 1.x) or FieldMapper (Elasticsearch 2.x). The Mapper classes all have two internal classes:

  • a TypeParser, which reads the mapper’s configuration from the mapping details (as initially specified by the user, and as also returned from the Mapper.toXContent method), and returns…
  • a Builder, which constructs the mappers for the known sub-fields and ultimately builds the Mapper class. The sub-field mappers are all for string fields, with mappers for URI fields having tokenisation disabled, while the other fields have it enabled. All are both indexed and stored.

The Mapper parses the content of the initial field (the IRI for the ontology record), and adds the sub-fields to the record, as part of the Mapper.parse method call (this is the most significant part of the Mapper code). There are at least two ways of doing this, and the Elasticsearch source code has both depending on which Mapper class you look at. There is no indication in the source why you would use one method over the other. This helps with clarity, especially when things aren’t working as they should.

What makes life more interesting for the OntologyMapper class is that not all of the sub-fields are known at start time. If the user wishes to index additional relationships between nodes (“participates in”, “has disease location”, etc.), these are generated on the fly, and the sub-fields need to be added to the mapping. Figuring out how to do this, and also how to make sure those fields are returned when the user requests the mapping for the index, has been a particular challenge.

The TypeParser is called more than once during the indexing process. My initial assumption was that once the mapping details had been read from the user’s specification, the parser was “fixed,” and so you had to keep track of the sub-field mappers yourself. This is not the case. As noted above, the TypeParser can also be fed from the Mapper’s toXContent method (which generates the mapping seen when you call the _mapping endpoint). Elasticsearch versions 1.x didn’t seem to care particularly what toXContent returned, so long as it could be parsed without throwing a NullPointerException, but Elasticsearch versions 2.x actually check that all of the mapping configuration has been dealt with. This actually makes life easier internally – after the mapper has processed a record, at least some of the dynamic field mappings are known, so you can build the sub-field mappers in the Builder rather than having to build them on the fly during the Mapper.parse process.

The other non-trivial Mapper methods are:

  • toXContent, as mentioned several times already. This generates the mapping output (ie. the definition of the field as seen when you look via the _mapping endpoint).
  • merge, which seems to do a compatibility check between an incoming instance of the mapper and the current instance. I’ve added some checks to this, but no significant code. Several of the implementations of this method in the Elasticsearch source code simply contain comments to the effect of “will return to this later”, so it seems I’m not the only person who doesn’t understand how merge works, or why it is called.
  • traverse (Elasticsearch 1.x) and iterator (Elasticsearch 2.x), which seem to do similar things – namely providing a means to iterate through the sub-field mappers. In Elasticsearch 1.x, the traverse method is explicitly called as part of the process to add the new (dynamic) mappers to the mapping, but this isn’t a requirement for Elasticsearch 2.x. Elasticsearch 1.x distinguished between ObjectMappers and FieldMappers, which doesn’t seem to be a distinction in Elasticsearch 2.x.

Comparisons with the Solr plugin

The Solr plugin works somewhat differently to the Elasticsearch one. The Solr plugin is implemented as an UpdateRequestProcessor, and adds new fields directly to the incoming record (it doesn’t add sub-fields). This makes the returned data less tidy, but also easier to handle, since all of the new fields have the same prefix and can therefore be handled directly. You don’t need to explicitly tell Solr to return the new fields – because they are all stored, they are all returned by default.

On the other hand, you still have to jump through some hoops to work out which fields are dynamically generated, if you need to do that (i.e. to add checkboxes to a form to search “has disease location” or other relationships) – you need to call Solr to retrieve the schema, and use that as the basis for working out which are the new fields. For Elasticsearch, you have to request the mapping for your index, and use that in a similar way.

Configuration in Solr requires modifying the solrconfig.xml, once the plugin JAR file is in place, but doesn’t require any changes to the schema. All of the Elasticsearch configuration happens in the mapping definition. This reflects the different ways the two plugins are implemented. I don’t have a particular feeling for whether it would have been better to implement the Solr plugin as a new field type – I did investigate, and it seemed much harder to do this, but it might be worth re-visiting if there is time available.

The Solr plugin was much easier to write, simply because the documentation is better. The Solr wiki has a very useful base page for writing a new UpdateRequestProcessor, and the source code has plenty of comments and Javadoc (although it’s not perfect in this respect – SolrCoreAware has no documentation at all, has been present since Solr 1.3, and was a requirement for keeping track of the Ontology helper threads).

I will most likely update this post as I become aware of things I have done which are wrong, or any misinformation it contains. We’ll also be talking further about the BioSolr project at a workshop event on February 3rd/4th 2016. We welcome feedback and comments, of course – especially from the wider Elasticsearch developer community.

Lucene/Solr Revolution 2015: BioSolr – Searching the stuff of life
Fri, 16 Oct 2015

[Slides: BioSolr – Searching the stuff of life, Lucene/Solr Revolution 2015, from Charlie Hull]

Open source search events roundup for late 2015
Wed, 29 Jul 2015

Although it’s still high summer here in the UK (which means it’s probably raining) we’re already looking forward to the autumn and the events across the world we’re attending. In early September we’re running another free to attend London Lucene/Solr Usergroup Meetup, sponsored this time by Blackrock, who are talking about using Solr for websites. At the end of September there is another Elasticsearch London Meetup, which we will also attend (and may speak at this time).

October brings the biggest event in the Lucene/Solr calendar, Lucene Revolution in Austin, Texas, a 4-day event with training and a conference. We’re happy to announce that Alan Woodward and Matt Pearce from Flax will be presenting “Searching the Stuff of Life: BioSolr” about our work with the European Bioinformatics Institute where we’ve been developing Solr features for use by bioinformaticians (and any others who find them useful of course!), for example ontology indexing and external JOINs.

A week later we’ll be at Enterprise Search Europe, where I’ll be delivering the keynote on The Future of Search (you can see an earlier version of this talk from the IKO Singapore conference last month). We’re also running a Meetup on the evening of the 20th, open to both conference attendees and others – an informal chance to chat with other search folks. During the conference itself I’m particularly looking forward to hearing from Ian Williams of NHS Wales on Powering the Single Patient Record in NHS Wales with Apache Solr – this is a very large scale and exciting project using Solr for healthcare data.

Looking further ahead, in November I’m speaking on Test Driven Relevancy at Search Solutions 2015, a great one-day event in London which I highly recommend, and we are also planning to run a workshop on Taming Enterprise Search in Singapore together with a partner. As ever, do let us know if you would like to meet up at an event and talk open source search!

BioSolr at BOSC 2015 – open source search for bioinformatics
Mon, 13 Jul 2015

Matt Pearce writes:

I spent most of last Friday at the Bioinformatics Open Source Conference (BOSC) Special Interest Group meeting in Dublin, as part of this year’s ISMB/ECCB conference. Tony Burdett from EMBL-EBI was giving a quick talk about the BioSolr project, and I went along to speak to people at the poster session afterwards about what we are doing, and how other teams could get involved.

Unfortunately, I missed the first half of Holly Bik’s keynote (registration seemed to take forever, hindered by dubious wifi and a printer that refused to cooperate), which used the vintage Oregon Trail game as a great analogy for biologists getting into bioinformatics – there are many, frequently intimidating, options when choosing how to analyse data, and picking the right one can be scary (this is something that definitely applies to the areas we work in as well).

There was a new approach to the traditional Q&A session afterwards as well, with questions being submitted on cards around the room, and via a Twitter hashtag. This worked pretty well – although Twitter latency did slow things down a couple of times, and there were a few shouted-out questions from the floor, it was certainly better than having volunteers with microphones trying to reach the questioner across rows of people.

The morning session was on Data Science, and while a number of the talks went over my head somewhat, it was interesting to see how tools like Hadoop are being used in bioinformatics. It was good to see the spirit of collaboration in action too, with Sebastian Schoenherr’s talk about CloudGene, a project implementing a graphical front end for Hadoop which came about following an earlier BOSC. Tony’s talk about BioSolr went down well – a show of hands indicated that around 75% of the people in the room were using Lucene, Solr and/or Elasticsearch in some form. This backs up our earlier experience at the EBI, where the first BioSolr workshop was attended by teams from all over the campus, using Lucene or Solr in various versions to store and search their data.

Crossing over with lunch was the poster session, where Tony and I spoke to people about BioSolr. The Jalview team seemed especially interested in potential cross-over with their project, and there was plenty of interest generally in how the various extensions we have worked on (XJoin, hierarchical faceting) could be fitted into other projects.

The afternoon session was on the subject of Standards and Interoperability, starting with a great talk from Michael Crusoe about the Common Workflow Language, which started life at the BOSC 2014 codefest. There were several talks about Galaxy, a cloud-based platform for sharing data analyses, linking many other tools to allow workflows to be reproduced. Bruno Vieira’s talk about BioNode was also very interesting, and I made notes to check out oSwitch when time is available.

I had to leave before the afternoon’s panel took place, but all in all it was a very interesting day learning how open source software is being used outside of the areas I usually work in.

BioSolr begins with a workshop day
Thu, 02 Oct 2014

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr Sameer Velankar, and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood, but is a much easier and more full-featured system to work with).

After lunch and some networking, we gave a further short presentation comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions’ innovative method for indexing hierarchical data with Solr using XML. His talk described how, by encoding tree positions directly within the index, far fewer Solr documents need to be created, giving an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large-scale bioinformatics search platform backed by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language and standard result schemas, and have also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development, as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.

Next were a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation and best practices, guidance on migration and scaling, and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.

BioSolr – building better search for bioinformatics
Wed, 11 Jun 2014

The entire Flax technical team spent the day at the European Bioinformatics Institute yesterday discussing an exciting new project we’ll begin this coming September, BioSolr. Funded by the BBSRC, this collaboration between Flax and the EBI aims “to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software”. Here we are with Dr Sameer Velankar and Gautier Koscielny of the EBI.

The EBI, located on the Wellcome Trust Genome Campus near Cambridge, maintains the world’s most comprehensive range of freely available and up-to-date molecular databases and is already using Apache Lucene/Solr extensively, for example in the Protein Databank in Europe which indexes over 100,000 items derived from experimental research – but this is just one of the many complex collections they provide. The BioSolr project will run for a full year, during which members of the Flax team will work directly with the EBI team to run workshops, demonstrate and document best practices in search application design, create, improve and extend open source software, and learn a lot about the specialist search requirements of bioinformatics. This is a fantastic opportunity for us to push the boundaries of what is possible with Solr and associated software, to work with some incredibly rich data and to do all of this in the open to encourage collaboration from the wider software and biology communities.

We’ll be creating various open resources (software repositories, Wikis, blogs) to support the project later this year – do let us know if you would like to be involved and we will keep you informed.
