Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr

 

 

 

 

 

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other.

The day started with a quick recap of the project from myself and Dr. Sameer Valenkar of the EBI. Eric Pugh, founder of Flax’s US partners Open Source Connections, followed with his Unofficial State of Solr, detailing the history of the project, recent innovations and what might happen in the future, including some very interesting new features allowing for parallel SQL queries. We then heard from Flax team members Tom Winch and Matt Pearce on how they have built faceting improvements, a new XJoin between Solr and external systems, researched federated search and developed ontology indexers (note that all of the software they’ve built is available as open source, and Tom has recently written extensively about XJoin).

After lunch we heard from Peter Meric of the NCBI (the US equivalent of the EBI) on a Solr-based system for searching gene data, to supplement the NCBI’s homegrown Entrez system. This is very much a filtered search rather than a text search and indexes around 330m records. He also talked about a High Availability prototype of a replacement for the very high traffic PubMed service built on Amazon Web Services. Each Solr, MongoDB or Zookeeper node ‘announces’ itself using a monitor service and then replicates data from a master node. Although it is not yet available as open source I think this project may be of great interest to the wider Solr community and I hope we hear more of it soon.

Next up was a brief talk by Dan Bolser of the EBI on an ‘old school’ scheme for sharding plant phenotype data – I’d seen part of this presentation before and it’s linked to our own ideas on federating search across bioinformatics data. Dan was followed by Lewis Geer of NCBI talking about the SEQR protein similarity search engine built on Solr. Although somewhat complex for us non-biologists to understand, this very clever system relies on experimental results to suggest which of the possible variants of a protein system are likely, and adds these to the Solr index – it reminded me of a similar approach we’ve used to store possible OCR errors when working with scanned newsprint. His team’s code is available. Dan Stainer of the Ensembl project was next discussing how his team are indexing tens of thousands of genomes from thousands of species, currently on a MySQL backend with a REST API and a lot of Perl. He discussed how they have been experimenting with Elasticsearch to index around 3.2bn items, creating a 782GB index which builds in around 5-6 hours, to provide new capabilities such as structured queries for their genome browser tools.

We then held an interactive hands-on session, covering subjects such as ‘getting started with Solr’ and exploring some of the code we’ve built such as XJoin, followed by a conference dinner in Hinxton Hall. It was clear that there is a huge range of use cases for search technology in the life sciences community and almost as many different ways to address them, and the after-dinner conversation was lively and highly interesting!

Most of the presentations are now available for download and we’ve also written about the second day of the event, where we shifted focus onto Elasticsearch and other technologies.

Leave a Reply

Your email address will not be published. Required fields are marked *