As part of our BioSolr project, we’ve been discussing how best to create a federated search over several Apache Solr instances. In this case various research institutions across the world are annotating data objects representing proteins and it would be useful to search not just the original protein data, but what others have added to the body of knowledge. If an institution wants to use the annotations, the usual approach is to download the extra data regularly and add it into a local Solr index.
Luckily Solr is widely used in the bioinformatics community so we have commonality in the query API. The question is would it be possible to use some of the distributed querying capabilities of SolrCloud to search not just the shards of a single index, but a group of Solr/SolrCloud indices – a supercluster.
This is a bit like a standard federated search, where queries are farmed out to various disparate search engines and the results then combined and displayed in some fashion. However, since we are sharing a single technology, powerful features such as result grouping would be possible.
For this to work at all, there would need to be some agreed standard between the various Solr systems: a globally unique record identifier for example (possibly implemented with a prefix unique to each institution). Any data that was required for result grouping would have to share a schema across the entire supercluster – let’s call this the primary schema – but basic searching and faceting could still be carried out over data with a differing, secondary schema. Solr dynamic fields might be useful for this secondary schema.
Luckily, research institutions are used to working as part of a consortium, and one of the conditions for joining would be agreeing to some common standards. A single Solr query API would then be available to all members of the consortium, to search not just their own data but everything available from their partners, without the slow and error-prone process of copying the data for local indexing.
We’re currently evaluating the feasibility of this idea and would welcome input from others – let us know what you think in the comments!