Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem

We've been working on a number of projects recently involving open source software often quoted as 'Big Data' solutions - here's a quick overview of them. The grandfather of them all of course is Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly Hadoop was originally created by Doug Cutting, who also wrot...Continue reading

Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others

Over the last 18 months we've been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. ...Continue reading

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr

BioSolr_logo

          Over the last 18 months we've been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Continue reading

The fun and frustration of writing a plugin for Elasticsearch for ontology indexing

As part of our work on the BioSolr project, I have been continuing to work on the various Elasticsearch ontology annotation plugins (note that even though the project started with a focus on Solr - thus the name - we have also been developing some features for Ela...Continue reading

XJoin for Solr, part 1: filtering using price discount data

In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I'll show how we can use this to solve a common problem in e-commerce - how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

...Continue reading

Elasticsearch vs. Solr: performance improvements

I had been planning not to continue with these posts, but after Matt Weber pointed out the github pull requests (which to my embarrassment I'd not even noticed) he'd made to address some methodological flaws, another attempt was the least I could do. For Solr there was a slight reduction in mean search time, from 39ms (for my original, suboptimal query structure) to 34ms and median search time from 27ms to 25ms - see figure 1. Elasticsearch, on the ...Continue reading

Elasticsearch London Meetup: Templates, easy log search & lead generation

After a long day at a Real Time Analytics event (of which more later) I dropped into the Elasticsearch London User Group, hosted by Red Badger and provided with a ridiculously huge amount of pizza (I have a theory that you'll be able to spot an Elasticsearch developer in a few years by the size of their pizza-filled belly). ...Continue reading

Solr geolocation searches using WKT – latitude or longitude first?

Matt Pearce writes: We have been working with a client who needs to search for documents based on location, either using a single point or (sometimes very) complex polygons. They supplied the location data in WKT format which we assumed we could feed directly into our search engine (in this case Solr) without any modifications being necessary. Then we started testing the location se...Continue reading

Enterprise Search Europe 2014 day 2 – futures, text mining and images

Staying over in London due to the aforementioned tube strike proved to be a good idea and a large fried breakfast an even better one, so I arrived at the second day of the conference right on time and ready for the second day's keynote by Jeff Fried of BA Insight and Professor Elaine Toms from Sheffield University, who hadn't met before the event but spoke in turn on the Future ...Continue reading