scaling – Flax: The Open Source Search Specialists

Activate 2018 day 2 – AI and Search in Montreal (7 November 2018)

I’ve already written about Day 1 of Lucidworks’ Activate conference; the second day started with a keynote on ‘moral code’, ethics & AI which unfortunately I missed, but a colleague reported that it was very encouraging to see topics such as diversity and inclusion raised in a keynote talk. Note that videos of some of the talks are starting to appear on Lucidworks’ Youtube channel.

Steve Rowe of Lucidworks gave a talk on what’s coming in Lucene/Solr 8 – a long list of improvements and new features from 7.x releases including autoscaling of SolrCloud clusters, better cross-datacentre replication (CDCR), time routed index aliases for time-series data, new replica types, streaming expressions, a JSON query DSL and better segment merge policies – it’s clear that a huge amount of work continues to go into Solr. In 8.x releases we’ll hopefully see HTTP/2 capability for faster throughput and perhaps Luke, the Lucene Index Toolbox, becoming part of the main project.
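For anyone who hasn’t tried it, the JSON query DSL lets you POST a structured JSON body to Solr rather than assembling long URL parameter strings. Here’s a minimal sketch using Java 11’s HttpClient – the ‘articles’ collection and field names are invented for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JsonQueryDslExample {
    public static void main(String[] args) throws Exception {
        // Collection name ('articles') and fields are invented for illustration.
        String body = "{"
                + " \"query\": \"title:solr\","
                + " \"filter\": [\"published:[2018-01-01T00:00:00Z TO *]\"],"
                + " \"fields\": [\"id\", \"title\"],"
                + " \"limit\": 10"
                + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8983/solr/articles/query"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```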

Cassandra Targett, also of Lucidworks, spoke about the Lucene/Solr Reference Guide which is now actually part of Solr’s source code in Asciidoc format. She had attempted to build this into a searchable, fully-hyperlinked documentation source using Solr itself but this quickly ran into issues with HTML tags and maintaining correct links. Lucidworks’ own Site Search did a lot better but the result still wasn’t perfect. Work remains to be done here but encouragingly in the last few weeks there’s also been some thinking about how to better document Solr’s huge and complex test suite on SOLR-12930. As Cassandra mentioned, effective documentation isn’t always the focus of Solr committers, but it’s essential for Solr users.

The next talk I caught came from Andrzej Bialecki on Solr’s autoscaling functionality and some impressive testing he’s done. Autoscaling analyzes your Solr cluster and makes suggestions about how to restructure it – which you can then do manually or automatically using other Solr features. These features are generally tested on collections of 1 billion documents – but Andrzej has manually tested them on 1 trillion simulated documents (yes, you read that right). Now that’s some scale!

The final talk I caught before the closing keynote was Chris ‘Hossman’ Hostetter on How to be a Solr Contributor, amusingly peppered with profanity, as is his usual style. There were a number of us in the room with some small concerns about Solr patches that have not been committed, and in general about how Solr might need more committers and how this might happen, but the talk mainly focused on how to generate new patches. He also mentioned how new features can have an unexpected cost, as they must then be maintained and might have totally unexpected consequences for other parts of the platform. Some of the audience raised questions about Solr tests (some of which regularly fail) – however, since the conference Mark Miller has taken the lead on this under SOLR-12801, which is encouraging.

The closing keynote by Trey Grainger brought together the threads of search and AI – and also mentioned that if anyone had some spare server capacity, it would be fun to properly test Solr at trillion-document scale…

So in conclusion, how did Activate compare to its previous incarnation as Lucene/Solr Revolution? Is search really the foundation of AI? Well, the talks I attended mainly focused on Solr features, but various colleagues heard about machine learning, learning-to-rank and self-aware machines, all of which are becoming easier to implement using Lucene/Solr. However, as Doug Turnbull writes, if you’re thinking of AI for search you should be wary of the potential cost and complexity. There are no magic robots (Kevin Watters’ robot, however, is rather wonderful!).

Huge thanks must go to all at Lucidworks for putting on such a well-organised and thought-provoking event and bringing together so many Lucene/Solr enthusiasts.

London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco (8 February 2018)

This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you!

Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how early Java 7 releases had broken Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and Lucene PMC worked closely together to improve both communications and testing – with regular builds of Java 8 (using Jenkins) being released to test with Lucene. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Lucene’s MMapDirectory should be used on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 that he mentioned were lower garbage collection times (with the new G1GC collector), more security features and in some cases better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.
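As a quick illustration of the MMapDirectory advice, here’s a minimal sketch of opening an index explicitly with MMapDirectory and running a query – the index path and field names are placeholders, and on 64-bit JVMs FSDirectory.open() will normally pick MMapDirectory for you anyway:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.MMapDirectory;

public class MMapSearchExample {
    public static void main(String[] args) throws Exception {
        // Explicitly memory-map the index; FSDirectory.open() does this
        // automatically on 64-bit platforms.
        try (MMapDirectory dir = new MMapDirectory(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("text", new StandardAnalyzer()).parse("lucene");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```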

Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obvious Austin Powers references, of course! He described the architecture Alfresco use for search (a recent blog also shows this – interestingly, although Solr is used, ZooKeeper is not – Alfresco uses its own method to handle many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure – this allowed queries to be built to hit particular areas of the data. 5,000 users were set up, with around 500 concurrent users assumed. The test system managed to index the content in around 5 days at a speed of around 1,000 documents a second, which is impressive.

Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.

Lucene Revolution 2016, Boston (26 October 2016)

After our two successful hackdays, it was on to the main event of the week and the largest open source search event of the year. In between catching up with other Lucene/Solr folks on the first day I enjoyed Chris ‘Hossman’ Hostetter’s talk on Hidden Gems of Apache Solr with some great tips on obscure Solr query syntax, and Bloomreach’s fast-paced talk on the SolrCloud Rebalance API which allows one to autoscale large Solr systems (although this feature isn’t quite available yet in Solr 6, we’re promised it’s being worked on). I then had a pleasant partner lunch with Lucidworks and heard about some exciting developments for their Fusion search platform – we can expect to see version 3 of this Solr-based product soon and I’ll be blogging more about this in coming months. Dragan Milosevic’s talk on how aggregation performance compares between various systems including Elasticsearch and Solr was slightly hamstrung by his laptop failing, but he bravely carried on and led us to the (unsurprising) conclusion that both perform pretty well but working with HBase directly can be faster due to its single index. The day finished with a party held on the 50th floor of a nearby building, with fantastic views over the city.

On the second day I caught Scott Blum’s talk on another autoscaling strategy for Solr. His team have a 13-billion-document index running on 6,000 Solr cores across 32 nodes, and have had issues with JVM garbage collection which they initially solved with a manual reload of each core every few hours. They eventually built Solrman, which can automatically balance Solr cores across a set of nodes – this is a solid alternative to the Rebalance API and we’ll be looking at both for some of our clients soon. We followed Scott with our talk on Coffee, Danish and Search (Powerpoint slides are here and video here) about our work on a multilingual media monitoring system for Infomedia, which was well received. Having stepped out for a minute I was unable to get back in the room for Kevin Watters’ talk on the Solr Graph Query, which was extremely popular! Grant Ingersoll finished the conference with a closing keynote and we then headed for the airport.

Thanks to Lucidworks and the conference sponsors for another great event – it felt busier than previous years and this is more evidence of the ongoing healthy state of the Lucene/Solr community. We’ll be back!

Running out of disk space with Elasticsearch and Solr: a solution (21 April 2016)

We recently did a proof-of-concept project for a customer which ingested log events from various sources into a Kafka – Logstash – Elasticsearch – Kibana stack. This was configured with Ansible and hosted on about a dozen VMs inside the customer’s main network.

For various reasons resources were tight. One problem which we ran into several times was running out of disk space on the Elasticsearch nodes (this was despite setting up Curator to delete older indexes, and increasing the available storage as much as possible). Like most software, Elasticsearch does not always handle this situation gracefully, and we often had to ssh in and manually delete index files to get the system working again.

As a result of this experience, we have written a simple proxy server which can detect when an Elasticsearch or Solr cluster is close to running out of storage, and reject any further updates with a configurable error (503 Service Unavailable would seem to be the most appropriate) until enough space is freed up for indexing to continue. We call this Hara Hachi Bu, after the Confucian teaching to only eat until you are 80% full. It is available to download on GitHub and has the Apache 2.0 license. This is a very early release and we would welcome feedback or contributions. Although we have tested it with Elasticsearch and Solr, it should be adaptable to any data store with a RESTful API.

Technical stuff

The server is implemented using DropWizard (version 0.9.2), a framework we’ve used a lot for its ease of use and configurability. It is intended to sit between an indexer and your search engine (or a similar disk-based data store), and will check that disk space is available when requesting certain endpoints. If the disk space is less than a configured threshold value, the request will be rejected with a configurable HTTP status code.

There are disk space checkers for Elasticsearch (using the /_cluster/stats endpoint), a local Solr installation, or a cluster of hosts. If using a cluster, each machine is required to regularly post its disk space to the application. Custom implementations can also be added, by implementing the DiskSpaceChecker interface.
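As a rough illustration of what such a checker does (this is a guess at the shape, not the actual DiskSpaceChecker interface from the Hara Hachi Bu codebase, whose method names may differ), a local-disk checker might look something like this:

```java
import java.io.File;

/**
 * Illustrative sketch only: the real DiskSpaceChecker interface lives in the
 * Hara Hachi Bu repository and its method names may differ from this guess.
 */
public class LocalDiskSpaceChecker {

    private final File path;
    private final long thresholdBytes;

    public LocalDiskSpaceChecker(String path, long thresholdBytes) {
        this.path = new File(path);
        this.thresholdBytes = thresholdBytes;
    }

    /** Returns true while the partition holding the index has space to spare. */
    public boolean isSpaceAvailable() {
        return path.getUsableSpace() > thresholdBytes;
    }
}
```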

The trickiest part of the implementation was to allow DropWizard endpoints through without them being proxied. We did this by implementing both a filter and a servlet – the filter looks out for locally known endpoints and passes them straight through, while unknown endpoints have a /proxy prefix added to the URL path and are then caught by the proxy servlet. The filter also carries out the disk space check on URLs in the check list, allowing them to be rejected before reaching the servlet. (If you’ve come up with a different solution to this problem, we’d be interested to hear about it.)

The proxy was implemented by extending the Jetty ProxyServlet (http://www.eclipse.org/jetty/documentation/current/proxy-servlet.html) – this allowed us to override a single method in order to implement our proxy, stripping off the /proxy prefix and redirecting the request to the configured host and port.
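A rough sketch of that kind of override follows – this is not the actual Flax code, the backend address is a placeholder, and method names differ between Jetty versions (rewriteURI() in older 9.x releases, rewriteTarget() later):

```java
import javax.servlet.http.HttpServletRequest;
import org.eclipse.jetty.proxy.ProxyServlet;

/**
 * Sketch of the approach described above. Recent Jetty versions expose
 * rewriteTarget(); older 9.x releases use rewriteURI() instead.
 */
public class SearchProxyServlet extends ProxyServlet {

    private static final String BACKEND = "http://localhost:9200"; // placeholder host:port

    @Override
    protected String rewriteTarget(HttpServletRequest clientRequest) {
        // Strip the internal /proxy prefix and forward the rest of the path.
        String path = clientRequest.getRequestURI().replaceFirst("^/proxy", "");
        String query = clientRequest.getQueryString();
        return BACKEND + path + (query == null ? "" : "?" + query);
    }
}
```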

Internally, the application will build the DiskSpaceChecker defined in the configuration. DropWizard resources (or endpoints) and health checks are added depending on the implementation, with a default, generic health check which simply checks whether or not disk space is currently available. The /setSpace resource is only available when using the clustered configuration, for example.

Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others (15 February 2016)

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other. Read about the first day here.

The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up, especially in analytics and visualisation. Eric foresees Elasticsearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tummarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets and I was impressed with the range of potential applications, including in the life sciences.

Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Although both Solr and CLucene had been considered, they eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.

Next up was Nikos Marinos from the EBI Literature Services team talking about their recent migration from Lucene to Solr. As he explained most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-use of software where possible – key strengths of open source of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.

We then concluded with another hands-on session, more aimed at Elasticsearch this time. As you can probably tell we had been shown a huge variety of different search needs and solutions using a range of technologies over the two days and it was clear to me that the BioSolr project is only a small first step towards improving the software available – we have applied for further funding and we hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.

Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), the EBI for kindly hosting the event and in particular to Dr Sameer Velankar who has been the driving force behind this project.

Out and about in search & monitoring – Autumn 2015 (16 December 2015)

It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard to scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked-in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.

The week after I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab’s research into complex search strategies and Digirati and Synaptica’s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

Enterprise Search Europe 2015: Fishing the big data streams – the future of search (28 October 2015)

[Slides: ‘Fishing the big data streams – the future of search’, presented by Charlie Hull at Enterprise Search Europe 2015]

Elasticsearch Meetup – Spark, postcodes and Couchbase (25 September 2015)

Three speakers for this month’s Elasticsearch Meetup (slides now up), kindly hosted by JustEat’s technical department. Neil Andrassy kicked us off with a talk about how TheFilter (which you may know counts Peter Gabriel as an investor) use Apache Spark to load data into their Elasticsearch cluster. Neil described how Spark and Elasticsearch have superseded both Microsoft SQL and MongoDB – Spark in particular being described as ‘speedy, flexible and componentized’, with Spark’s RDDs (Resilient Distributed Datasets) mapping cleanly to Elasticsearch shards. He then showed a demo of UK road accident data being imported into Spark as CSV files, indexed automatically in Elasticsearch and then queried both directly in Elasticsearch and via Spark’s SQL-like facility. Interestingly, this allows a powerful combination of free text search and relational JOINs to be applied to data in a highly scalable fashion – Spark also features machine learning and streaming data components.
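Roughly, that kind of pipeline looks like the sketch below, using the elasticsearch-hadoop connector’s JavaEsSpark helper – the CSV layout, field names and index name are invented for illustration, and the actual demo may well have been structured differently:

```java
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

public class AccidentsToEs {
    public static void main(String[] args) {
        // "es.nodes" points at the Elasticsearch cluster; index name is a placeholder.
        SparkConf conf = new SparkConf()
                .setAppName("accidents-to-es")
                .setMaster("local[*]")
                .set("es.nodes", "localhost:9200");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Parse each CSV line into a field->value map (header handling omitted).
            JavaRDD<Map<String, String>> accidents = sc.textFile("accidents.csv")
                    .map(line -> {
                        String[] cols = line.split(",");
                        return Map.of("date", cols[0], "severity", cols[1], "location", cols[2]);
                    });
            JavaEsSpark.saveToEs(accidents, "accidents/doc");
        }
    }
}
```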

After a quick plug for ElastiCON in London in November, Matt Jones of JustEat described how they have used Elasticsearch’s geolocation search function to improve their handling of restaurant delivery areas. Their previous system only handled the first part of postcodes (e.g. ‘SE1’) and they needed finer-grained control of the areas that restaurants were able to deliver to. By indexing polygons representing UK postcode areas and combining these with custom shapes (i.e. a circle representing a maximum delivery distance) they have created a powerful and extendable way to restrict search results. Matt has blogged about this in more detail.
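As a sketch of the kind of query this enables (index and field names invented, not JustEat’s actual mapping): with each restaurant’s delivery area indexed as a geo_shape, a single geo_shape query finds every restaurant whose area contains a customer’s location:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeliveryAreaQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical index/field names: each restaurant document has a geo_shape
        // field called delivery_area holding its delivery polygon.
        String body = "{"
                + "\"query\": {"
                + "  \"geo_shape\": {"
                + "    \"delivery_area\": {"
                + "      \"shape\": { \"type\": \"point\", \"coordinates\": [-0.09, 51.5] },"
                + "      \"relation\": \"intersects\""
                + "    }"
                + "  }"
                + "}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/restaurants/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```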

The last talk was by Tom Green of Couchbase, who described how this powerful NoSQL platform is architected and how it can be connected directly to Elasticsearch using its own Cross Data Centre Replication (XDCR) feature. We finished with the usual Q&A during which Mark Harwood responded to my own question on exact facet counts in Elasticsearch with a plea to the industry to be more honest about the limitations of distributed systems – much like the CAP theorem, perhaps we need a similar triangle with vertices of Big Data, Speed and Accuracy – pick two!

Thanks as ever to all the speakers and the hosts, and to Yann Cluchey for organising the Meetup.

A new Meetup for Lucene & Solr (1 December 2014)

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers, but among the 25 or so attending we were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert, started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks’ own repository. It’s great to see effort being put into these kinds of tests, as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.
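For readers new to Luwak, the basic pattern – adapted from the library’s own examples, so exact class and method names may differ between releases – is to register stored queries with a Monitor and then stream each incoming document through it:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import uk.co.flax.luwak.DocumentMatches;
import uk.co.flax.luwak.InputDocument;
import uk.co.flax.luwak.Matches;
import uk.co.flax.luwak.Monitor;
import uk.co.flax.luwak.MonitorQuery;
import uk.co.flax.luwak.QueryMatch;
import uk.co.flax.luwak.matchers.SimpleMatcher;
import uk.co.flax.luwak.presearcher.TermFilteredPresearcher;
import uk.co.flax.luwak.queryparsers.LuceneQueryParser;

public class StoredQuerySketch {
    public static void main(String[] args) throws Exception {
        // The Monitor holds the stored queries; the presearcher narrows down which
        // of them need to be run in full against each incoming document.
        try (Monitor monitor = new Monitor(new LuceneQueryParser("text"),
                                           new TermFilteredPresearcher())) {
            monitor.update(new MonitorQuery("brand-alert", "text:\"acme widgets\""));

            InputDocument doc = InputDocument.builder("story-1")
                    .addField("text", "Acme Widgets announced record profits today",
                              new StandardAnalyzer())
                    .build();

            Matches<QueryMatch> matches = monitor.match(doc, SimpleMatcher.FACTORY);
            for (DocumentMatches<QueryMatch> docMatches : matches) {
                for (QueryMatch match : docMatches) {
                    System.out.println(docMatches.getDocId()
                            + " matched stored query " + match.getQueryId());
                }
            }
        }
    }
}
```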

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

BioSolr begins with a workshop day (2 October 2014)

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr. Sameer Velankar and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood, but is a much easier and more fully-featured system to work with).

After lunch and some networking, we gave a further short presentation on comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions’ innovative method for indexing hierarchical data with Solr using XML. His talk mentioned how by encoding tree positions directly within the index, far fewer Solr documents need to be created, with an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large scale bioinformatics search platform backed by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language, standard result schemas and also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.
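For context on why JOIN support matters here, standard Solr already ships a basic join query parser (the Heliosearch work extends this). A sketch using SolrJ, with invented core and field names:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SolrJoinExample {
    public static void main(String[] args) throws Exception {
        // Core and field names are invented for illustration.
        try (HttpSolrClient solr =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/proteins").build()) {
            // Find protein documents whose linked gene documents match the inner query.
            SolrQuery query = new SolrQuery("{!join from=gene_id to=id}organism:human");
            query.setRows(10);
            QueryResponse response = solr.query(query);
            response.getResults().forEach(doc -> System.out.println(doc.get("name")));
        }
    }
}
```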

Next were a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation, best practices, guidance on migration and scaling and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.
