Activate 2018 day 2 – AI and Search in Montreal

Charlie Hull — Wed, 07 Nov 2018 12:09:38 +0000

I’ve already written about Day 1 of Lucidworks’ Activate conference. The second day started with a keynote on ‘moral code’, ethics & AI, which unfortunately I missed, but a colleague reported that it was very encouraging to see topics such as diversity and inclusion raised in a keynote talk. Note that videos of some of the talks are starting to appear on Lucidworks’ YouTube channel.

Steve Rowe of Lucidworks gave a talk on what’s coming in Lucene/Solr 8. He began with a long list of improvements and new features from the 7.x releases, including autoscaling of SolrCloud clusters, better cross-datacentre replication (CDCR), time routed index aliases for time-series data, new replica types, streaming expressions, a JSON query DSL and better segment merge policies. It’s clear that a huge amount of work continues to go into Solr. In 8.x releases we’ll hopefully see HTTP/2 capability for faster throughput and perhaps Luke, the Lucene Index Toolbox, becoming part of the main project.

Cassandra Targett, also of Lucidworks, spoke about the Lucene/Solr Reference Guide which is now actually part of Solr’s source code in Asciidoc format. She had attempted to build this into a searchable, fully-hyperlinked documentation source using Solr itself but this quickly ran into issues with HTML tags and maintaining correct links. Lucidworks’ own Site Search did a lot better but the result still wasn’t perfect. Work remains to be done here but encouragingly in the last few weeks there’s also been some thinking about how to better document Solr’s huge and complex test suite on SOLR-12930. As Cassandra mentioned, effective documentation isn’t always the focus of Solr committers, but it’s essential for Solr users.

The next talk I caught came from Andrzej Bialecki on Solr’s autoscaling functionality and some impressive testing he’s done. Autoscaling analyzes your Solr cluster and makes suggestions about how to restructure it – which you can then do manually or automatically using other Solr features. These features are generally tested on collections of 1 billion documents – but Andrzej has manually tested them on 1 trillion simulated documents (yes, you read that right). Now that’s some scale!

The final talk I caught before the closing keynote was Chris ‘Hossman’ Hostetter on How to be a Solr Contributor, amusingly peppered with profanity as is his usual style. A number of us in the room had some small concerns about Solr patches that have not been committed, and more generally about whether Solr needs more committers and how this might happen, but the talk mainly focused on how to generate new patches. He also mentioned that new features can carry an unexpected cost, as they must then be maintained and might have totally unexpected consequences for other parts of the platform. Some of the audience raised questions about Solr tests (some of which regularly fail) – however, since the conference Mark Miller has taken the lead on this under SOLR-12801, which is encouraging.

The closing keynote by Trey Grainger brought together the threads of search and AI – and also mentioned that if anyone had some spare server capacity, it would be fun to properly test Solr at trillion-document scale…

So, in conclusion, how did Activate compare to its previous incarnation as Lucene/Solr Revolution? Is search really the foundation of AI? Well, the talks I attended mainly focused on Solr features, but various colleagues heard about machine learning, learning-to-rank and self-aware machines, all of which are becoming easier to implement using Lucene/Solr. However, as Doug Turnbull writes, if you’re thinking of AI for search, you should be wary of the potential cost and complexity. There are no magic robots (Kevin Watters’ robot, however, is rather wonderful!).

Huge thanks must go to all at Lucidworks for putting on such a well-organised and thought-provoking event and bringing together so many Lucene/Solr enthusiasts.

The post Activate 2018 day 2 – AI and Search in Montreal appeared first on Flax.

Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018

Charlie Hull — Mon, 18 Jun 2018 13:53:27 +0000

I spent last week in a sunny Berlin for the Berlin Buzzwords event (and subsequently MICES 2018, of which more later). This was my first visit to Buzzwords, which was held in an arts & culture complex in an old brewery north of the city centre. The event was larger than I was expecting, at around 550 people with three main tracks of talks. Although some external meetings meant I didn’t attend as many talks as I would have liked, here are a few highlights. Many of the talks have slides provided and some are now also available on the Buzzwords YouTube channel.

Giovanni Fernandez-Kincade talked about query understanding to improve both recall and precision for searches. He made the point that users and documents often speak very different languages, which can lead to a lack of confidence in the search engine. Various techniques are available to attempt to translate the user’s intention into a suitable query, and these can be placed on a spectrum from human-powered (e.g. creating an exception list to prevent stemming of proper nouns) through some degree of automation (e.g. harvesting data to build lists of synonyms) to full automation (machine learning of how queries map to documents). Obviously these also fit on other scales, from labour-intensive to hands-off and from easy to hard in terms of the technology skills required. This talk gave a solid base understanding of the techniques available.
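To make the human-powered end of that spectrum concrete, here is a minimal sketch of an exception list that protects proper nouns from stemming, followed by a hand-built synonym expansion. It is not taken from the talk: the crude `naive_stem` function, the `PROTECTED` set and the `SYNONYMS` map are all hypothetical stand-ins for a real stemmer and real curated lists.

```python
def naive_stem(token):
    """Crude suffix stripper, standing in for a real stemmer (illustrative only)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when a reasonable stem (3+ characters) would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Hypothetical, hand-maintained lists (assumptions for this example).
PROTECTED = {"mercedes"}             # proper nouns that must not be stemmed
SYNONYMS = {"laptop": ["notebook"]}  # stem -> extra terms to search for


def analyze(query):
    """Lowercase, stem unless protected, then append any synonyms."""
    terms = []
    for token in query.lower().split():
        stemmed = token if token in PROTECTED else naive_stem(token)
        terms.append(stemmed)
        terms.extend(SYNONYMS.get(stemmed, []))
    return terms


print(analyze("Mercedes laptops"))
```

Without the exception list the toy stemmer would reduce ‘mercedes’ to ‘merced’, which is exactly the kind of mismatch between the user’s language and the index that the list is there to prevent.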

I dropped in on Suneel Marthi’s talk on detecting tulip fields from satellite images, which was fascinating although outside my usual area of search engine technology. I then heard Nick Burch describe the many ways that text extraction powered by Apache Tika can crash your JVM or even your entire cluster (potentially expensive in an elastically-scaling situation, as more resources are automatically allocated!). As he recommended, one should expect failure and plan accordingly, ship logs somewhere central for analysis and never run Tika inside your Solr instance itself in a production system (a recommendation that has finally made it to the Solr Wiki).

Doug Turnbull and Tommaso Teofili then spoke on The Neural Search Frontier, a wide-ranging and in some places somewhat speculative discussion of techniques to improve ranking using word embeddings, where words are represented as multidimensional vectors. This approach combines traditional IR techniques with neural models to learn whether a document is relevant to a query. One fascinating idea was the use of recurrent neural networks, much used in translation applications, to ‘translate’ a document to a predicted query. As with most of Doug’s talks this gave us a lot to think about, but he finished with a plea for better native vector support in Lucene-based search engines.
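As a rough illustration of what combining traditional IR with embeddings can look like, here is a minimal sketch that re-ranks keyword results by blending each document’s keyword score with the cosine similarity between query and document vectors. This is my own toy example rather than anything shown in the talk: the `rerank` function, the `weight` parameter and the assumption that keyword scores are already normalised to the range 0 to 1 are all illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(query_vec, candidates, weight=0.5):
    """Blend keyword and vector scores, best first.

    candidates: list of (doc_id, keyword_score, doc_vec), where keyword_score
    is assumed to be normalised to [0, 1] (an assumption of this sketch).
    """
    scored = [
        (doc_id, (1 - weight) * kw + weight * cosine(query_vec, vec))
        for doc_id, kw, vec in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up two-dimensional vectors, purely for illustration.
results = rerank(
    [1.0, 0.0],
    [("a", 0.9, [0.0, 1.0]), ("b", 0.6, [1.0, 0.0])],
)
print(results)
```

Here document ‘b’ overtakes ‘a’ despite its lower keyword score, because its vector points the same way as the query. Doing this efficiently across a whole index, rather than over a handful of candidates, is where native vector support in the engine would matter.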

The next talk I heard was from Varun Thacker on Solr autoscaling, which I know is a particular concern of some of our clients as their data volumes grow. These new features in Solr version 7 allow policies and preferences to be set up to govern autoscaling behaviour, so that shards may be moved and new cores created automatically based on metrics such as disk space or queries-per-second. One interesting line of questioning from the audience was how to prevent replicas from ‘ping ponging’ between hosts – e.g. moving from a node with low disk space to one with more disk space, but then causing a reduction in disk space on the target node, leading to another move. Usefully, the autoscaling system can be set to compute a list of operations but leave execution to a human operator, which may help prevent this problem.
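The ‘ping pong’ problem is easy to reproduce in a toy model. The sketch below is not Solr’s autoscaling algorithm, just a hypothetical two-node simulation: a naive rule moves a replica whenever another node currently has more free disk, while a ‘lookahead’ rule only moves it if the target would still have more free disk than the source after the move. The function name, node names and sizes are all invented for the example.

```python
def plan_moves(free, location, size, steps, lookahead):
    """Plan replica moves over a number of evaluation rounds (toy model only).

    free:      dict of node name -> free disk (GB) with the replica where it is now
    location:  node currently holding the replica
    size:      replica size (GB)
    lookahead: if True, compare free disk *after* the move, not before
    Returns the planned moves as a list of (from_node, to_node).
    """
    free = dict(free)  # work on a copy; the caller's dict is left untouched
    moves = []
    for _ in range(steps):
        target = max(free, key=free.get)
        if target == location:
            break  # the replica is already on the node with most free disk
        source_after = free[location] + size
        target_after = free[target] - size
        if lookahead:
            worthwhile = target_after > source_after
        else:
            worthwhile = free[target] > free[location]
        if not worthwhile:
            break
        free[location], free[target] = source_after, target_after
        moves.append((location, target))
        location = target
    return moves


nodes = {"A": 20, "B": 50}  # made-up free disk figures in GB
naive = plan_moves(nodes, "A", size=20, steps=4, lookahead=False)
careful = plan_moves(nodes, "A", size=20, steps=4, lookahead=True)
print(naive)
print(careful)
```

With these figures the naive rule bounces the replica between A and B on every round, while the lookahead rule plans no move at all. Because the function only returns a plan, it also mirrors the ‘compute the operations but let a human execute them’ mode mentioned above: an operator reviewing the naive plan would spot the oscillation immediately.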

The next day I attended Tomás Fernández Löbbe’s talk on new replica types in Solr 7, which covered the advantages of the ‘Master/Slave’ model for search cluster design as opposed to the standard SolrCloud ‘every node does everything’ model. The new replica types PULL and TLOG allow one to build a master/slave setup in SolrCloud, separating responsibility for indexing and searching and even choosing which type of replica to use in queries. I also heard Houston Putman talk about data analytics with Solr, describing how built-in Solr functions can carry out the type of analytics previously only possible with Apache Spark or Hadoop, avoiding the extra cost of shipping data out of Solr. Unfortunately that was the end of my conference due to some other commitments, but it was great to catch up with various search people from Europe and further abroad and to enjoy what was a well-organised and interesting event.
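For anyone wanting to try the new replica types, the sketch below builds (but does not send) the two kinds of request involved: a Collections API CREATE call asking for TLOG and PULL replicas, and a query that states a preference for PULL replicas. The `tlogReplicas` and `pullReplicas` parameters and the `shards.preference=replica.type:PULL` query parameter are, as far as I know, available in Solr 7.x (the latter only in later 7.x releases), so do check the Reference Guide for your version. The base URL, collection name and helper function names are assumptions for the example.

```python
from urllib.parse import urlencode

# Assumed base URL of a local SolrCloud node (an assumption for this example).
SOLR = "http://localhost:8983/solr"


def create_collection_url(name, shards, tlog, pull):
    """Build a Collections API CREATE URL requesting TLOG and PULL replicas."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,
        "tlogReplicas": tlog,  # replicas that can take over indexing ('master'-like)
        "pullReplicas": pull,  # search-only replicas ('slave'-like)
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def search_url(collection, q):
    """Build a query URL that prefers PULL replicas where they are available."""
    params = {"q": q, "shards.preference": "replica.type:PULL"}
    return f"{SOLR}/{collection}/select?{urlencode(params)}"


create = create_collection_url("products", shards=2, tlog=2, pull=3)
search = search_url("products", "title:solr")
print(create)
print(search)
```

The URLs can be pasted into a browser or passed to any HTTP client. Keeping the indexing work on the TLOG replicas and steering queries at the PULL replicas is what gives the master/slave-style separation described in the talk.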

The post Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018 appeared first on Flax.
