Presentations – Flax

Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018

Charlie Hull — Mon, 18 Jun 2018 13:53:27 +0000

I spent last week in a sunny Berlin for the Berlin Buzzwords event (and subsequently MICES 2018, of which more later). This was my first visit to Buzzwords, which was held in an arts & culture complex in an old brewery north of the city centre. The event was larger than I was expecting, at around 550 people, with three main tracks of talks. Although some external meetings meant I didn’t attend as many talks as I would have liked, here are a few highlights. Many of the talks have slides provided and some are now also available on the Buzzwords YouTube channel.

Giovanni Fernandez-Kincade talked about query understanding to improve both recall and precision for searches. He made the point that users and documents often speak very different languages, which can lead to a lack of confidence in the search engine. Various techniques are available to attempt to translate the user’s intention into a suitable query and these can be placed on a spectrum from human-powered (e.g. creating an exception list to prevent stemming of proper nouns) to some degree of automation (e.g. harvesting data to build lists of synonyms) to full automation (machine learning of how queries map to documents). Obviously these also fit on other scales, from labour-intensive to hands-off and from easy to hard in terms of the technology skills required. This talk gave a solid base understanding of the techniques available.
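To make the human-powered end of that spectrum concrete, here is a minimal sketch of my own (not taken from the talk) that combines an exception list with a synonym list. The word lists, the crude suffix stripper and the function names are all invented for illustration; a real system would use the analysis chain of the search engine itself.

```python
# Toy query rewriter: protected words skip stemming, synonyms widen recall.
# PROTECTED, SYNONYMS and both functions are hypothetical illustrations.

PROTECTED = {"banks", "waters"}  # e.g. surnames that must not be stemmed
SYNONYMS = {"laptop": ["notebook"], "tv": ["television"]}


def naive_stem(token):
    """Very crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def rewrite_query(query):
    """Return one list of alternative terms per token in the query."""
    rewritten = []
    for token in query.lower().split():
        # Words on the exception list are passed through untouched.
        stem = token if token in PROTECTED else naive_stem(token)
        # Synonyms are looked up on the stemmed form.
        rewritten.append([stem] + SYNONYMS.get(stem, []))
    return rewritten
```

Calling `rewrite_query("Banks laptops")` keeps the protected surname intact while stemming and expanding the product term.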

I dropped in on Suneel Marthi’s talk on detecting tulip fields from satellite images, which was fascinating although outside my usual area of search engine technology. I then heard Nick Burch describe the many ways that text extraction powered by Apache Tika can crash your JVM or even your entire cluster (potentially expensive in an elastically-scaling situation as more resources are automatically allocated!). As he recommended, one should expect failure and plan accordingly, ship logs somewhere central for analysis and never run Tika inside your Solr instance itself in a production system (a recommendation that has finally made it to the Solr Wiki). Doug Turnbull and Tommaso Teofili then spoke on The Neural Search Frontier, a wide-ranging and in some places somewhat speculative discussion of techniques to improve ranking using word embeddings described by multidimensional vectors. This approach combined traditional IR techniques with neural models to learn whether a document is relevant to a query. One fascinating idea was the use of recurrent neural networks, much used in translation applications, to ‘translate’ a document to a predicted query. As with most of Doug’s talks this gave us a lot to think about, but he finished with a plea for better native vector support in Lucene-based search engines.
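As a rough illustration of combining traditional IR scores with embedding vectors, the sketch below blends a lexical score with cosine similarity when re-ranking a candidate list. This is my own simplification, not the speakers' method: the function names, the linear blend and the assumption that lexical scores are already normalised to the range 0 to 1 are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query_vec, candidates, weight=0.5):
    """Re-rank (doc_id, lexical_score, doc_vec) tuples by a blended score.

    lexical_score is assumed to be normalised to [0, 1]; weight sets how much
    the embedding similarity contributes to the final score.
    """
    scored = [
        (doc_id, (1 - weight) * lexical + weight * cosine(query_vec, vec))
        for doc_id, lexical, vec in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With `weight=0.0` the original lexical ordering is preserved, which makes it easy to compare the two rankings side by side.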

The next talk I heard was from Varun Thacker on Solr autoscaling, which I know is a particular concern of some of our clients as their data volumes grow. These new features in Solr version 7 allow policies and preferences to be set up to govern autoscaling behaviour, where shards may be moved and new cores created automatically based on metrics such as disk space or queries-per-second. One interesting line of questioning from the audience was how to prevent replicas from ‘ping ponging’ between hosts – e.g. moving from a node with low disk space to one with more disk space, but then causing a reduction in disk space on the target node, leading to another move. Usefully, the autoscaling system can be set to compute a list of operations but leave execution to a human operator, which may help prevent this problem.
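The ping-pong problem is easy to see in a toy model. The sketch below is purely illustrative and does not use Solr's autoscaling API: the `Node` class, the `plan_move` function and the returned dictionary are hypothetical. It only suggests a move if the target would still be above the free-space threshold afterwards, and it returns a plan rather than executing anything, mirroring the 'compute but let a human execute' mode.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gb: float


def plan_move(source, target, replica_gb, min_free_gb):
    """Suggest moving a replica, or return None if no safe move exists.

    Nothing is executed here: the returned dict is a plan for an operator.
    """
    if source.free_gb >= min_free_gb:
        return None  # the source node is healthy, nothing to do
    if target.free_gb - replica_gb < min_free_gb:
        # The move would push the target under the threshold and
        # simply trigger another move back again.
        return None
    return {
        "action": "move-replica",  # illustrative label, not a Solr command
        "from": source.name,
        "to": target.name,
        "size_gb": replica_gb,
    }
```

Checking the state of the target after the proposed move, rather than before it, is the design choice that stops the replica bouncing between the two nodes.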

The next day I attended Tomás Fernández Löbbe’s talk on new replica types in Solr 7, which covered the advantages of the ‘Master/Slave’ model for search cluster design as opposed to the standard SolrCloud ‘every node does everything’ model. The new replica types PULL and TLOG allow one to build a master/slave setup in SolrCloud, separating responsibility for indexing and searching and even choosing which type of replica to use in queries. I also heard Houston Putman talk about data analytics with Solr, describing how built-in Solr functions can carry out the type of analytics previously only possible with Apache Spark or Hadoop, avoiding the extra cost of shipping data out of Solr. Unfortunately that was the end of my conference due to some other commitments, but it was great to catch up with various search people from Europe and further abroad and to enjoy what was a well-organised and interesting event.
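For anyone wanting to try this, the sketch below builds the two requests involved: a Collections API CREATE call asking for TLOG and PULL replicas only, and a query that prefers PULL replicas. The host, collection name and replica counts are assumptions for illustration, and the `shards.preference` parameter is only available in Solr releases that support it, so please check the reference guide for your version.

```python
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"  # assumed local Solr instance


def create_collection_url(name, shards, tlog, pull):
    """URL for a collection with TLOG replicas for indexing and PULL for search."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": shards,
        "nrtReplicas": 0,      # no 'every node does everything' replicas
        "tlogReplicas": tlog,  # one of these acts as leader and does the indexing
        "pullReplicas": pull,  # these only copy index segments and serve queries
    }
    return f"{SOLR}/admin/collections?{urlencode(params)}"


def search_url(collection, query):
    """URL for a query that asks Solr to prefer PULL replicas."""
    params = {"q": query, "shards.preference": "replica.type:PULL"}
    return f"{SOLR}/{collection}/select?{urlencode(params)}"
```

The functions only build URLs, so they can be inspected or logged before anything is sent to a cluster.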

The post Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018 appeared first on Flax.

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch

Charlie Hull — Thu, 01 Feb 2018 10:13:56 +0000

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch from Charlie Hull

Making sense of Big Data with open source search

Charlie Hull — Fri, 11 Nov 2016 16:47:24 +0000

Making sense of big data from Charlie Hull

Better search for life sciences at the BioSolr Workshop, day 1 – Apache Lucene/Solr

Charlie Hull — Wed, 10 Feb 2016 10:26:00 +0000

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other.

The day started with a quick recap of the project from me and Dr. Sameer Valenkar of the EBI. Eric Pugh, founder of Flax’s US partners Open Source Connections, followed with his Unofficial State of Solr, detailing the history of the project, recent innovations and what might happen in the future, including some very interesting new features allowing for parallel SQL queries. We then heard from Flax team members Tom Winch and Matt Pearce on how they have built faceting improvements and a new XJoin between Solr and external systems, researched federated search and developed ontology indexers (note that all of the software they’ve built is available as open source, and Tom has recently written extensively about XJoin).

After lunch we heard from Peter Meric of the NCBI (the US equivalent of the EBI) on a Solr-based system for searching gene data, to supplement the NCBI’s homegrown Entrez system. This is very much a filtered search rather than a text search and indexes around 330m records. He also talked about a High Availability prototype of a replacement for the very high traffic PubMed service built on Amazon Web Services. Each Solr, MongoDB or Zookeeper node ‘announces’ itself using a monitor service and then replicates data from a master node. Although it is not yet available as open source I think this project may be of great interest to the wider Solr community and I hope we hear more of it soon.

Next up was a brief talk by Dan Bolser of the EBI on an ‘old school’ scheme for sharding plant phenotype data – I’d seen part of this presentation before and it’s linked to our own ideas on federating search across bioinformatics data. Dan was followed by Lewis Geer of NCBI talking about the SEQR protein similarity search engine built on Solr. Although somewhat complex for us non-biologists to understand, this very clever system relies on experimental results to suggest which of the possible variants of a protein system are likely, and adds these to the Solr index – it reminded me of a similar approach we’ve used to store possible OCR errors when working with scanned newsprint. His team’s code is available. Dan Stainer of the Ensembl project was next discussing how his team are indexing tens of thousands of genomes from thousands of species, currently on a MySQL backend with a REST API and a lot of Perl. He discussed how they have been experimenting with Elasticsearch to index around 3.2bn items, creating a 782GB index which builds in around 5-6 hours, to provide new capabilities such as structured queries for their genome browser tools.

We then held an interactive hands-on session, covering subjects such as ‘getting started with Solr’ and exploring some of the code we’ve built such as XJoin, followed by a conference dinner in Hinxton Hall. It was clear that there is a huge range of use cases for search technology in the life sciences community and almost as many different ways to address them, and the after-dinner conversation was lively and highly interesting!

Most of the presentations are now available for download and we’ve also written about the second day of the event, where we shifted focus onto Elasticsearch and other technologies.

Out and about in search & monitoring – Autumn 2015

Charlie Hull — Wed, 16 Dec 2015 10:24:42 +0000

It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard-to-scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.
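Running thousands of stored queries over each incoming story is a 'reverse search' problem, and the key trick is to avoid running every query against every document. The toy sketch below shows that idea in miniature. It is emphatically not Luwak's API: the class, its methods and the restriction to simple all-terms-must-match queries are assumptions made for illustration.

```python
from collections import defaultdict


class StoredQueryMatcher:
    """Toy reverse search: register AND-of-terms queries, then match documents."""

    def __init__(self):
        self.queries = {}                # query_id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries mentioning it

    def register(self, query_id, terms):
        required = {t.lower() for t in terms}
        self.queries[query_id] = required
        for term in required:
            self.by_term[term].add(query_id)

    def match(self, text):
        doc_terms = set(text.lower().split())
        # Preselection: only queries sharing at least one term with the
        # document can possibly match, so skip all the others.
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Full check: every required term must appear in the document.
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)
```

For example, after registering `("q1", ["solr", "search"])` and `("q3", ["solr", "tika"])`, matching the text "Apache Solr powers search" considers both queries as candidates but only reports `q1`.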

The week after, I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab’s research into complex search strategies and Digirati and Synaptica’s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).
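One common way to measure how similar two lightly rewritten articles are is to compare overlapping word sequences ('shingles') using Jaccard similarity. The sketch below shows that general technique only; I am not suggesting it is the approach used in Clipshare, and the function names and default shingle size are my own choices.

```python
def shingles(text, size=3):
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b, size=3):
    """Shared shingles divided by all distinct shingles across both texts."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Two versions of a sentence that differ only in the final word share most of their shingles and so score highly, while unrelated texts score zero. A threshold on this score is one simple way to group editions of the same story.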

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

Search Solutions 2015: Towards a new model of search relevance testing

Charlie Hull — Fri, 27 Nov 2015 15:53:30 +0000

Find out more about Quepid here.

Search Solutions 2015: Towards a new model of search relevance testing from Charlie Hull

Elasticsearch for Westcoast – why search is never simple!

Charlie Hull — Fri, 27 Nov 2015 15:48:29 +0000

Elasticsearch for Westcoast from Charlie Hull

FIBEP WMIC 2015 – Open source search for media monitoring with Solr

Charlie Hull — Thu, 19 Nov 2015 16:23:46 +0000

FIBEP WMIC 2015 – How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform from Charlie Hull

Enterprise Search Europe 2015: Fishing the big data streams – the future of search

Charlie Hull — Wed, 28 Oct 2015 12:09:52 +0000

Enterprise Search Europe 2015: Fishing the big data streams – the future of search from Charlie Hull

Lucene/Solr Revolution 2015: BioSolr – Searching the stuff of life

Charlie Hull — Fri, 16 Oct 2015 13:17:50 +0000

BioSolr – Searching the stuff of life – Lucene/Solr Revolution 2015 from Charlie Hull
