query – Flax

London Lucene/Solr Meetup: Query Pre-processing & SQL with Solr

Charlie Hull — Fri, 02 Jun 2017 14:31:32 +0000

Bloomberg kindly hosted the London Lucene/Solr Meetup last night and we were lucky enough to have two excellent speakers for the thirty or so attendees. René Kriegler kicked off with a talk about the Querqy library he has developed to provide a pre-processing layer for Solr (and soon, Elasticsearch) queries. This library was originally developed during a project for Germany’s largest department store Galeria Kaufhof and allows users to add a series of simple rules in a text file to raise or lower results containing certain words, filter out certain results, add synonyms and decompound words (particularly important for German!). We’ve seen similar rules-based systems in use at many of our e-commerce clients, but few of these work well with Solr (Hybris in particular has a poor integration with Solr and can produce some very strange Solr queries). In contrast, Querqy is open source and designed by someone with expert Solr knowledge. With the addition of a simple UI or an integration with a relevancy-testing framework such as Quepid, this could be a fantastic tool for day-to-day tuning of search relevance – without the need for Solr expertise. You can find Querqy on Github.

Michael Suzuki of Alfresco talked next about the importance of being bilingual (actually he speaks 4 languages!) and how new features in Solr version 6 allow one to use either Solr syntax, SQL expressions or a combination of both. This helps hide Solr’s complexity and also allows easy integration with database administration and reporting tools, while allowing use of Solr by the huge number of developers and database administrators familiar with SQL syntax. Using a test set from the IMDB movie archive he demonstrated how SQL expressions can be used directly on a Solr index to answer questions such as ‘what are the highest grossing film actors’. He then used visualisation tool Apache Zeppelin to produce various graphs based on these queries and also showed dbVisualizer, a commonly used database administration tool, connecting directly to Solr via JDBC and showing the index contents as if they were just another set of SQL tables. He finished by talking briefly about the new statistical programming features in Solr 6.6 – a powerful new development with features similar to the R language.

We continued with a brief Q&A session . Thanks to both our speakers – we’ll be back again soon!

The post London Lucene/Solr Meetup: Query Pre-processing & SQL with Solr appeared first on Flax.

Lucene/Solr London Meetup – BioSolr and Query Deep Dive

Charlie Hull — Fri, 24 Apr 2015 10:26:34 +0000

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Databank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.

Thanks all our speakers, to Barclays for providing the venue and for some very tasty food and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

The post Lucene/Solr London Meetup – BioSolr and Query Deep Dive appeared first on Flax.

NOT WITHIN queries in Lucene

Charlie Hull — Wed, 22 Feb 2012 09:29:00 +0000

A guest post from Alan Woodward who has joined the Flax team recently:

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects. dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries search for Spans – a start term, an end term, and an edit distance. So to search for “fish” within two positions of “chips”, you’d create a SpanNearQuery, passing in the terms “fish” and “chips” and an edit distance of 2.

You can also search for terms that are not within X positions of another term. This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is “fish” NOT WITHIN/2 “chips”. A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal “fish” WITHIN/2 “chips” query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for “George” within 2 positions of “Bush”, not overlapping with “H”.

In our specific case, we want to find instances of “fish” that do not overlap with Spans of “fish” within/2 “chips”. So to create our query, we need the following:

int distance = 2; boolean ordered = true; SpanQuery fish = new SpanTermQuery(new SpanTerm(FIELD, "fish")); SpanQuery chips = new SpanTermQuery(new SpanTerm(FIELD, "chips")); SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips }, distance, ordered);

Query q = new SpanNotQuery(fish, fishnearchips);

It’s a bit verbose, but that’s Java for you.

The post NOT WITHIN queries in Lucene appeared first on Flax.