Posts Tagged ‘query’

Lucene/Solr London Meetup – BioSolr and Query Deep Dive

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Databank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets and a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterize queries so that a Solr endpoint can be customized for a particular purpose and how to combine different query parsers. His slides are available here.

Thanks all our speakers, to Barclays for providing the venue and for some very tasty food and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

NOT WITHIN queries in Lucene

A guest post from Alan Woodward who has joined the Flax team recently:

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects. dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries search for Spans – a start term, an end term, and an edit distance. So to search for “fish” within two positions of “chips”, you’d create a SpanNearQuery, passing in the terms “fish” and “chips” and an edit distance of 2.

You can also search for terms that are not within X positions of another term. This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is “fish” NOT WITHIN/2 “chips”. A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal “fish” WITHIN/2 “chips” query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for “George” within 2 positions of “Bush”, not overlapping with “H”.

In our specific case, we want to find instances of “fish” that do not overlap with Spans of “fish” within/2 “chips”. So to create our query, we need the following:

int distance = 2;
boolean ordered = true;
SpanQuery fish = new SpanTermQuery(new SpanTerm(FIELD, "fish"));
SpanQuery chips = new SpanTermQuery(new SpanTerm(FIELD, "chips"));
SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
distance, ordered);

Query q = new SpanNotQuery(fish, fishnearchips);

It’s a bit verbose, but that’s Java for you.

Tags: , , , ,

Posted in Technical

February 22nd, 2012

3 Comments »