NOT WITHIN queries in Lucene

A guest post from Alan Woodward who has joined the Flax team recently:

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects. dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries search for Spans: ranges of token positions within a document, each with a start position and an end position. So to search for “fish” within two positions of “chips”, you’d create a SpanNearQuery, passing in SpanTermQueries for “fish” and “chips” and a slop (Lucene’s name for the allowed distance between the terms) of 2.

You can also search for terms that are not within X positions of another term. This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is “fish” NOT WITHIN/2 “chips”. A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal “fish” WITHIN/2 “chips” query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.
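To make the difference concrete, here is a minimal sketch in plain Java (no Lucene involved) that models the two behaviours on a list of tokens. The class and method names are made up for illustration, and it assumes “within N positions” simply means the two positions differ by at most N, which is not necessarily how dtSearch or Lucene count distance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the query semantics; hypothetical names, not Lucene code.
class NotWithinSketch {

    // Positions at which `term` occurs in the token list.
    static List<Integer> positions(List<String> tokens, String term) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(term)) {
                result.add(i);
            }
        }
        return result;
    }

    // True if any position in `others` is at most n positions away from `pos`.
    static boolean hasNeighbour(int pos, List<Integer> others, int n) {
        for (int other : others) {
            if (Math.abs(pos - other) <= n) {
                return true;
            }
        }
        return false;
    }

    // a WITHIN/n b: at least one occurrence of a is close to an occurrence of b.
    static boolean within(List<String> tokens, String a, String b, int n) {
        List<Integer> bs = positions(tokens, b);
        for (int pos : positions(tokens, a)) {
            if (hasNeighbour(pos, bs, n)) {
                return true;
            }
        }
        return false;
    }

    // a NOT WITHIN/n b: at least one occurrence of a has no b close to it.
    static boolean notWithin(List<String> tokens, String a, String b, int n) {
        List<Integer> bs = positions(tokens, b);
        for (int pos : positions(tokens, a)) {
            if (!hasNeighbour(pos, bs, n)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "fish and chips is nicer than fish and jam".split(" "));
        // Negating the whole WITHIN query rejects the document...
        System.out.println(!within(doc, "fish", "chips", 2));   // prints false
        // ...whereas checking each occurrence of "fish" separately accepts it.
        System.out.println(notWithin(doc, "fish", "chips", 2)); // prints true
    }
}
```

The first ‘fish’ (position 0) is two positions from ‘chips’ (position 2), so the plain WITHIN query matches and its negation fails; the second ‘fish’ (position 6) is four positions away, which is what the NOT WITHIN check picks up.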

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for “George” within 2 positions of “Bush”, not overlapping with “H”.
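The overlap rule can be modelled in a few lines of plain Java. This is only a sketch of the idea, not Lucene’s implementation: the Span class and spanNot method are hypothetical, spans are treated as half-open ranges of token positions, and the span positions for the sentence “George W Bush met George H W Bush” are written in by hand rather than computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "include spans that overlap no exclude span".
class SpanNotSketch {

    // A span covers token positions [start, end), with end exclusive.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        // Two half-open ranges overlap if each starts before the other ends.
        boolean overlaps(Span other) {
            return start < other.end && other.start < end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    // Keep every include span that does not overlap any exclude span.
    static List<Span> spanNot(List<Span> include, List<Span> exclude) {
        List<Span> kept = new ArrayList<>();
        for (Span in : include) {
            boolean clash = false;
            for (Span ex : exclude) {
                if (in.overlaps(ex)) {
                    clash = true;
                    break;
                }
            }
            if (!clash) {
                kept.add(in);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Tokens: George(0) W(1) Bush(2) met(3) George(4) H(5) W(6) Bush(7)
        // Hand-written spans for "George" near "Bush":
        List<Span> georgeNearBush = Arrays.asList(new Span(0, 3), new Span(4, 8));
        // Hand-written span for the term "H":
        List<Span> h = Arrays.asList(new Span(5, 6));
        System.out.println(spanNot(georgeNearBush, h)); // prints [[0,3)]
    }
}
```

Only the first “George … Bush” span survives, because the second one contains the “H” at position 5.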

In our specific case, we want to find instances of “fish” that do not overlap with Spans of “fish” within/2 “chips”. So to create our query, we need the following:

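import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

int distance = 2;
boolean ordered = true; // "fish" must come before "chips"; use false to match either order
SpanQuery fish = new SpanTermQuery(new Term(FIELD, "fish"));
SpanQuery chips = new SpanTermQuery(new Term(FIELD, "chips"));
SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
        distance, ordered);

// Instances of "fish" that do not overlap a "fish near chips" span
Query q = new SpanNotQuery(fish, fishnearchips);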

It’s a bit verbose, but that’s Java for you.


This entry was posted on Wednesday, February 22nd, 2012 at 9:29 am and is filed under Technical.

3 Responses to “NOT WITHIN queries in Lucene”

  1. Nice article, that’s what I am looking for.
    Can you explain how to integrate it into the Lucene/Solr query syntax?
    Right now I am planning to use the surround query parser for it; if you have a better idea, it would be welcome.

  2. I’m afraid the standard Solr query syntax doesn’t support Span queries.

  3. Hi Charlie,
    Thanks for the response. I have found a workaround for that one: I used a within-term query in phrase queries, and customized the Solr query parser accordingly.

    Thanks again for such a nice article.
