Introducing Luwak, a library for high-performance stored queries

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.
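The pre-filtering idea can be sketched in plain Java, without Lucene. This is only a toy illustration and not how Luwak itself is implemented: the class `PresearchSketch` and its methods `add`, `candidates` and `match` are hypothetical names, and each stored query is assumed to be a simple conjunction of terms. The stored queries are indexed by their terms, the document's own terms are used to look up candidate queries, and only those candidates are then run in full.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration: index the stored queries, then use the document as the query. */
class PresearchSketch {

    // Each stored query is a conjunction: all of its terms must appear in the document.
    private final Map<String, Set<String>> queries = new HashMap<>();
    // Inverted index over the queries: term -> ids of the queries containing that term.
    private final Map<String, Set<String>> termToQueryIds = new HashMap<>();

    /** Stores a new query and indexes it under each of its terms (replacing a query is not handled). */
    void add(String queryId, String... requiredTerms) {
        if (requiredTerms.length == 0) {
            throw new IllegalArgumentException("a query needs at least one term");
        }
        Set<String> terms = new HashSet<>();
        for (String term : requiredTerms) {
            String normalised = term.toLowerCase();
            terms.add(normalised);
            termToQueryIds.computeIfAbsent(normalised, k -> new HashSet<>()).add(queryId);
        }
        queries.put(queryId, terms);
    }

    /** Splits text on whitespace and lower-cases each token (a stand-in for a real analyzer). */
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Pre-filter step: ids of the stored queries that share at least one term with the document. */
    Set<String> candidates(Set<String> docTerms) {
        Set<String> ids = new TreeSet<>();
        for (String term : docTerms) {
            ids.addAll(termToQueryIds.getOrDefault(term, Collections.emptySet()));
        }
        return ids;
    }

    /** Runs only the candidate queries against the document and returns the ids that match. */
    Set<String> match(String text) {
        Set<String> docTerms = tokenize(text);
        Set<String> matches = new TreeSet<>();
        for (String id : candidates(docTerms)) {
            if (docTerms.containsAll(queries.get(id))) {
                matches.add(id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PresearchSketch monitor = new PresearchSketch();
        monitor.add("query_01", "toyota", "camry");
        monitor.add("query_02", "tesla");
        monitor.add("query_03", "tesla", "roadster");

        String article = "Looking for a tesla in Mountain View";
        System.out.println(monitor.candidates(tokenize(article))); // prints [query_02, query_03]
        System.out.println(monitor.match(article));                 // prints [query_02]
    }
}
```

In this sketch `query_01` is never evaluated against the article, because none of its terms occur in it. With tens of thousands of stored queries, skipping most of them in this way is where the saving would come from.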

We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

/* Create a new monitor */
Monitor monitor = new Monitor(new TermFilteredPresearcher());

/* Create and load a stored query with a single term */
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq);

/* Load a document (which could be a news article) */
InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, WHITESPACE)
    .build();

/* Retrieve which queries it matches */
DocumentMatches matches = monitor.match(doc);

The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). Our own tests have produced speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.

14 thoughts on “Introducing Luwak, a library for high-performance stored queries”

    • Hi Ashwin,

      Yes, sorta, and we’re planning a blog post to compare the two. Since we need particular positions information, we need to modify Lucene, and we didn’t fancy rebuilding Elasticsearch ourselves at the time.

      Charlie

  1. Hi,

    That’s clearly a great feature to add to Solr. I’ve been searching for clues to implement it myself for days and by chance stopped by your blog.
    Do you think your work on lucene-solr-intervals will be included in the Lucene trunk? It’s quite hard to follow the status of LUCENE-2878 in the Jira.
    I’ll give your tool a try; it seems to promise great performance 😉
    Thanks for open sourcing it…

  2. Hey Charlie,
    Nice work.
    I am a newbie to Solr and I do not know how to include your work in my project. Any help will be appreciated.
    Thanks

  3. Yup, done with that part, but how should I put regular expression queries in the demoqueries file? Whenever I try to put something like “comm*” I get errors.

  4. We’ve been looking for a solution like this on solr; will definitely give it a spin. Wondering if this has support for filter queries or searches against particular (non-default) fields?

  5. Hi Elaine,

    At the moment luwak is a standalone library, rather than integrated directly into Solr. It will work with any lucene Query object, and you can plug in one of the Solr query parsers if you like. Filter queries don’t make so much sense with luwak, as the queries are only ever run over a single document.

  6. Hi, Charlie!

    I’ve built the luwak master-5.1 branch from https://github.com/flaxsearch/luwak/tree/master-5.1 and it works perfectly for me.

    Here are my queries and documents.
    query_01 = [Toyota Camry 2002]
    query_02 = [Tesla model S]

    document_01 = [Need a BMW i528 2000 in San Francisco, CA 94105]
    document_02 = [Looking for a tesla in Mountain View, CA 94043]

    The luwak engine found:
    document_02 is matched query_02

    I’m impressed with the luwak performance. For instance, 1,000,000 queries against 20 documents took only ~104 seconds.

    So I want to say a big thank you to all of you who have worked on this project and would like to ask the following questions:

    1) Can I provide a query like this to the luwak engine?
    query_n = [(+honda +volvo )^2.75 (+camry +bmw )^0.75 ]

    As you can see, I have the power term “^2.75” in my query, which means a particular keyword is 2.75 times more important than the other keywords in the query. So my question is: does the luwak engine recognize that kind of syntax?

    2) I would also like to add geo params such as latitude, longitude and radius to queries. Is it possible? If yes, please provide tools/plugins/links/materials.

    Best,
    Bakhy

  7. Hi Bakhy,

    The default query parser that comes with luwak just uses the standard lucene query parser, which will understand boolean syntax and boosting. Just be aware that the standard lucene scoring model doesn’t make a great deal of sense in this case, because each query is run across an index consisting of a single document, so the IDF value will always be 1. You can set the Similarity model to be used when constructing an InputDocument if you want to use a different model, and then retrieve scores using a ScoringMatcher.

    The lucene query parser doesn’t support geosearches at the moment (although there is some work being done to include simple geospatial syntax, see https://issues.apache.org/jira/browse/LUCENE-6450), so if you want to add this type of query you’ll have to write a new query parser. It should be possible to extend the standard parser to this end.
