Introducing Luwak, a library for high-performance stored queries

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.
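To illustrate the idea (this is a simplified toy, not Luwak's actual implementation, and all names in it are invented for the example): each stored query is reduced to the terms it requires, an inverted index maps each term to the queries that mention it, and an incoming document's terms are then used to look up the small set of candidate queries worth evaluating in full.

```java
import java.util.*;

// Toy sketch of the "index the queries" idea: term -> query ids,
// with the document acting as the query over that index.
public class QueryPrefilter {

    // term -> ids of stored queries that require that term
    private final Map<String, Set<String>> termIndex = new HashMap<>();
    // query id -> full set of required terms (simple conjunctive queries)
    private final Map<String, Set<String>> queries = new HashMap<>();

    public void addQuery(String id, Set<String> requiredTerms) {
        queries.put(id, requiredTerms);
        for (String term : requiredTerms) {
            termIndex.computeIfAbsent(term, t -> new HashSet<>()).add(id);
        }
    }

    public Set<String> match(String documentText) {
        Set<String> docTerms =
            new HashSet<>(Arrays.asList(documentText.toLowerCase().split("\\s+")));
        // Pre-filter: any query sharing at least one term is a candidate
        Set<String> candidates = new HashSet<>();
        for (String term : docTerms) {
            candidates.addAll(termIndex.getOrDefault(term, Collections.emptySet()));
        }
        // Full evaluation runs only over the candidates, not all queries
        Set<String> matches = new HashSet<>();
        for (String id : candidates) {
            if (docTerms.containsAll(queries.get(id))) {
                matches.add(id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        QueryPrefilter p = new QueryPrefilter();
        p.addQuery("q1", new HashSet<>(Arrays.asList("election", "results")));
        p.addQuery("q2", new HashSet<>(Arrays.asList("cricket")));
        System.out.println(p.match("early election results announced today"));
        // prints [q1]
    }
}
```

With tens of thousands of stored queries, most documents share terms with only a handful of them, so the expensive full evaluation step touches a tiny fraction of the query set.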

We’re pleased to announce that the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

// Create a new monitor with a term-filtering presearcher
Monitor monitor = new Monitor(new TermFilteredPresearcher());

// Create and register a stored query with a single term
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq);

// Build an input document (which could be a news article); 'textfield',
// 'document' and 'WHITESPACE' (the field name, the article text and an
// analyzer) are assumed to be defined elsewhere
InputDocument doc = InputDocument.builder("doc1")
        .addField(textfield, document, WHITESPACE)
        .build();

// Find out which of the stored queries the document matches
DocumentMatches matches = monitor.match(doc);

The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). In our own tests we have applied up to 70,000 stored queries to a single article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.


This entry was posted on Friday, December 6th, 2013 at 2:24 pm and is filed under Technical.

8 Responses to “Introducing Luwak, a library for high-performance stored queries”

  1. Would I be correct in assuming that this is like ElasticSearch’s Percolator?

    Thanks and nice work.

  2. Hi Ashwin,

Yes, sorta – and we’re planning a blog post to compare the two. Since we need detailed positional information we had to modify Lucene, and we didn’t fancy rebuilding Elasticsearch ourselves at the time.

    Charlie

  3. Hi,

That’s clearly a great feature to add to Solr. I’ve been searching for clues to implement it myself for days and by chance stopped by your blog.
    Do you think your work on lucene-solr-intervals will be included in the Lucene trunk? It’s quite hard to follow the status of LUCENE-2878 in the Jira.
    I’ll give your tool a try – it promises great performance, it seems ;-)
    Thanks for open sourcing it…

Yes, getting the intervals code into trunk is a high priority, and we’re working on it! Glad you like Luwak – it would be great to hear more about your project.

  5. Hey Charlie,
    Nice work.
I am a newbie to Solr and I do not know how to include your work in my project; any help will be appreciated.
    Thanks

  6. There is an example project in the Github repository – try building it and let us know how you get on.

Yup, done with that part, but in the demoqueries file how should I put regular expression queries? Whenever I try to put something like “comm*” I get errors.

  8. Could you open an issue on github? https://github.com/flaxsearch/luwak/issues
