Introducing Luwak, a library for high-performance stored queries

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.
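To make the "index the expressions, use the article as the query" idea concrete, here is a minimal sketch in plain Java. It is not Luwak's implementation or API: the class `ReverseSearchSketch` and its methods are hypothetical names invented for illustration, and it assumes each stored query is simply a set of lower-case terms that must all appear in the document. Real stored expressions are far richer, but the two-step shape (cheap pre-filter, then full check on the survivors) is the same.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of reverse search: the queries are indexed, the document is the query. */
class ReverseSearchSketch {

    // Stored queries: id -> lower-case terms that must ALL appear in a document (a deliberate simplification).
    private final Map<String, Set<String>> queries = new HashMap<>();

    // Inverted index over the queries themselves: term -> ids of the queries that mention it.
    private final Map<String, Set<String>> termToQueryIds = new HashMap<>();

    void addQuery(String id, String... requiredTerms) {
        Set<String> terms = new HashSet<>(Arrays.asList(requiredTerms));
        queries.put(id, terms);
        for (String term : terms) {
            termToQueryIds.computeIfAbsent(term, t -> new HashSet<>()).add(id);
        }
    }

    // Naive tokenizer: lower-case the text and split on non-word characters.
    static Set<String> tokenize(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 1 (pre-filter): look up the document's terms in the index of stored queries.
    Set<String> candidates(Set<String> docTerms) {
        Set<String> ids = new HashSet<>();
        for (String term : docTerms) {
            ids.addAll(termToQueryIds.getOrDefault(term, Collections.<String>emptySet()));
        }
        return ids;
    }

    // Step 2 (verify): run only the surviving candidates against the document.
    Set<String> match(String document) {
        Set<String> docTerms = tokenize(document);
        Set<String> matches = new TreeSet<>();
        for (String id : candidates(docTerms)) {
            if (docTerms.containsAll(queries.get(id))) {
                matches.add(id);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        ReverseSearchSketch monitor = new ReverseSearchSketch();
        monitor.addQuery("query1", "test");
        monitor.addQuery("query2", "tesla", "model");
        monitor.addQuery("query3", "toyota", "camry");
        // query3 shares no term with the document, so it is never evaluated at all.
        System.out.println(monitor.match("Looking for a Tesla Model S, test drive wanted"));
        // prints [query1, query2]
    }
}
```

The saving comes from step 1: a query that shares no terms with the article is discarded by a map lookup rather than being executed, so the cost per article depends mostly on the number of candidate queries, not on the total number stored.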

We’re pleased to announce that the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

/* Create a new monitor */
Monitor monitor = new Monitor(new TermFilteredPresearcher());
/* Create and load a stored query with a single term */
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq);
/* Load a document (which could be a news article) */
InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, WHITESPACE)
    .build();
/* Retrieve which queries it matches */
DocumentMatches matches = monitor.match(doc);

The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). Our own tests have produced speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.

14 thoughts on “Introducing Luwak, a library for high-performance stored queries”

Would I be correct in assuming that this is like ElasticSearch’s Percolator?

Thanks and nice work.

Reply ↓

charlie on December 10, 2013 at 9:51 am said:

Hi Ashwin,

Yes, sorta – and we’re planning a blog post to compare the two. Since we need particular position information, we had to modify Lucene, and we didn’t fancy rebuilding Elasticsearch ourselves at the time.

Charlie

Reply ↓

Hi,

That’s clearly a great feature to add to Solr. I’ve been searching for clues to implement it myself for days and by chance stopped by your blog.
Do you think your work on lucene-solr-intervals will be included in the Lucene trunk? It’s quite hard to follow the status of LUCENE-2878 in Jira.
I’ll give your tool a try; it seems to promise great performance 😉
Thanks for open sourcing it…

Reply ↓

charlie on February 4, 2014 at 10:10 am said:

Yes, getting the intervals code into trunk is a high priority, and we’re working on it! Glad you like Luwak; it would be great to hear more about your project.

Reply ↓

Hey Charlie,
Nice work.
I am a newbie to Solr and I do not know how to include your work in my project. Any help would be appreciated.
Thanks

Reply ↓

charlie on February 13, 2014 at 8:54 am said:

There is an example project in the GitHub repository – try building it and let us know how you get on.

Reply ↓

Yup, done with that part, but how should I put regular expression queries in the demoqueries file? Whenever I try to put in something like “comm*” I get errors.

Reply ↓

Could you open an issue on GitHub? https://github.com/flaxsearch/luwak/issues

Reply ↓

We’ve been looking for a solution like this on solr; will definitely give it a spin. Wondering if this has support for filter queries or searches against particular (non-default) fields?

Reply ↓

Hi Elaine,

Do let us know more about your use case directly if you like via http://www.flax.co.uk/contact/

Charlie

Reply ↓

Hi Elaine,

At the moment luwak is a standalone library, rather than integrated directly into Solr. It will work with any lucene Query object, and you can plug in one of the Solr query parsers if you like. Filter queries don’t make so much sense with luwak, as the queries are only ever run over a single document.

Reply ↓

Hi, Charlie!

I’ve built the luwak master-5.1 branch from https://github.com/flaxsearch/luwak/tree/master-5.1 and it works perfectly for me.

Here are my queries and documents.
query_01 = [Toyota Camry 2002]
query_02 = [Tesla model S]
…

document_01 = [Need a BMW i528 2000 in San Francisco, CA 94105]
document_02 = [Looking for a tesla in Mountain View, CA 94043]
…

The luwak engine found:
document_02 matched query_02

I’m impressed with luwak’s performance. For instance, running 1,000,000 queries against 20 documents took only ~104 seconds.

So I want to say a big thank you to all of you who have worked on this project and would like to ask the following questions:

1) Can I provide a query like this to the luwak engine?
query_n = [(+honda +volvo )^2.75 (+camry +bmw )^0.75 ]

As you can see, my query contains the power term “^2.75”, which means that a particular keyword is 2.75 times as important as the other keywords in the query. So my question is: does the luwak engine recognize this kind of syntax?

2) I would also like to add geo parameters such as latitude, longitude and radius to queries. Is that possible? If so, please point me to any tools, plugins, links or materials.

Best,
Bakhy

Reply ↓

Hi Bakhy,

The default query parser that comes with luwak just uses the standard lucene query parser, which will understand boolean syntax and boosting. Just be aware that the standard lucene scoring model doesn’t make a great deal of sense in this case, because each query is run across an index consisting of a single document, so the IDF value will always be 1. You can set the Similarity model to be used when constructing an InputDocument if you want to use a different model, and then retrieve scores using a ScoringMatcher.
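As a rough illustration of why IDF stops being useful here, the sketch below evaluates a common smoothed IDF formula, ln((N + 1) / (df + 1)) + 1, for a conventional index and for a single-document index. The formula is an assumption chosen for illustration; the exact expression depends on the Similarity and Lucene version in use, and the class name `SingleDocIdfSketch` is hypothetical.

```java
/** Hypothetical sketch: IDF on a one-document index is the same for every term that occurs. */
class SingleDocIdfSketch {

    // Smoothed IDF, ln((N + 1) / (df + 1)) + 1. This is an assumed textbook-style formula,
    // not necessarily what a given Lucene Similarity computes.
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Conventional index: a rare term gets a higher IDF than a common one.
        System.out.println(idf(10, 1_000_000));       // rare term
        System.out.println(idf(500_000, 1_000_000));  // common term

        // Single-document index: any term that occurs at all has docFreq == docCount == 1,
        // so every matching term gets the same value.
        System.out.println(idf(1, 1)); // prints 1.0
    }
}
```

Because every matching term receives the same IDF, scores are driven only by term frequency, boosts and length normalisation, which is why a different Similarity may be a better fit if scores matter.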

The lucene query parser doesn’t support geosearches at the moment (although there is some work being done to include simple geospatial syntax, see https://issues.apache.org/jira/browse/LUCENE-6450), so if you want to add this type of query you’ll have to write a new query parser. It should be possible to extend the standard parser to this end.

Reply ↓

At Manticore (continuation of Sphinx Search project) we’ve developed a similar technology. Here’s the benchmark comparing that to Luwak – https://manticoresearch.com/2018/04/18/percolate-queries-manticore-search-vs-luwak/

Reply ↓
