A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.
We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:
Monitor monitor = new Monitor(new TermFilteredPresearcher()); /* Create a new monitor */
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq); /* Create and load a stored query with a single term */
InputDocument doc = InputDocument.builder("doc1")
.addField(textfield, document, WHITESPACE)
.build(); /* Load a document (which could be a news article) */
DocumentMatches matches = monitor.match(doc); /* Retrieve which queries it matches */
The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). Our own tests have produced speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.