Posts Tagged ‘lucene’

Autumn events roundup – ESS DC, Solr vs Elasticsearch & a new Meetup

It’s looking like a busy Autumn for search events – first, I’m presenting at Enterprise Search & Discovery 2014 in Washington DC on November 5th, talking about ‘Turning Search Upside Down with open source software’. I’ll be describing how we’ve replaced various underperforming, big name closed source search engines with faster & more scalable open source technology, including our own Luwak stored query engine. Do let me know if you’re in DC, I’d be very happy to meet up. The week after this is Lucene Revolution, which sadly we won’t be attending this year, but it is recommended if you’re interested in Lucene and Solr.

Towards the end of November there’s Search Solutions, a great day of presentations about all aspects of search held at the British Computer Society in Covent Garden. This year Tom Mortimer from Flax will be presenting some research we’ve done into performance comparisons between Lucene/Solr and Elasticsearch, and there are also presentations from Thomson Reuters, the British Library, Microsoft, Yahoo! and Google. I highly recommend this event, it’s always worth attending.

We’re also starting a new Meetup in London, a group for users of Apache Lucene/Solr (there’s an Elasticsearch London user group but strangely no equivalent for the other popular stack). Our first event is on November 28th, kindly hosted by Bloomberg (who are no strangers to Lucene/Solr themselves) and featuring Shalin Mangar, a Lucene/Solr committer from Lucidworks who is visiting Europe that week. We’re hoping that we can run these events every few months, but we need help from the community, so if you could talk, sponsor or host the Meetups do let us know.

In December we’ll be holding another Cambridge Search Meetup and will be talking about our work with the European Bioinformatics Institute on the BioSolr project – the date to be confirmed. Busy times!

BioSolr begins with a workshop day

Last Thursday we attended a workshop day at the European Bioinformatics Institute as part of our joint BioSolr project. This was an opportunity for us to give some talks on particular aspects of Apache Lucene/Solr and hear from the various teams there on how they are using the software. The workshop was oversubscribed – it seems that there are even more people interested in Solr on the Wellcome Campus than we thought! We were also happy to welcome Giovanni Tummarello from Siren Solutions in Galway, Ireland and Lewis Geer from the EBI’s sister organisation in the USA, the NCBI.

We started with a brief introduction to BioSolr from Dr. Sameer Velankar and Flax then talked on Best Practices for Indexing with Solr. Based very much on our own experience and projects, we showed how although Solr’s Data Import Handler can be used to carry out many of the various tasks necessary to import, convert and process data, we prefer to write our own indexing systems, allowing us to more easily debug complex indexing tasks and protect the system from less stable external processing libraries. We then moved on to a presentation on Distributed Indexing, describing the older master/slaves technique and the more modern SolrCloud architecture we’ve used for several recent projects. We finished the morning’s talks with a quick guide to how to migrate from Apache Lucene to Apache Solr (which of course uses Lucene under the hood but is a much easier and full featured system to work with).

After lunch and some networking, we gave a further short presentation on comparing Elasticsearch to Solr, as some teams at the EBI have been considering its use. We then heard from Giovanni on Siren Solutions’ innovative method for indexing hierarchical data with Solr using XML. His talk mentioned how by encoding tree positions directly within the index, far fewer Solr documents need to be created, with an index size reduction of 50% and up to twice the query speed. Siren have recently released open source plugins for both Solr and Elasticsearch based on this idea which are certainly worth investigating.

Following this talk, Lewis Geer described how the NCBI have built a large-scale bioinformatics search platform backed by Solr, built on commodity hardware and supporting up to 500 queries per second. To enable queries using various methods (Solr, SQL or even BLAST) they have built their own internal query language and standard result schemas, and have also collaborated with Heliosearch to develop improved JOIN facilities for Solr. The latter is a very exciting development as JOINs are heavily used in bioinformatics queries and we believe these features (made available recently as Solr patches) can be of use to the EBI as well. We’ll be investigating further how we can both use these features and help them to be committed to Solr.

Next was a collection of short talks from various teams from the Wellcome campus on how they were using Solr, Lucene and related tools. We heard from the PDBE, SPOT, Ensembl, UniProt, Sanger Core Services and Literature Services on a varied range of use cases, from searching proteins using Solr to scientific papers using Lucene. It was clear that we’ve still only scratched the surface of what is being done with both Lucene and Solr, and as the project progresses we hope to be able to generate repositories of useful software, documentation, best practices, guidance on migration and scaling and also learn a huge amount more about how search can be used in bioinformatics.

Over the next few weeks members of the Flax team will be visiting the EBI to work directly with the PDB and SPOT teams, to find out where we might be most effective. We’ll also be running Solr user group meetings at both the EBI and in Cambridge, of which more details soon. Do let us know if you’re interested! Thanks to the EBI for hosting the workshop day and of course the BBSRC for funding the BioSolr project.

Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz

This year’s Enterprise Search Europe was held near Victoria train station in London and unfortunately coincided with a two day strike on the London Underground – worrying for the organisers, but apart from a few notable absences it didn’t seem to affect the attendance too much. We started with a keynote from Dale Roberts, whose book on Decision Sourcing inspired a talk about a ‘rational decision making model’. When examining traditional relational database applications Dale said ‘if you peer at it long enough you can see the rows and columns’ and his point was that modern consumer social networking applications don’t exhibit this old pattern – so this is where search application designers should look for inspiration. His co-presenter Rooven Pakkiri said that Enterprise Search should attempt to ‘release the information from inside our heads’, which of course social networking might help with, connecting you with colleagues. I’m not sure that one can easily take lessons learnt from consumer applications and apply them to business use, and some later speakers agreed with me, but this was a high-energy and thought-provoking start.

Next I chaired the Open Source track, where we started with Cedric Ulmer of France Labs, who talked about a search application they built for a consultancy business with around 40 employees. Using Apache Solr, Apache ManifoldCF and their own Datafari open source framework they turned this project around very quickly – interestingly, the end clients needed no training to use the new system, which implies a very well designed UI. Our second talk from Ronald Hobbs of Reed Business International described a project on a much larger scale: 100 million documents, 72 business units and up to 190 queries per second – this was originally served by the FAST ESP engine but they moved to an Apache Solr system, replacing the FAST processing pipeline with Search Technologies’ Aspire project. His five steps for an effective migration (Prepare, Get the right tools, Get the right team, Migrate in chunks, Clean up) I can only agree with from our own experience of such projects, including one from FAST ESP to Solr. I was amused by his description of the Apache Zookeeper project as ‘a bipolar manic depressive’, although it seemed this was eventually overcome with a successful deployment on Amazon EC2. Next was Galina Hinova of Intrafind on an aftersales search application for MAN Truck and Bus – again at serious scale (MAN have around 1 billion vehicles in existence with 100-150 documents related to each). Interestingly the Euro6 regulations for emissions and standardized EU terms for automobile parts were direct drivers of the project, with Apache Lucene as the base technology. No longer is open source search just for small-scale projects it seems!

After a short break during which I chatted to John Newton, founder of Documentum and Alfresco, and his team we returned to hear Dan Jackson give a description of how UCL had improved their website search – with a chaotic mix of low quality content and an ‘awful’ content management system, the challenges were myriad but with the help of experts such as our associate Tony Russell-Rose they have made significant improvements. Next was what was to prove a very popular talk from Nick Brown of AstraZeneca on a huge, well funded project to build applications to support research and development – again, this was at large scale with 75 million documents (including ‘all the patents and all the research papers’). The key here was their creation of many well-targeted ‘apps’ to enable particular uses of the Sinequa search engine they chose for the back end, including mobile apps to help find others in the company (or external to it) who are also working on a particular drug or disease. This presentation showed just what can be achieved if companies really understand the potential of search technology – knowledge sharing and discovery of previously unknown information.

After a short drinks reception we retired to a nearby pub for the combined Cambridge and London Search Meetup – I’d prepared a short quiz (feel free to have a go!) which was won by Tony Russell-Rose’s team. Networking and chatting continued long into the evening, with some people from the wider UK search community also attending.

To be continued! You can see most of the slides here.

As Hadoop gains, does Lucene benefit?

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in an HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There have also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

London Search Meetup – Serious Solr at Bloomberg & Elasticsearch 1.0

The financial information service Bloomberg hosted last Friday’s London Search Meetup in their offices on Finsbury Square – the venue had to be seen to be believed, furnished as it is with neon, chrome, modern art and fishtanks. A slight step up from the usual room above a pub! The first presenter was Ramkumar Aiyengar of Bloomberg on their new search system, accessed via the Bloomberg terminal (as it seems is everything else – Ramkumar even opened his presentation file and turned off notifications from his desk phone from within this application).

Make no mistake, Bloomberg’s requirements are significant: 900,000 new stories from 75,000 sources and 8 million manual searches every day with another 350,000 stored searches running automatically. Some of these stored searches are Boolean expressions with up to 20,000 characters and the source data is also enhanced with keywords from a list of over a million tags. Access Control Lists (ACLs) for security and over 40 languages are also supported, with new stories becoming searchable within 100ms. What is impressive is that these requirements are addressed using the open source Apache Lucene/Solr engine running 256 index shards, replicated 4 times for a total of 1024 cores, on a farm of 32 servers each with 256GB of RAM. It’s interesting to wonder if many closed source search engines could cope at all at this scale, and slightly scary to think how much it might cost!

Ramkumar explained how achieving this level of performance had led them to expose (and help to fix) quite a few previously unknown race conditions in Solr. His team had also found innovative ways to cope with such a large number of tags – each has a confidence value, say 70%, and this can be used to perform a kind of TF/IDF ranking by effectively adding 70 copies of the tag to a document. They have also developed an XML-based query parser for their in-house query syntax (although in the future the JSON format may be used) and have contributed code back to Solr (for those interested, Bloomberg have contributed to SOLR-839 and are also looking at SOLR-4351).
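The confidence trick is easy to picture in code. What follows is only a minimal Python sketch of the idea as it was described in the talk, not Bloomberg's implementation: the function name, the 0.0 to 1.0 confidence range and the scale of 100 copies are all assumptions made for illustration.

```python
def expand_tags(tag_confidences, scale=100):
    """Repeat each tag in proportion to its confidence (0.0 to 1.0).

    Hypothetical helper: a tag with confidence 0.7 is emitted 70 times,
    so its term frequency in the tag field reflects the confidence.
    """
    tokens = []
    for tag, confidence in tag_confidences.items():
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence for {tag!r} must be between 0 and 1")
        tokens.extend([tag] * round(confidence * scale))
    return tokens


if __name__ == "__main__":
    tokens = expand_tags({"oil": 0.7, "gas": 0.2})
    print(tokens.count("oil"), tokens.count("gas"))
```

The resulting token list would be indexed into a tag field, so that a term-frequency-based ranking favours documents where the tag was assigned with more confidence.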

For the monitoring requirement, we were very pleased to hear they are building an application based on our own Luwak stored query engine, which we developed for just this sort of high-performance application – we’ll be helping out where we can. Other future plans include relevance improvements, machine translation, entity search and connecting to some of the other huge search indexes running at Bloomberg, some on the petabyte scale.

Next up was Mark Harwood of Elasticsearch with an introduction to some of the features in version 1.0 and above. I’d been lucky enough to see Mark talk about some of these features a few weeks before so I won’t repeat myself here, but suffice it to say he again demonstrated the impressive new Aggregations feature and raised the interesting possibility of market analysis by aggregating over a set of logged queries – identifying demand from what people are searching for.

Thanks to Bloomberg, Ramkumar, Mark and Tyler Tate for a fascinating evening – we also had a chance to remind attendees of the combined London & Cambridge Search Meetup on April 29th to coincide with the Enterprise Search Europe conference (note the discount code!).

ElasticSearch London Meetup – a busy and interesting evening!

I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion about how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can help to surface content from previously unconnected and inaccessible ‘data islands’ and can help promote re-use and repurposing of the data, and can lead clients to understand the value of committing to funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 single-core CPU instances, the system they have built can store 1.2 billion log lines with a throughput of up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes and indexed by Elasticsearch, with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene’s TF/IDF ranking model can produce some unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console allowing for on-the-fly experimentation. I believe this started as a Chrome plugin.

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later), an Aggregator for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples including bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE Mark has posted some code from his demo here.

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

The closed-source topping on the open-source Elasticsearch

Today Elasticsearch (the company, not the software) announced their first commercial, closed-source product, a monitoring plugin for Elasticsearch (the software, not the company – yes I know this is confusing, one might suspect deliberately so). Amongst the raft of press releases there are a few small liberties with the truth, for example describing Elasticsearch (the company) as ‘founded in 2012 by the people behind the Elasticsearch and Apache Lucene open source projects’ – surely the latter project was started by Doug Cutting, who isn’t part of the aforementioned company.

Adding some closed-source dusting to a popular open-source distribution is nothing new of course – many companies do it, especially those that are venture funded – it’s a way of building intellectual property while also taking full advantage of the open-source model in terms of user adoption. Other strategies include curated distributions such as that offered by Heliosearch, founded by Solr creator Yonik Seeley, and our partner LucidWorks’ complete packaged search applications. It can help lock potential clients into your version of the software and your vision of the future, although of course they are still free to download the core and go it alone (or engage people like us to help do so), which helps them retain some control.

It’s going to be interesting to see how this strategy develops for Elasticsearch (for the last time, the company). At Flax we’ve also built various additional software components for search applications – but as we have no external investors to please these are freely available as open-source software, including Luwak our fast stored query engine, Clade a taxonomy/classification prototype and even some file format extractors.

Principles of Solr application design – part 2 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! Here’s the second part, you can also read the first part.

8. Have enough RAM

The single biggest performance bottleneck in most search installations is lack of RAM. Search is an I/O-intensive process, and the more that disk reads can be cached in memory, the better performance will be. As a rough guideline, your available RAM should be at least 50% of the total size of your Solr index files. For demanding applications, up to 100% of the index size may be necessary.

I/O caching is incremental rather than immediate, and some minutes of searches under load may be required to warm the caches. Don’t expect high performance until the caches are thoroughly warmed up.

An increasingly popular alternative is to use solid state disks (SSDs) instead of traditional hard disks. These are hundreds of times faster, and mean that cold searches should be reasonably fast. They also reduce the amount of RAM required to perhaps as little as 10% of the index size (although as always, this will require testing for the application in question).
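To make the rules of thumb in this section concrete, here is a small Python sketch that turns them into numbers. It simply encodes the three fractions quoted above (50%, 100% and 10%); the function name and its parameters are hypothetical, and the output is a starting point for testing rather than a sizing guarantee.

```python
def ram_guideline_gb(index_size_gb, storage="hdd", demanding=False):
    """Return a rough RAM figure (GB) to leave free for the OS disk cache.

    Encodes the rules of thumb from the text:
      - SSD storage: perhaps as little as 10% of the index size
      - demanding applications on hard disks: up to 100%
      - otherwise: at least 50%
    """
    if index_size_gb < 0:
        raise ValueError("index size cannot be negative")
    if storage == "ssd":
        fraction = 0.10
    elif demanding:
        fraction = 1.00
    else:
        fraction = 0.50
    return index_size_gb * fraction


if __name__ == "__main__":
    for label, kwargs in [("typical", {}),
                          ("demanding", {"demanding": True}),
                          ("ssd", {"storage": "ssd"})]:
        print(label, ram_guideline_gb(200, **kwargs), "GB")
```

For a 200GB index this suggests roughly 100GB for a typical application, 200GB for a demanding one and about 20GB when the index is on SSDs.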

9. Use a dedicated machine or VM

Don’t share your Solr servers with any other demanding processes such as SQL databases. For dependable performance, Solr should not have to compete with other processes for resources. VMs are an effective way of ring-fencing resources.

10. Use MMapDirectory and 64-bit systems

By default, Solr on 64-bit systems will open indexes with Lucene’s MMapDirectory, which memory-maps files rather than opening them for read/write/seek. Don’t change this! MMapDirectory allows for the most effective use of resources, in particular RAM (which as already described is a crucial resource for search performance).

11. Tune the Solr caches

The OS disk cache improves performance at the low level. At the higher level, Solr has a number of built-in caches which are stored in the JVM heap, and which can improve performance still further. These include the filter cache, the field value cache, the query result cache and the document cache. The filter cache is probably the most important to tune if you are using filtered queries extensively or faceting with the enum method – each entry in the filter cache takes up ( number of docs on shard / 8 ) bytes of space, so if you’ve got a cache limit of 4,000 then you’ll require (numDocs * 500) bytes to hold all of them. However, tuning all of these caches has the potential to improve performance.
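The filter cache arithmetic above is worth running for your own numbers. This is a minimal Python sketch of that calculation only: it treats every entry as a full bitset of one bit per document, as the formula in the text does, so the result is best read as a worst-case estimate. The function name is hypothetical.

```python
def filter_cache_bytes(num_docs, max_entries):
    """Worst-case heap needed for a full filter cache.

    Each entry is assumed to be a bitset with one bit per document
    on the shard, i.e. num_docs / 8 bytes (rounded up).
    """
    bytes_per_entry = (num_docs + 7) // 8
    return bytes_per_entry * max_entries


if __name__ == "__main__":
    total = filter_cache_bytes(num_docs=10_000_000, max_entries=4000)
    print(f"{total / 1e9:.1f} GB")  # numDocs * 500 bytes for 4,000 entries
```

With 10 million documents on a shard and a cache limit of 4,000, the estimate comes to 5GB of heap, which shows how quickly a generous cache limit can dominate the JVM heap.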

To tune the caches, you should allow Solr to run for a while with real or simulated search activity. Then go to the Plugin/Stats page in the admin web interface. The first important number in the cache statistics is ‘hitratio’. This should ideally be as close to 1.0 as possible, indicating that most lookups are being serviced by the cache. Then, ‘evictions’ indicates how many items have been removed from the cache due to limited space. This should ideally be as close to zero as possible, or at least much smaller than ‘lookups’.

If ‘evictions’ is high and ‘hitratio’ low, you should increase the maximum cache size in solrconfig.xml. It is impossible to say what a good starting point for a specific application is, but we often pick 4000.

If the cache is performing well, it may be worth reducing the maximum size and re-testing. The purpose of the maximum size is to prevent the cache growing without limit and filling the JVM heap, which links to point 12 below.
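The tuning loop described above can be summarised as a simple decision rule. The Python sketch below is one way to express it; the thresholds (a hit ratio of 0.8 and evictions at 10% of lookups) are purely illustrative assumptions, not Solr recommendations, and the function is hypothetical. The inputs correspond to the 'lookups', 'hits' and 'evictions' figures in the cache statistics.

```python
def cache_advice(lookups, hits, evictions,
                 min_hit_ratio=0.8, max_eviction_ratio=0.1):
    """Suggest a next step for one Solr cache from its statistics.

    Thresholds are illustrative defaults only and should be chosen
    per application.
    """
    if lookups == 0:
        return "no data yet: run more searches first"
    hit_ratio = hits / lookups
    eviction_ratio = evictions / lookups
    if eviction_ratio > max_eviction_ratio and hit_ratio < min_hit_ratio:
        return "increase the maximum cache size"
    if eviction_ratio <= max_eviction_ratio and hit_ratio >= min_hit_ratio:
        return "performing well: try reducing the maximum size and re-test"
    return "leave as is and keep monitoring"


if __name__ == "__main__":
    print(cache_advice(lookups=1000, hits=300, evictions=500))
    print(cache_advice(lookups=1000, hits=950, evictions=0))
```

Whatever thresholds you pick, re-run the same realistic search load after each change so that the before and after figures are comparable.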

See here for more information on Solr caches.

12. Minimise JVM heap space

Once you have tuned your Solr caches, try to reduce the maximum JVM heap (set with -Xmx) to a reasonably small size – big enough to hold the caches and all the other data required for searching and indexing, but not much bigger. There is a graphical depiction of the JVM heap in the Solr admin dashboard which allows a quick overview for rough tuning. For a better picture, it may be worth using a tool like JConsole to monitor the heap as the application is used.

The reason to reduce the heap size is to free RAM for the OS disk cache, as described in point 8.

Garbage collection (GC) can be a problem if the heap size is large. See here for information on GC tuning in Solr and other performance issues.

13. Handle multiple languages with multiple fields

Some search applications need to be able to support documents of different languages within the same index. This may conflict with the use of stemming, stopwords and synonyms to improve search accuracy. Furthermore, languages like Japanese are not tokenised by Solr in the same way as European languages, due to different conventions on word boundaries. One effective method for supporting multiple languages in an index with per-language term processing is outlined as follows. Note that this depends on knowing in advance what language a section of text is in.

First, create a variant of each text field in the index schema for each language to be supported. The schema.xml supplied with Solr has example fieldtypes for a wide range of languages which may be adapted as necessary. For example:

<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_fr" type="text_fr" indexed="true" stored="true"/>
<field name="content_jp" type="text_jp" indexed="true" stored="true"/>

Note the use of language codes to distinguish the names of the fields and fieldtypes. Then, when indexing each document, send each section of text to the appropriate field. E.g., if the document is entirely in English, send the whole thing to content_en. If it has sections in English, French and Japanese, send them to content_en, content_fr and content_jp respectively. This ensures that text is tokenised and normalised appropriately for its language.

Finally for searching, use the eDisMax query parser, and include all the language fields in the qf parameter (and pf, if using). E.g., in solrconfig.xml:

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="qf">content_en content_fr content_jp</str>
    <str name="pf">content_en content_fr content_jp</str>
    ...

When a search is executed with this handler, subqueries will be generated for each language with the appropriate term processing, and searched against each language text field. This approach should give the best precision and recall in a multi-language application.
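The indexing side of this scheme, sending each section of text to the field for its language, might look something like the following Python sketch. It assumes the three content_* fields shown above, a unique key field called "id", and that the language of each section is already known; the function name is hypothetical and the resulting dictionary would still need to be posted to Solr by your indexer.

```python
SUPPORTED_LANGUAGES = ("en", "fr", "jp")  # matches the example field names


def build_solr_doc(doc_id, sections):
    """Build a document dict, routing text to per-language fields.

    sections is a list of (language_code, text) pairs. Sections in the
    same language are joined with newlines so that each field stays
    single-valued.
    """
    doc = {"id": doc_id}  # "id" as the unique key is an assumption
    for lang, text in sections:
        if lang not in SUPPORTED_LANGUAGES:
            raise ValueError(f"no field configured for language {lang!r}")
        field = f"content_{lang}"
        doc[field] = f"{doc[field]}\n{text}" if field in doc else text
    return doc


if __name__ == "__main__":
    print(build_solr_doc("doc1", [("en", "Hello"),
                                  ("fr", "Bonjour"),
                                  ("en", "World")]))
```

Rejecting unknown language codes rather than guessing a field keeps badly identified text out of the wrong analysis chain.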


Posted in Reference, Technical

December 17th, 2013


Principles of Solr application design – part 1 of 2

We’ve been working internally on a document encapsulating how we build (and recommend others should build) search applications based on Apache Solr, probably the most popular open source search engine library. As an early Christmas present we’re releasing these as a two part series – if you have any feedback we’d welcome comments! So without further ado here’s the first part:

1. Use the latest release of Solr

Unless there are compelling reasons not to, such as reliance on a discontinued feature (which is rare), it is best to use the latest release of Solr, downloaded from http://lucene.apache.org/solr/ . Every minor release in the 4.x series has brought both functional and performance enhancements, and revision releases have fixed known bugs. Since the API (as a rule) remains backwards compatible, the potential gains in performance and utility should outweigh the minor inconvenience of the upgrade.

2. Use SolrCloud for scaling and robustness

Before the Solr 4 release, support for sharding (distributing a single search over many Solr instances) and replication (for robustness and scaling search load) involved a significant amount of manual configuration and development. The introduction of SolrCloud means that sharding and replication are now built into the core product, and can be used with simple configuration and no extra coding.

For trivial applications, SolrCloud may not be required, but it is the simplest way to build in robustness and scalability. There’s more about SolrCloud here.

3. Don’t expose the Solr API

Although Solr is not inherently insecure, neither is it designed to be exposed to end-users (and emphatically not to the internet at large). Anyone with access to the root Solr endpoint would be able to delete indexes, modify or insert items at will. Restricting access to search handlers (e.g. /solr/select) avoids this possibility, but is nonetheless a bad idea since it may allow users to construct arbitrary queries which could degrade performance or provide access to unauthorised data. Furthermore, there remains the slim possibility of security holes in the Solr API.

For these reasons, any external access to search should be through a proxy interface which is restricted to the functionality required by the application. Access to the Solr API should be restricted by network design and/or firewalls. This applies equally to AJAX UIs, which should talk to Solr via an intermediary web application rather than directly.

The intermediary code should perform at least some basic validation of parameters before sending to Solr, for example checking their type and ensuring that query strings are under a certain length (depending on the search interface). This allows attempts at compromising the system to be detected at an early stage and blocked.
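As an illustration of the kind of validation meant here, this Python sketch checks a small set of incoming parameters and passes on only the ones the application needs. The limits, the incoming parameter names ("q" and "page") and the function itself are assumptions for the example; the outgoing q, start and rows are standard Solr query parameters.

```python
MAX_QUERY_LENGTH = 200   # illustrative limit; depends on the search interface
ROWS_PER_PAGE = 10
MAX_PAGE = 100


def validate_search_params(params):
    """Validate user-supplied parameters before anything is sent to Solr.

    Returns a new dict containing only whitelisted Solr parameters,
    or raises ValueError if the input looks wrong.
    """
    query = params.get("q", "")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("query must be a non-empty string")
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError("query is too long")
    try:
        page = int(params.get("page", 1))
    except (TypeError, ValueError):
        raise ValueError("page must be an integer")
    if not 1 <= page <= MAX_PAGE:
        raise ValueError("page is out of range")
    return {
        "q": query.strip(),
        "start": (page - 1) * ROWS_PER_PAGE,
        "rows": ROWS_PER_PAGE,
    }


if __name__ == "__main__":
    print(validate_search_params({"q": " solr ", "page": "2"}))
```

Because the function builds a fresh dictionary, any unexpected parameters a user adds to the request are simply dropped rather than forwarded.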

4. Don’t use third-party Solr client libraries

The problem with third-party client libraries is that they create a tight coupling between the application and Solr. The Solr XML and JSON APIs are simple, and a wide range of client libraries for these formats are readily available for most programming languages. Third-party libraries are an unnecessary additional dependency and a potential source of bugs and unexpected behaviour. Another risk is that development may be discontinued for various reasons, meaning that future Solr features are not easily accessible.

The one exception to this rule is the SolrJ Java client library, since it is part of the general Solr release and is therefore fully compliant with and tested against the corresponding version of Solr.

5. Specify interfaces

All interfaces between components in the application must be agreed between sys ops and developers before development is started. Interfaces should be treated as contracts which software components adhere to. Early documentation of interfaces will reduce the risk of unexpected dependencies leading to problems in deployment.

As far as possible, interfaces should be RESTful web APIs and use standard formats such as JSON and XML. This creates loose coupling between components and also makes it easy to test functionality from the command line or a browser.

6. Put apps live early, on isolated systems

Development should be iterative, with short development cycles (no more than a few weeks). Code should be tested and deployed at the end of each cycle. By using isolated systems, fake data and/or limiting access to authorised testers, functionality and performance may be tested as soon as possible on a ‘live’ system, avoiding the risk of unexpected problems if deployment is postponed until the end of the development cycle.

7. Do realistic performance tests early and often

Except for very small indexes, search performance is often unpredictable, particularly under load. To ensure that performance meets requirements, testing a full index under load with realistic queries should be scheduled as early as possible in development. If you don’t have the data available to create a full index, simulate it (e.g. using freely available text such as Wikipedia).

As new functions, e.g. facets, are added performance characteristics may change significantly, so it is important that performance tests are part of every development cycle. JMeter is a popular tool for load testing; alternatively test scripts could be easily written in a language like Python.
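A Python load-test script along these lines need not be complicated. The sketch below is a minimal harness, not a replacement for JMeter: it runs a list of queries through a search function from several threads and reports latency percentiles. The search function is passed in by the caller, so how it talks to your application (for instance an HTTP request to a search endpoint) is left as an assumption, and all names here are hypothetical.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no values to summarise")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def timed_call(search_fn, query):
    """Return how long one call to search_fn(query) takes, in seconds."""
    start = time.perf_counter()
    search_fn(query)
    return time.perf_counter() - start


def run_load_test(search_fn, queries, concurrency=4):
    """Run all queries with the given concurrency and summarise latency."""
    if not queries:
        raise ValueError("at least one query is required")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda q: timed_call(search_fn, q), queries))
    return {
        "count": len(latencies),
        "median": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }


if __name__ == "__main__":
    # Stand-in search function; replace with a real call to your application.
    print(run_load_test(lambda q: time.sleep(0.01), ["solr", "lucene", "search"] * 10))
```

Feeding it queries taken from real search logs, where available, gives a far more realistic picture than a handful of hand-written test queries.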

More to come next week!


Posted in Reference, Technical

December 11th, 2013


Introducing Luwak, a library for high-performance stored queries

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.

We’re pleased to announce the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

// textfield, document and WHITESPACE are defined elsewhere

// Create a new monitor
Monitor monitor = new Monitor(new TermFilteredPresearcher());

// Create and load a stored query with a single term
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq);

// Load a document (which could be a news article)
InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, WHITESPACE)
    .build();

// Retrieve which queries it matches
DocumentMatches matches = monitor.match(doc);

The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). Our own tests have produced speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.
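For readers who want a feel for the 'upside down' idea without setting up Java, here is a toy Python sketch of pre-filtering. It is emphatically not how Luwak is implemented: stored queries here are just sets of terms that must all be present, tokenisation is a lowercase whitespace split, and the class and method names are invented for the example. What it does show is the principle of indexing the queries and using the document's terms to select a small set of candidates before checking each one fully.

```python
from collections import defaultdict


class ToyMonitor:
    """Toy stored-query matcher: index the queries, query with the document."""

    def __init__(self):
        self.queries = {}                   # query id -> set of required terms
        self.term_index = defaultdict(set)  # term -> ids of queries using it

    def update(self, query_id, required_terms):
        """Store (or replace) a query that requires all the given terms."""
        terms = {term.lower() for term in required_terms}
        if not terms:
            raise ValueError("a stored query needs at least one term")
        for old_term in self.queries.get(query_id, ()):
            self.term_index[old_term].discard(query_id)
        self.queries[query_id] = terms
        for term in terms:
            self.term_index[term].add(query_id)

    def match(self, text):
        """Return the ids of stored queries matched by the document text."""
        doc_terms = set(text.lower().split())
        # Pre-filter: only queries sharing a term with the document
        candidates = set()
        for term in doc_terms:
            candidates |= self.term_index.get(term, set())
        # Full check, run on the candidates only
        return sorted(qid for qid in candidates if self.queries[qid] <= doc_terms)


if __name__ == "__main__":
    monitor = ToyMonitor()
    monitor.update("query1", ["test"])
    monitor.update("query2", ["solr", "lucene"])
    monitor.update("query3", ["hadoop"])
    print(monitor.match("this is a test of Lucene and Solr"))
```

In this example the third stored query is never examined, because none of its terms appear in the document. With tens of thousands of stored expressions, skipping most of them in this way is where the saving comes from.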


Posted in Technical

December 6th, 2013
