Out with the old – and in with the new Lucene query parser?

Over the years we’ve dealt with quite a few migration projects where the query syntax of the client’s existing search engine must be preserved. This might be because other systems (or users) depend on it, or a large number of stored expressions exist and it is difficult or uneconomic to translate them all by hand. Our usual approach is to write a query parser, which understands the current syntax but creates a query suitable for a modern open source search engine based on Apache Lucene. We’ve done this for legacy engines including dtSearch and Verity and also for in-house query languages developed by clients themselves. This allows you to keep the existing syntax but improve performance, scalability and accuracy of your search engine.

There are a few points to note during this process:

  • What appears to be a simple query in your current language may not translate to a simple Lucene query, which may lead to performance issues if you are not careful. Wildcards for example can be very expensive to process.
  • You cannot guarantee that the new search system will return exactly the same results, in the same order, as the old one, no matter how carefully the query parser is designed. After all, the underlying search engine algorithms are different.
  • Some element of manual translation may be necessary for particularly large, complex or unusual queries, especially if the original intention of the person who wrote the query is unclear.
  • You may want to create a vendor-neutral query language as an intermediate step – so you can migrate more easily next time. We’ve done this for Danish media monitors Infomedia.
  • If you have particularly large and/or complex queries that may have been added to incrementally over time, they may contain errors or logistical inconsistencies – which your current engine may not be telling you about! If you find these you have two choices: fix the query expression (which may then give you slightly different results) or make the new system give the same (incorrect) results as before.

To mitigate these issues it is important to decide on a test set of queries and expected results, and what level of ‘correctness’ is required – bearing in mind 100% is going to be difficult if not impossible. If you are dealing with languages outside the experience of the team you should also make sure you have access to a native speaker – so you can be sure that results really are relevant!

Do let us know if you’re planning this kind of migration and how we can help – building Lucene query parsers is not a simple task and some past experience can be invaluable.

Leave a Reply

Your email address will not be published. Required fields are marked *