Updating individual fields in Lucene with a Redis-backed codec

A customer of ours has a potential search application which requires (largely for reasons of performance) the ability to update specific individual fields of Apache Lucene documents. This is not the first time that someone has asked for this functionality. However, until now it has been impossible to change field values in a Lucene document without re-indexing the entire document. This is a consequence of the write-once design of Lucene index segment files: changing even a single value would require rewriting the entire file.

However, the introduction of pluggable codecs in Lucene 4.0 means that the concrete representation of index segments has been abstracted away from search functionality and can be specified by the codec designer. The motivation for this was to make it possible to experiment with new compression schemes and other innovations, but it may also make it possible to overcome the current limitation of whole-document-only updates.

Andrzej Bialecki has proposed a “stacked update” design on top of the Lucene index format, in which changed fields are represented by “diff” documents which “overlay” the values of an existing document. If the “diff” document does not contain a certain field, then the value is taken from the original, overlaid document. This design is currently a work in progress.

Approaching the challenge independently, we have started to experiment with an alternative design, which makes a clear distinction between updatable and non-updatable fields. This is arguably a limitation, but one which may not be important in many practical applications (e.g. adding user tags to documents in a corpus). Non-updatable fields are stored using the standard Lucene codec, while updatable fields are stored externally by a codec that uses Redis, an open-source, flexible, fast key-value store. Updates to these fields can then be made directly in the Redis store using the JRedis library.
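To give a flavour of the split, here is a minimal sketch (not the POC code itself) of how Lucene 4.0's per-field codec hooks allow this kind of routing. The RedisPostingsFormat class is a hypothetical stand-in for the Redis-backed postings format:

import java.util.Set;

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;

// Minimal sketch: send updatable fields to a hypothetical Redis-backed
// postings format, leave everything else on the standard Lucene 4.0 format.
public class RedisBackedCodec extends Lucene40Codec {

    private final Set<String> updatableFields;
    private final PostingsFormat redisFormat = new RedisPostingsFormat(); // hypothetical

    public RedisBackedCodec(Set<String> updatableFields) {
        this.updatableFields = updatableFields;
    }

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        if (updatableFields.contains(field)) {
            return redisFormat; // postings live in Redis and can be changed in place
        }
        return super.getPostingsFormatForField(field); // standard write-once format
    }
}

A codec like this would be installed via IndexWriterConfig.setCodec(), as with any custom codec.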

We have written a minimal, 2-day proof of concept, which can be checked out with:

svn checkout http://flaxcode.googlecode.com/svn/trunk/LuceneRedisCodec

There is still a significant amount of work to be done to make this approach robust and performant (e.g. when Lucene merges segments, the Redis document IDs will have to be remapped). At this stage we would welcome any comments and suggestions about our approach from anyone who is interested in this area of functionality.



5 Responses to “Updating individual fields in Lucene with a Redis-backed codec”

  1. Mike McCandless 22nd June, 2012

    Very cool!

    I think on merge you shouldn’t have to remap document IDs? Once a segment is written, its docIDs are fixed, and merging just writes a new segment. So I think it should “just work”.

  2. Tom

    Thanks Mike!

    The merge issue is down to the fact that Lucene segments do get replaced during a merge. For example, say I have two segments, each with three docs:

    [1, 2, 3] + [1, 2, 3]

    then after merging we will just have

    [1, 2, 3, 4, 5, 6]

    and the Redis codec will have to know about this. (I’m not the guy who implemented the POC so my understanding might be a bit off...)

  3. Mike McCandless 30th June, 2012

    Hi Tom,

    The codec is in fact used to write the newly merged segment, so the Redis codec will see docs 1-6 being written. So I think you’ll be fine there.

    Though, likely you’ll need to do something (maybe make a Directory wrapper?) to delete the postings from Redis when Lucene deletes segment files (after merge).

    Mike

    http://blog.mikemccandless.com

  4. Alan Woodward 2nd July, 2012

    Hello, author of the POC here.

    Mike is right here: merged segments are written through the codec, so the document id remapping happens automatically. I hadn’t got as far as dealing with segment deletes yet, but it should be pretty simple – segment names are included in the Redis key names, and it’s trivial to get a list of keys that match a given pattern (in this case, *_segmentname) and delete them.

    The other obvious improvement here is to generalise the updates to use the existing codec writing machinery. At the moment we’re using a very naive postings format (basically just a list of integers, no skip lists or compression, and no support for frequency or position information). It should be possible to write something that combines an existing DocsEnum/DocsAndPositionsEnum with a series of Diffs, so that we can store postings data in one of the existing compressed formats and then rewrite term entries by streaming the data in, applying the diffs, and writing it out again in a format-independent fashion.

  5. [...] postings data in Redis, as part of a proof-of-concept project investigating updateable fields (see this blog post for more [...]
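Putting Mike’s Directory-wrapper suggestion (comment 3) together with the key-per-segment naming scheme Alan describes (comment 4), the cleanup might look something like the rough sketch below. It assumes the Lucene 4.0 Directory API and a JRedis-style client (whose keys() and del() calls may differ in detail); RedisAwareDirectory and the key layout are illustrative, not the POC’s actual code.

import java.io.IOException;
import java.util.Collection;
import java.util.List;

import org.apache.lucene.index.IndexFileNames;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;
import org.jredis.JRedis;
import org.jredis.RedisException;

public class RedisAwareDirectory extends Directory {

    private final Directory delegate;
    private final JRedis redis;

    public RedisAwareDirectory(Directory delegate, JRedis redis) {
        this.delegate = delegate;
        this.redis = redis;
    }

    @Override
    public void deleteFile(String name) throws IOException {
        delegate.deleteFile(name);
        // When Lucene drops a segment's files (e.g. after a merge), drop the
        // matching postings from Redis too, using the *_segmentname key scheme
        // from comment 4. This runs once per deleted file, but it is idempotent.
        try {
            String segment = IndexFileNames.parseSegmentName(name);
            List<String> keys = redis.keys("*_" + segment);
            for (String key : keys) {
                redis.del(key);
            }
        } catch (RedisException e) {
            throw new IOException(e);
        }
    }

    // The remaining methods simply delegate (lock-factory methods, omitted
    // here for brevity, would delegate in the same way):
    @Override public String[] listAll() throws IOException { return delegate.listAll(); }
    @Override public boolean fileExists(String name) throws IOException { return delegate.fileExists(name); }
    @Override public long fileLength(String name) throws IOException { return delegate.fileLength(name); }
    @Override public IndexOutput createOutput(String name, IOContext ctx) throws IOException { return delegate.createOutput(name, ctx); }
    @Override public void sync(Collection<String> names) throws IOException { delegate.sync(names); }
    @Override public IndexInput openInput(String name, IOContext ctx) throws IOException { return delegate.openInput(name, ctx); }
    @Override public void close() throws IOException { delegate.close(); }
}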
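As for the format-independent rewrite Alan sketches at the end of comment 4, the core operation is merging an existing postings stream with a set of per-term diffs. A toy version follows, buffering in memory for clarity where a real implementation would stream and merge lazily; the added/deleted-docs representation is an assumption, not the POC’s Diff classes.

import java.io.IOException;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.search.DocIdSetIterator;

public class PostingsDiffs {

    // Drain one term's postings from an existing DocsEnum, drop deleted docs,
    // add new docs, and return the updated posting list in doc ID order,
    // ready to be written back out through any PostingsFormat.
    public static SortedSet<Integer> apply(DocsEnum base,
                                           Set<Integer> addedDocs,
                                           Set<Integer> deletedDocs) throws IOException {
        SortedSet<Integer> result = new TreeSet<Integer>(addedDocs);
        int doc;
        while ((doc = base.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
            if (!deletedDocs.contains(doc)) {
                result.add(doc);
            }
        }
        return result;
    }
}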
