Comments on: Updating individual fields in Lucene with a Redis-backed codec
The Open Source Search Specialists
Tue, 12 Feb 2019 14:44:32 +0000

By: Aditya Tripathi Sat, 14 Jun 2014 03:21:40 +0000 Thanks Charlie for your response.

By: charlie Fri, 13 Jun 2014 16:30:30 +0000 Hi Aditya,

Thanks for your comment. The Redis Codec was just a proof of concept and we haven’t taken the idea any further. I think the Lucene mailing list (which I note you have already posted to) will be your best source of further help. There have been some recent improvements to Lucene DocValues which might also be worth investigating.


By: Aditya Tripathi Fri, 13 Jun 2014 07:13:18 +0000 Could you implement this? We tried a similar (though not identical) approach, but we ran into some problems that could not be solved with the codec approach. Please share your thoughts and comments on the following:

Problem 1: Merge – Lucene Merge Thread keeps the new merged segment as a checkpointed segment and it is not committed.
There are two possible approaches here:
a) The custom Postings Consumer/Terms Consumer does not write the merged information (docId renumbering info) to redis and instead stores it in memory. The Partial Producer can search this in-memory structure for the new merged segment. Or,
b) At merge, write the new merged state to the redis store.

The problem with approach b) is that a Reader may not yet have been opened on the new merged segment, while the redis store has already removed the old segments that were merged. The search will fail in this case.

The problem with approach a) is that you can write the in-memory merge info to redis only at the next flush, because a custom PostingsFormat is invoked only at flush() or merge(). However, in a case like Solr’s Optimize command, there can be a commit without anything to flush. In this case, the in-memory merge info will not reach the redis store, but the uncommitted merged segment is committed in the Lucene index. This is not a problem for search. However, if you do something like Optimize and then replicate, replication will pick up inconsistent information from stable storage.
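
To make the failure mode of approach a) concrete, here is a minimal sketch. It is a toy model only: a plain dict stands in for the redis store, and the class name `BufferedMergeStore`, its methods and the segment names are hypothetical, not Lucene or Redis APIs. It assumes, as described above, that the custom format is called only at flush or merge.

```python
from typing import Dict, List, Optional


class BufferedMergeStore:
    """Toy model of approach a): merge info is buffered until the next flush."""

    def __init__(self) -> None:
        self.durable: Dict[str, List[int]] = {}  # stands in for the redis store
        self.pending: Dict[str, List[int]] = {}  # merge info held only in memory

    def on_flush(self, segment: str, postings: List[int]) -> None:
        # A flush writes the new segment and drains any buffered merge info.
        self.durable[segment] = postings
        self.durable.update(self.pending)
        self.pending.clear()

    def on_merge(self, segment: str, postings: List[int]) -> None:
        # A merge only buffers; nothing reaches the durable store yet.
        self.pending[segment] = postings

    def lookup(self, segment: str) -> Optional[List[int]]:
        # Search consults the in-memory buffer first, then the durable store.
        if segment in self.pending:
            return self.pending[segment]
        return self.durable.get(segment)


store = BufferedMergeStore()
store.on_flush("_0", [1, 2, 3])
store.on_flush("_1", [1, 2, 3])
store.on_merge("_2", [1, 2, 3, 4, 5, 6])

# A commit with nothing to flush happens here: no callback reaches the store.
searchable = store.lookup("_2") is not None   # search still works
replicable = "_2" in store.durable            # but stable storage is stale
```

In this model search keeps working because `lookup` sees the buffer, while anything copying `durable` at that point would miss segment `_2` until some later flush drains the buffer.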

Problem 2: Replication – How do you sync up the indexed data in Lucene’s index dir and the redis data directory? Both are written to stable storage asynchronously.

Problem 3: As mentioned above, the custom PostingsFormat is only invoked at flush or merge. If you add documents and then update this field while both are still in the same unflushed segment, the update is not possible, because the field with the custom postings format has not yet been written anywhere.

Problem 4: LiveDocs issues. Lucene can mark a document dead, and the custom postings format will get this information only at merge time. This appears not to be a problem for search, because a dead doc will be discovered by DocsEnum during the search process. But it is a problem when reindexing docs, if the reindexed docs go to an uncommitted segment.

Problem 5: A segment with no live docs is dropped at the next commit. This drop information does not reach the custom postings consumer, and it becomes messy to check redis at every flush for a segment whose docs are all dead. Again, dead docs or dropped segments remaining in redis might not be a critical issue to solve, but it depends on your reindexing requirements.

By: Writing a new Lucene Codec | Romsey Software Wed, 04 Jul 2012 14:33:24 +0000 […] postings data in Redis, as part of a proof-of-concept project investigating updateable fields (see this blog post for more […]

By: Alan Woodward Mon, 02 Jul 2012 08:08:23 +0000 Hello, author of the POC here.

Mike is right here, merged segments are written through the codec so the document id remapping happens automatically. I hadn’t got as far as dealing with segment deletes yet, but it should be pretty simple – segment names are included in the redis key names, and it’s trivial to get a list of keys that match a given pattern (in this case, *_segmentname) and delete them.
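
The key cleanup described above could look roughly like this. It is only a sketch: a dict stands in for redis, the key layout `field_term_segmentname` and the function name `delete_segment_keys` are assumptions made for illustration, and the glob matching is done locally with `fnmatch` rather than by the server. Against a real server the same pattern would presumably be passed to a KEYS or SCAN ... MATCH call.

```python
from fnmatch import fnmatchcase
from typing import Dict, List


def delete_segment_keys(store: Dict[str, List[int]], segment_name: str) -> List[str]:
    """Delete every key ending in _<segment_name>; return the deleted keys."""
    pattern = f"*_{segment_name}"
    doomed = [key for key in store if fnmatchcase(key, pattern)]
    for key in doomed:
        del store[key]
    return doomed


# Hypothetical key layout: <field>_<term>_<segment name>
store = {
    "tags_red__0": [1, 3],
    "tags_blue__0": [2],
    "tags_red__1": [1],
}
removed = delete_segment_keys(store, "_0")
```

After the call only the key for segment `_1` is left in the store.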

The other obvious improvement here is to generalise the updates to use the existing codec writing machinery. At the moment we’re using a very naive postings format (basically just a list of integers, no skip lists or compression, and no support for frequency or position information). It should be possible to write something that combines an existing DocsEnum/DocsAndPositionsEnum with a series of Diffs, so that we can store postings data in one of the existing compressed formats and then rewrite term entries by streaming the data in, applying the diffs, and writing it out again in a format-independent fashion.
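
As a rough illustration of the "stream in, apply diffs, write out" idea, the sketch below treats a postings list as a sorted stream of docIDs and a diff as a set of added and removed docIDs. The function name `apply_diffs` and the diff representation are assumptions for the example; it ignores frequencies and positions and is not tied to any actual Lucene enum.

```python
from typing import Iterable, Iterator, Set


def apply_diffs(postings: Iterable[int], added: Set[int], removed: Set[int]) -> Iterator[int]:
    """Stream sorted docIDs, dropping removed ones and merging in added ones."""
    pending = sorted(added - removed)  # additions still waiting to be emitted
    i = 0
    for doc_id in postings:
        # Emit any additions that sort before the current docID.
        while i < len(pending) and pending[i] < doc_id:
            yield pending[i]
            i += 1
        # An addition that is already present must not be emitted twice.
        if i < len(pending) and pending[i] == doc_id:
            i += 1
        if doc_id not in removed:
            yield doc_id
    # Additions beyond the end of the original list.
    yield from pending[i:]


rewritten = list(apply_diffs([1, 4, 7, 9], added={2, 7, 10}, removed={4}))
```

Because the function only consumes an iterator and yields docIDs in order, the output could in principle be handed to whatever writer the underlying format provides, without the whole list being held in memory.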

By: Mike McCandless Sat, 30 Jun 2012 10:17:40 +0000 Hi Tom,

The codec is in fact used to write the newly merged segment, so the Redis codec will see docs 1-6 being written. So I think you’ll be fine there.

Though, likely you’ll need to do something (maybe make a Directory wrapper?) to delete the postings from Redis when Lucene deletes segment files (after merge).


By: Tom Mon, 25 Jun 2012 08:15:21 +0000 Thanks Mike!

The merge issue is down to the fact that Lucene segments do get replaced during a merge. e.g. say I have two segments, each with three docs:

[1, 2, 3] + [1, 2, 3]

then after merging we will just have

[1, 2, 3, 4, 5, 6]

and the redis codec will have to know about this. (I’m not the guy who implemented the POC so my understanding might be a bit off..)
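
The renumbering in the example above can be sketched as follows. This is a simplified model, not Lucene's merge code: it keeps the 1-based numbering used in the example (Lucene's own docIDs start at 0), assumes there are no deleted docs, and the function name `merge_postings` and the sample terms are made up for illustration.

```python
from typing import Dict, List


def merge_postings(segments: List[Dict[str, List[int]]], doc_counts: List[int]) -> Dict[str, List[int]]:
    """Concatenate per-segment postings, shifting docIDs by each segment's base."""
    merged: Dict[str, List[int]] = {}
    doc_base = 0  # number of docs in all earlier segments
    for postings, doc_count in zip(segments, doc_counts):
        for term, doc_ids in postings.items():
            merged.setdefault(term, []).extend(doc_base + d for d in doc_ids)
        doc_base += doc_count
    return merged


# Two segments of three docs each, with segment-local docIDs 1..3.
seg_a = {"lucene": [1, 3], "redis": [2]}
seg_b = {"lucene": [2], "codec": [1, 3]}
merged = merge_postings([seg_a, seg_b], [3, 3])
```

Docs 1 to 3 of the second segment become docs 4 to 6 in the merged segment, so `lucene` ends up with postings 1, 3 and 5.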

By: Mike McCandless Fri, 22 Jun 2012 18:25:20 +0000 Very cool!

I think on merge you shouldn’t have to remap document IDs? Once a segment is written, its docIDs are fixed, and merging just writes a new segment. So I think it should “just work”.
