While working on a customer project recently, we found a very large field that was stored unnecessarily in the Lucene index, taking up a lot of space. Re-indexing would have taken a very long time (there are tens of millions of complex documents in this case), so we looked for a way to remove the stored field in place.
There’s an interesting set of slides from last year’s Apache Lucene Eurocon that discusses this kind of Lucene index post-processing, but we didn’t find any tools for this particular task (that doesn’t mean none exist; Luke, for example, may be helpful). So we wrote our own, based on some examples in the ‘contrib’ directory of Solr 4. We override the document() methods of FilterIndexReader to remove the unwanted field from each returned Document’s field list. Terms aren’t touched, so the effect really is like changing the field from stored to not stored; it’s still indexed.
The code is available here. It’s written against Lucene 2.9.3 (which is contained in Solr 1.4.1).
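To make the idea concrete without depending on a particular Lucene version, here is a small library-free model of the wrapper pattern. It is not the code linked above: every type in it (Reader, MemoryReader, StripStoredFieldReader) is a hypothetical stand-in. In the real tool, FilterIndexReader and its document() methods play the part of the wrapper. We also assume, without modelling it here, that the filtered reader is then copied into a fresh index, for example via IndexWriter.addIndexes(IndexReader[]).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Library-free model; every type here is a hypothetical stand-in, not Lucene API.
class StoredFieldStripDemo {

    /** Stand-in for a reader: stored fields per document, indexed terms per field. */
    interface Reader {
        int maxDoc();
        Map<String, String> document(int n); // stored fields of document n
        List<String> terms(String field);    // indexed terms of a field
    }

    /** Simple in-memory reader used as the "original index". */
    static class MemoryReader implements Reader {
        private final List<Map<String, String>> stored = new ArrayList<>();
        private final Map<String, List<String>> terms = new LinkedHashMap<>();

        void add(Map<String, String> storedFields) { stored.add(storedFields); }
        void index(String field, String... fieldTerms) { terms.put(field, Arrays.asList(fieldTerms)); }

        public int maxDoc() { return stored.size(); }
        public Map<String, String> document(int n) { return stored.get(n); }
        public List<String> terms(String field) {
            List<String> t = terms.get(field);
            return t == null ? new ArrayList<String>() : t;
        }
    }

    /** Wrapper: delegates everything, but drops one stored field from each document. */
    static class StripStoredFieldReader implements Reader {
        private final Reader in;
        private final String fieldToStrip;

        StripStoredFieldReader(Reader in, String fieldToStrip) {
            this.in = in;
            this.fieldToStrip = fieldToStrip;
        }

        public int maxDoc() { return in.maxDoc(); }

        public Map<String, String> document(int n) {
            // Copy first, so the wrapped reader's own data is never modified.
            Map<String, String> copy = new LinkedHashMap<>(in.document(n));
            copy.remove(fieldToStrip);
            return copy;
        }

        // Terms pass through untouched, so the field stays searchable.
        public List<String> terms(String field) { return in.terms(field); }
    }

    /** Builds a one-document sample index with a large stored "body" field. */
    static Reader sample() {
        MemoryReader r = new MemoryReader();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "42");
        doc.put("body", "a very large stored value");
        r.add(doc);
        r.index("body", "large", "stored", "value");
        return r;
    }

    public static void main(String[] args) {
        Reader stripped = new StripStoredFieldReader(sample(), "body");
        System.out.println(stripped.document(0)); // stored fields without "body"
        System.out.println(stripped.terms("body")); // terms are still there
    }
}
```

Only the stored-field path is intercepted; everything else is delegated unchanged. That is why the result behaves as if the field had been indexed but never stored.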