Posts Tagged ‘indexing’

Updating individual fields in Lucene with a Redis-backed codec

A customer of ours has a potential search application which requires (largely for reasons of performance) the ability to update specific individual fields of Apache Lucene documents. This is not the first time that someone has asked for this functionality. However, until now, it has been impossible to change field values in a Lucene document without re-indexing the entire document. This is due to the write-once design of Lucene index segment files: changing a single value would mean rewriting the entire file.

However, the introduction of pluggable codecs in Lucene 4.0 means that the concrete representation of index segments has been abstracted away from search functionality, and can be specified by the codec designer. The motivation for this was to make it possible to experiment with new compression schemes and other innovations; however, it may also make it possible to overcome the current limitation of whole-document-only updates.

Andrzej Bialecki has proposed a “stacked update” design on top of the Lucene index format, in which changed fields are represented by “diff” documents which “overlay” the values of an existing document. If the “diff” document does not contain a certain field, then the value is taken from the original, overlaid document. This design is currently a work in progress.
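To make the overlay idea concrete, here is a minimal Python sketch. It is not the proposed Lucene design itself: documents are plain dicts, and the `overlay` function is a hypothetical name introduced only for illustration. A lookup checks the newest "diff" first and falls back to the original document when a field is absent.

```python
from collections import ChainMap


def overlay(original, *diffs):
    """Return a view of `original` with `diffs` stacked on top.

    Later diffs win over earlier ones. ChainMap searches its maps
    from left to right, so the newest diff is placed first.
    """
    return ChainMap(*reversed(diffs), original)


original = {"title": "Lucene in Action", "tags": "search", "body": "full text"}
diff = {"tags": "search, java"}

view = overlay(original, diff)
print(view["tags"])   # taken from the diff
print(view["title"])  # not in the diff, so taken from the original
```

The original document is never modified, which mirrors the write-once nature of the underlying segment files.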

Approaching the challenge independently, we have started to experiment with an alternative design, which makes a clear distinction between updatable and non-updatable fields. This is arguably a limitation, but one which may not be important in many practical applications (e.g. adding user tags to documents in a corpus). Non-updatable fields are stored using the standard Lucene codec, while updatable fields are stored externally by a codec that uses Redis, an open-source, flexible, fast key-value store. Updates to these fields could then be made directly in the Redis store using the JRedis library.
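The split between the two kinds of field can be sketched as follows. This is an illustration only, not the codec code: a plain dict stands in for Redis, and the class name and the `segment:doc_id:field` key scheme are assumptions made for the example.

```python
class UpdatableFieldStore:
    """Toy external store for updatable fields (a dict stands in for Redis)."""

    def __init__(self):
        self._kv = {}

    @staticmethod
    def _key(segment, doc_id, field):
        # Hypothetical key scheme: one key per (segment, document, field).
        return f"{segment}:{doc_id}:{field}"

    def set(self, segment, doc_id, field, value):
        self._kv[self._key(segment, doc_id, field)] = value

    def get(self, segment, doc_id, field, default=None):
        return self._kv.get(self._key(segment, doc_id, field), default)


store = UpdatableFieldStore()
store.set("_0", 42, "tags", "draft")
store.set("_0", 42, "tags", "draft, reviewed")  # in-place update, no re-indexing
print(store.get("_0", 42, "tags"))
```

The point of the design is visible in the second `set` call: changing a tag touches one key in the external store and leaves the write-once segment files alone.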

We have written a minimal, 2-day proof of concept, which can be checked out with:

    svn checkout http://flaxcode.googlecode.com/svn/trunk/LuceneRedisCodec

There is still a significant amount of work to be done to make this approach robust and performant (e.g. when Lucene merges segments, the Redis document IDs will have to be remapped). At this stage we would welcome any comments and suggestions about our approach from anyone who is interested in this area of functionality.
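As a rough illustration of the remapping problem, the sketch below rewrites externally stored keys after a merge. Everything here is hypothetical: the `(segment, doc_id, field)` key tuples, the shape of `doc_maps`, and the function name are assumptions for the example and are not taken from the proof of concept.

```python
def remap_on_merge(kv, doc_maps, merged_segment):
    """Rewrite keys of `kv` after source segments are merged.

    kv:        dict keyed by (segment, doc_id, field)
    doc_maps:  {old_segment: {old_doc_id: new_doc_id}}; documents deleted
               before the merge are simply absent from the inner map
    """
    remapped = {}
    for (segment, doc_id, field), value in kv.items():
        if segment not in doc_maps:
            # Segment was not part of this merge; keep the entry unchanged.
            remapped[(segment, doc_id, field)] = value
            continue
        new_id = doc_maps[segment].get(doc_id)
        if new_id is not None:
            remapped[(merged_segment, new_id, field)] = value
        # Otherwise the document was deleted, so its values are dropped.
    return remapped


kv = {
    ("_0", 0, "tags"): "a",
    ("_0", 1, "tags"): "b",  # document deleted before the merge
    ("_1", 0, "tags"): "c",
    ("_2", 5, "tags"): "d",  # segment not involved in the merge
}
doc_maps = {"_0": {0: 0}, "_1": {0: 1}}
print(remap_on_merge(kv, doc_maps, "_3"))
```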

Posted in Technical

June 22nd, 2012

5 Comments »

How to remove a stored field in Lucene

While working on a customer project recently we found a very large field that was stored unnecessarily in the Lucene index, taking up a lot of space. As it would have taken a very long time to re-index (there are tens of millions of complex documents in this case) we looked for a way to remove the stored field in-place.

There’s an interesting set of slides from last year’s Apache Lucene Eurocon which discusses this kind of Lucene index post-processing, but we didn’t find any tools to do this particular task (although this doesn’t mean they don’t exist – for example Luke may be helpful). So we wrote our own, based on some examples in the ‘contrib’ directory of Solr 4. We override the document() methods of FilterIndexReader to remove the required field from each returned Document’s field list. Terms aren’t touched, so the effect is exactly that of changing the field from stored to not stored; it’s still indexed.

The code is available here. It’s written against Lucene 2.9.3 (which is contained in Solr 1.4.1).
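The wrapping pattern itself is simple, and can be sketched in Python. This is not the tool linked above, which is written against the Java Lucene API; `InMemoryReader` and `FieldRemovingReader` are hypothetical stand-ins that only model stored fields as dicts.

```python
class InMemoryReader:
    """Hypothetical stand-in for an index reader; holds stored fields only."""

    def __init__(self, docs):
        self._docs = docs

    def max_doc(self):
        return len(self._docs)

    def document(self, doc_id):
        # Return a copy so callers cannot modify the underlying data.
        return dict(self._docs[doc_id])


class FieldRemovingReader:
    """Wraps another reader and hides one stored field from every document."""

    def __init__(self, inner, field_to_remove):
        self._inner = inner
        self._field = field_to_remove

    def max_doc(self):
        return self._inner.max_doc()

    def document(self, doc_id):
        doc = self._inner.document(doc_id)
        doc.pop(self._field, None)  # documents without the field pass through
        return doc


reader = InMemoryReader(
    [{"id": "1", "title": "Report", "raw_html": "<html>...</html>"}]
)
filtered = FieldRemovingReader(reader, "raw_html")
rewritten = [filtered.document(i) for i in range(filtered.max_doc())]
print(rewritten)
```

Reading every document through such a wrapper and writing the results to a new index is, we assume, how the filtered view ends up as a smaller index on disk.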

Posted in Technical

June 24th, 2011

No Comments »

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
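The principle of permission-aware results can be shown with a small Python sketch. It is a simplification under stated assumptions, not the project's implementation: the `read_acl` field, the `user:`/`group:` prefixes and both function names are invented for the example, and group membership is passed in directly rather than looked up via LDAP.

```python
def principals_for(user, groups):
    """Build the set of identities a logged-in user can act as."""
    return {f"user:{user}"} | {f"group:{g}" for g in groups}


def visible_results(results, user, groups):
    """Keep only results whose read ACL shares an identity with the user."""
    allowed = principals_for(user, groups)
    return [r for r in results if allowed & r["read_acl"]]


results = [
    {"path": "/docs/public.txt", "read_acl": {"group:staff"}},
    {"path": "/docs/payroll.xls", "read_acl": {"group:finance", "user:alice"}},
]

print([r["path"] for r in visible_results(results, "bob", ["staff"])])
print([r["path"] for r in visible_results(results, "alice", ["staff"])])
```

In a real system the same test would more likely be applied as a filter inside the query, so that result counts and facets also reflect only what the user may see.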

You can read more about the project in a case study (PDF), and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Further revolutions

Back for the second day of Lucene Revolution, with some great talks on migrating to Solr from FAST ESP, the new flexible indexing features coming to Lucene ‘real soon now’, and finishing off with a panel discussion. I felt privileged to sit as part of this panel between Eric Gries, CEO of Lucid Imagination, and Paul Doscher of Exalead – the discussion was lively and (I hope!) interesting to the audience.

I’m looking forward to returning to the UK with all I’ve learnt from this event, and to following up on some of the ideas generated – for example, it would be great to be able to demonstrate Lucid Works Enterprise to interested parties in London.

Thanks to Stephen Arnold’s team and all at Lucid Imagination for organising such a great conference. It won’t be the last I’m sure!

Posted in events

October 8th, 2010

No Comments »

flax.core 0.1 available

Charlie wrote previously that we try to work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    from datetime import datetime

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()  # commit pending changes to the database

Fields can be of type string, int, float or datetime. These are handled automatically, and the type is not tied to the fieldname (so a single field could hold values of different types in different documents, not that this is a good idea).

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths, SQL column names, etc.)
  • flax fieldnames
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
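To show how the layout of such a file maps onto a data structure, here is a small stand-alone parser. It is not flax.core's own parser: `parse_actions` is a hypothetical helper, and it assumes only what the example above shows, namely that an unindented line is an external identifier and each indented line below it is a `fieldname: action` pair.

```python
ACTIONS = """\
.//metadata[@name='Author']/@value
    author: filter(facet)
    author2: index(default)

.//metadata[@name='Year']/@value
    published: numeric
"""


def parse_actions(text):
    """Map each external identifier to a list of (fieldname, action) pairs."""
    rules = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines only separate groups
        if not line[0].isspace():
            # Unindented line: starts a new external identifier.
            current = line.strip()
            rules[current] = []
        else:
            if current is None:
                raise ValueError("action line before any identifier: %r" % line)
            field, _, action = line.strip().partition(":")
            rules[current].append((field.strip(), action.strip()))
    return rules


for identifier, actions in parse_actions(ACTIONS).items():
    print(identifier, actions)
```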

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Posted in Technical

June 24th, 2010

No Comments »

Packaged solutions and customisability, the Python way

With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has bindings for a range of languages including Perl, Python, PHP, Java, Ruby, C# and even Tcl, making integration relatively easy with systems where a particular language is preferred. However, for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.

We’re not alone in our preference for Python of course!

Posted in Technical

June 14th, 2010

No Comments »

Finding French TV with Flax

We’ve recently been working with mySkreen, who, like Hulu in the U.S., provide a service for finding and viewing television programmes via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.

mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). Using Flax, you can search using programme title, actors, genres or time periods. We also added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into more exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. We also added faceted search and automatic spelling correction.
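As a flavour of what translating a fuzzy time query involves, here is a minimal Python sketch. It is not the mySkreen parser: the word lists, the hour boundaries for each part of the day, and the function name are all assumptions chosen for illustration (and in English rather than French).

```python
from datetime import datetime, time, timedelta

# Assumed vocabularies; a real parser would cover far more phrases.
DAY_OFFSETS = {"today": 0, "tomorrow": 1}
DAY_PARTS = {"morning": (6, 12), "afternoon": (12, 18), "evening": (18, 23)}


def parse_fuzzy_time(query, now):
    """Return a (start, end) datetime range for the query, or None."""
    words = query.lower().split()
    offset = next((DAY_OFFSETS[w] for w in words if w in DAY_OFFSETS), None)
    part = next((DAY_PARTS[w] for w in words if w in DAY_PARTS), None)
    if offset is None and part is None:
        return None  # no time expression recognised
    day = (now + timedelta(days=offset or 0)).date()
    start_hour, end_hour = part or (0, 24)  # whole day if no part is given
    midnight = datetime.combine(day, time())
    return (midnight + timedelta(hours=start_hour),
            midnight + timedelta(hours=end_hour))


now = datetime(2009, 11, 26, 10, 0)
print(parse_fuzzy_time("tomorrow evening", now))
```

The resulting range can then be applied as an ordinary date filter alongside the rest of the query.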

This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.

Posted in News

November 26th, 2009

No Comments »

When real-time search isn’t

Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.

From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.

A lot of search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must be made across all these indexes of course, with care taken to maintain accurate statistics and thus relevancy ranking.
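The multi-index approach, and why the statistics need care, can be illustrated with a toy in-memory index. This is a sketch under simplifying assumptions, not how any particular engine does it: `TinyIndex` and the helper functions are hypothetical, document ids are assumed to be unique across indexes, and a smoothed inverse document frequency stands in for whatever statistics the ranking function needs.

```python
import math


class TinyIndex:
    """Toy inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = {}
        self.doc_count = 0

    def add(self, doc_id, text):
        self.doc_count += 1
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)

    def merge_from(self, other):
        """Fold another index into this one (doc ids must not overlap)."""
        for term, ids in other.postings.items():
            self.postings.setdefault(term, set()).update(ids)
        self.doc_count += other.doc_count


def search(term, indexes):
    """Union the matches from every index."""
    hits = set()
    for index in indexes:
        hits |= index.postings.get(term, set())
    return hits


def global_idf(term, indexes):
    """Smoothed IDF computed from totals across all indexes."""
    total_docs = sum(index.doc_count for index in indexes)
    doc_freq = sum(len(index.postings.get(term, ())) for index in indexes)
    return math.log((1 + total_docs) / (1 + doc_freq))


static, fresh = TinyIndex(), TinyIndex()
static.add("d1", "open source search")
static.add("d2", "enterprise search appliance")
fresh.add("d3", "real-time twitter search")

print(search("search", [static, fresh]))
idf_before = global_idf("twitter", [static, fresh])

# Periodic merge: fold the fresh index into the static one and start anew.
static.merge_from(fresh)
fresh = TinyIndex()
idf_after = global_idf("twitter", [static, fresh])
print(idf_before == idf_after)
```

Because the statistics are summed over all indexes, a term's weight is the same before and after the merge; computing it per index would make scores jump each time the fresh index is folded in.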

The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
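One simple way to combine the two, offered here as a sketch rather than a recommendation, is a weighted blend of a relevance score and a freshness score that decays with age. The function name, the half-life decay and the assumption that relevance is already normalised to the range 0 to 1 are all choices made for this example.

```python
def blended_score(relevance, age_hours, recency_weight=0.5, half_life_hours=24.0):
    """Blend relevance (assumed to be in 0..1) with exponentially decaying freshness.

    recency_weight=0.0 ranks purely by relevance,
    recency_weight=1.0 ranks purely by recency.
    """
    freshness = 0.5 ** (age_hours / half_life_hours)
    return (1.0 - recency_weight) * relevance + recency_weight * freshness


results = [
    {"id": "old-but-relevant", "relevance": 0.9, "age_hours": 72},
    {"id": "new-but-vague", "relevance": 0.4, "age_hours": 1},
]


def rank(results, recency_weight):
    return [
        r["id"]
        for r in sorted(
            results,
            key=lambda r: blended_score(r["relevance"], r["age_hours"], recency_weight),
            reverse=True,
        )
    ]


print(rank(results, recency_weight=0.0))  # relevance only
print(rank(results, recency_weight=1.0))  # recency only
```

Exposing `recency_weight` to the user (or offering a few presets) gives the option described above of choosing one extreme or the other.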

In any case there will always be some delay between content being published and being searchable – the trick is to keep this to the minimum, so it appears as ‘real-time’ as possible.

Posted in News, Technical

November 5th, 2009

2 Comments »

Search requirements and asking the right questions

When we’re contacted by potential clients, we have to gather as much information as possible about how and why they need search technology. This takes the form of either a face-to-face or telephone meeting and much scribbling in notebooks, or a long exchange of emails. In all cases there are some important questions that must be answered, and I thought it might be useful to list the most common ones here:

How many items do you need to search?

The number of items to search varies widely, from a few thousand to hundreds of millions. This number impacts both the eventual size of a searchable index and how fast it can be built, and will thus inform the eventual system design, both in hardware and software terms. It’s usually possible to search from 5 to 50 million items on a single server – but this also depends on the answer to the next questions:

How big/complex are the items to be searched?

This includes both the size of each item and what data it contains: for example does each item contain a price, or a characteristic like an author’s name, or colour. The item can be part of a group of items, have user tags applied, or be restricted to a certain group of users. The searchable index we build will have to take account of all this information in the correct way, so we can search it effectively.

What other systems must the search engine work with?

Sometimes search engines will have to fit into an existing infrastructure – say an intranet or web application framework – and sometimes they will have to extract information from another system, such as a relational database. The engine may also have to take account of existing security systems, which can impact how each search result is delivered. It may have to deliver search results as a web page, or as a report, or as an email. There’s obviously a huge variety of possible systems to interact with, not least the operating system or platform.

What’s your schedule for delivering a search solution?

This is another key point – it can be relatively quick to build a simple search application, but if the system is going to be very large or very complex, or if a staged delivery based on user feedback is required, then it’s important to know what the expectations are. We’ve installed systems in a couple of days, and built more complex ones over years.

In all cases it’s important to realise that every client will have different requirements and expectations. To be sure that everyone ends up satisfied with the end result, the more information we can gather at the start of the process, the better.

Posted in Business

March 19th, 2009

No Comments »