This is not strictly a Flax post, but is intended to clarify the Xapian search architecture for people using Xapian directly. It’s not intended for experienced Xapian hackers, neither is it a general introduction to using Xapian (see here instead).
The Xapian API is fairly complex, and there is often confusion about the role of the QueryParser, terms, document values, document data etc. in indexing and searching. It is probably worth pointing out that Xapian has nothing that resembles a “field” – that is a higher level abstraction relevant to Flax components. Xapian just has documents, identified by an integer ID, which can contain:
- terms (usually words or short phrases, with optional positional information),
- values (usually short strings, often containing binary data) and
- document data (any data, but often some text suitable for display).
These three types of object are independent, though often related at an application level (terms are often derived from the document data). At indexing time (to be covered in a future post) terms may be generated with an instance of TextProcessor, or they may be generated directly. Terms are designed for general text searching – does a particular word appear in a document? Values are used for a variety of other purposes such as range searching (e.g. dates), sort keys, and application-specific code. The document data cannot be used for searching; it is available for the application to use post-search, e.g. for displaying document information in a list of search results.
When documents are added to a Xapian database, the terms, values and document data are stored separately in structures designed for the optimum lookup speed in each case. This is shown in the diagram below:
The search process here is shown in a greatly simplified conceptual form. But essentially, queries can match terms and/or values. A typical term query could be something like
llamas AND yaks
The Xapian matcher looks up the terms in the database and starts to collect a list of documents containing the terms. If the query consists only of terms, this list comprises the search results and is delivered to the client code as an MSet.
If, however, the query contains value range components (such as 1st January 2009 – 2nd April 2009), there is a second stage of searching. The matcher looks up the appropriate values for each of the documents in the candidate MSet and compares it to the query range. If it falls outside, the document is rejected, otherwise the document is passed on to the client.
Finally, it is possible to customise the search process by specifying a subclass of Xapian’s MatchDecider class. These have access to any of the document properties, but for reasons of performance usually utilise values. A typical application would be to filter search results according to a user’s security permissions. The match decider would compare the ACL of each candidate document with the user’s permissions, and reject or pass the document accordingly.
Although this model is oversimplified, it can help to think about it when designing a database schema and search design. In general, term and position lists are much larger on the disk than values, and therefore the performance of term searching is limited by I/O. This is why having enough RAM available for disk caching (10% of the DB size or more) can greatly increase search performance.
By contrast, value range searches and match deciders are more computationally intensive and tend to be CPU-bound. Without any terms to limit the size of the candidate MSet, pure value searches have to grind through the whole set of values, and can be very slow for large databases.
If range searches are required, a big performance win can be gained by adding terms to partition the database. For example, dates may indexed by YYYYMM terms with a suitable prefix. Then a date range search can be augmented by a term search which cuts down the number of documents to process. E.g. 16th January 2009 – 2nd April 2009 would produce the term query:
(XD200901 OR XD200902 OR XD200903 OR XD200904)
This covers more of January and April than necessary, but this will be taken care of in the subsequent range search. And the term query will greatly reduce the number of documents which need to be processed, increasing the search speed.
However, the right level of granularity must be chosen. If we indexed individual days rather than months then the date term query would have 76 terms, and the performance would be at risk of becoming I/O-bound.
All of this is pretty complicated, which is why we’re adding it to Flax. Our aim is for Flax to automatically generate the optimum schema and query design for any set of input documents, with just a few high-level hints from the user. Keep watching here for news.