Xapian Search Architecture

This is not strictly a Flax post, but is intended to clarify the Xapian search architecture for people using Xapian directly. It’s not intended for experienced Xapian hackers, neither is it a general introduction to using Xapian (see here instead).

The Xapian API is fairly complex, and there is often confusion about the role of the QueryParser, terms, document values, document data etc. in indexing and searching. It is probably worth pointing out that Xapian has nothing that resembles a “field” – that is a higher level abstraction relevant to Flax components. Xapian just has documents, identified by an integer ID, which can contain:

  • terms (usually words or short phrases, with optional positional information),
  • values (usually short strings, often containing binary data) and 
  • document data (any data, but often some text suitable for display).

These three types of object are independent, though often related at an application level (terms are often derived from the document data). At indexing time (to be covered in a future post) terms may be generated with an instance of TextProcessor, or they may be generated directly. Terms are designed for general text searching – does a particular word appear in a document? Values are used for a variety of other purposes such as range searching (e.g. dates), sort keys, and application-specific code. The document data cannot be used for searching; it is available for the application to use post-search, e.g. for displaying document information in a list of search results.

When documents are added to a Xapian database, the terms, values and document data are stored separately in structures designed for the optimum lookup speed in each case. This is shown in the diagram below:

Xapian Search Architecture

The search process here is shown in a greatly simplified conceptual form. But essentially, queries can match terms and/or values. A typical term query could be something like

llamas AND yaks

The Xapian matcher looks up the terms in the database and starts to collect a list of documents containing the terms. If the query consists only of terms, this list comprises the search results and is delivered to the client code as an MSet.

If, however, the query contains value range components (such as 1st January 2009 – 2nd April 2009), there is a second stage of searching. The matcher looks up the appropriate values for each of the documents in the candidate MSet and compares it to the query range. If it falls outside, the document is rejected, otherwise the document is passed on to the client.

Finally, it is possible to customise the search process by specifying a subclass of Xapian’s MatchDecider class. These have access to any of the document properties, but for reasons of performance usually utilise values. A typical application would be to filter search results according to a user’s security permissions. The match decider would compare the ACL of each candidate document with the user’s permissions, and reject or pass the document accordingly.

Search optimisation

Although this model is oversimplified, it can help to think about it when designing a database schema and search design. In general, term and position lists are much larger on the disk than values, and therefore the performance of term searching is limited by I/O. This is why having enough RAM available for disk caching (10% of the DB size or more) can greatly increase search performance.

By contrast, value range searches and match deciders are more computationally intensive and tend to be CPU-bound. Without any terms to limit the size of the candidate MSet, pure value searches have to grind through the whole set of values, and can be very slow for large databases.

If range searches are required, a big performance win can be gained by adding terms to partition the database. For example, dates may indexed by YYYYMM terms with a suitable prefix. Then a date range search can be augmented by a term search which cuts down the number of documents to process. E.g. 16th January 2009 – 2nd April 2009 would produce the term query:

(XD200901 OR XD200902 OR XD200903 OR XD200904)

This covers more of January and April than necessary, but this will be taken care of in the subsequent range search. And the term query will greatly reduce the number of documents which need to be processed, increasing the search speed.

However, the right level of granularity must be chosen. If we indexed individual days rather than months then the date term query would have 76 terms, and the performance would be at risk of becoming I/O-bound.

All of this is pretty complicated, which is why we’re adding it to Flax. Our aim is for Flax to automatically generate the optimum schema and query design for any set of input documents, with just a few high-level hints from the user. Keep watching here for news.

Share this postShare on FacebookShare on Google+Tweet about this on TwitterShare on LinkedInShare on RedditEmail this to someone

7 thoughts on “Xapian Search Architecture

  1. Pingback: Xapian terms, values and data explained « Searching with Xapian

  2. I need know if the calculated weight for each term (word) is the multiplication of
    tf * idf. So if the multiplication of the importance of frequency of occurrence of each word in the text per the inverse frequency of each word in the collection.
    Please I need help with this, expect quick responses

    Yoa

  3. Thanks for answering, but the inverse frequency is used to calculate the weight BM25?. I studied the BM25 and I think only works with the importance of frequency of occurrence.
    I hope you answer

    Yoa

  4. yoa – if you want to discuss the weighting model used by Xapian, the Xapian mailing list, or the Xapian IRC channel, would probably be better places: see http://xapian.org/lists and #xapian on irc.freenode.net But briefly, BM25 uses the frequency of terms in both documents and across the database as a whole, as well as information on the document length and the frequency of terms in the query. ie – it uses a superset of the statistics used by tf-idf, and combines them differently. The formula has performed a lot better in many retrieval experiments, and has some sound theoretical basis justifying its derivation.

  5. Thanks for answering, I appreciate it, I’m working on my thesis for engineer and I work with xapian but is new for me.

    thank you again,
    Yoa

  6. Pingback: Searching date ranges in Xapian and PHP | Geert Van Damme

Leave a Reply

Your email address will not be published. Required fields are marked *