We’ve updated the Flax website with a page showing the Flax software stack – hopefully this will go some way towards explaining how Xapian, Xappy and parts of Flax all fit together. There’s still lots in development so expect some more news later this month.
As part of this, we’ve created a new page bringing together all the Win32 files for Xapian that we maintain – including some pre-built binaries for those of you who don’t want to compile Xapian yourself. We’re working on creating one-click installable packages for bindings for the various languages – however at present we’ve only finished this for Python. Hopefully some users of the other languages will let us know how best to present the other bindings.
This is not strictly a Flax post, but is intended to clarify the Xapian search architecture for people using Xapian directly. It’s not intended for experienced Xapian hackers, neither is it a general introduction to using Xapian (see here instead).
The Xapian API is fairly complex, and there is often confusion about the role of the QueryParser, terms, document values, document data etc. in indexing and searching. It is probably worth pointing out that Xapian has nothing that resembles a “field” – that is a higher level abstraction relevant to Flax components. Xapian just has documents, identified by an integer ID, which can contain:
- terms (usually words or short phrases, with optional positional information),
- values (usually short strings, often containing binary data) and
- document data (any data, but often some text suitable for display).
These three types of object are independent, though often related at an application level (terms are often derived from the document data). At indexing time (to be covered in a future post) terms may be generated with an instance of TextProcessor, or they may be generated directly. Terms are designed for general text searching – does a particular word appear in a document? Values are used for a variety of other purposes such as range searching (e.g. dates), sort keys, and application-specific code. The document data cannot be used for searching; it is available for the application to use post-search, e.g. for displaying document information in a list of search results.
When documents are added to a Xapian database, the terms, values and document data are stored separately in structures designed for the optimum lookup speed in each case. This is shown in the diagram below:

The search process here is shown in a greatly simplified conceptual form. But essentially, queries can match terms and/or values. A typical term query could be something like
llamas AND yaks
The Xapian matcher looks up the terms in the database and starts to collect a list of documents containing the terms. If the query consists only of terms, this list comprises the search results and is delivered to the client code as an MSet.
If, however, the query contains value range components (such as 1st January 2009 – 2nd April 2009), there is a second stage of searching. The matcher looks up the appropriate values for each of the documents in the candidate MSet and compares it to the query range. If it falls outside, the document is rejected, otherwise the document is passed on to the client.
Finally, it is possible to customise the search process by specifying a subclass of Xapian’s MatchDecider class. These have access to any of the document properties, but for reasons of performance usually utilise values. A typical application would be to filter search results according to a user’s security permissions. The match decider would compare the ACL of each candidate document with the user’s permissions, and reject or pass the document accordingly.
Search optimisation
Although this model is oversimplified, it can help to think about it when designing a database schema and search design. In general, term and position lists are much larger on the disk than values, and therefore the performance of term searching is limited by I/O. This is why having enough RAM available for disk caching (10% of the DB size or more) can greatly increase search performance.
By contrast, value range searches and match deciders are more computationally intensive and tend to be CPU-bound. Without any terms to limit the size of the candidate MSet, pure value searches have to grind through the whole set of values, and can be very slow for large databases.
If range searches are required, a big performance win can be gained by adding terms to partition the database. For example, dates may indexed by YYYYMM terms with a suitable prefix. Then a date range search can be augmented by a term search which cuts down the number of documents to process. E.g. 16th January 2009 – 2nd April 2009 would produce the term query:
(XD200901 OR XD200902 OR XD200903 OR XD200904)
This covers more of January and April than necessary, but this will be taken care of in the subsequent range search. And the term query will greatly reduce the number of documents which need to be processed, increasing the search speed.
However, the right level of granularity must be chosen. If we indexed individual days rather than months then the date term query would have 76 terms, and the performance would be at risk of becoming I/O-bound.
All of this is pretty complicated, which is why we’re adding it to Flax. Our aim is for Flax to automatically generate the optimum schema and query design for any set of input documents, with just a few high-level hints from the user. Keep watching here for news.
Anurag Goel recently carried out a comparitive test of Xapian/Flax and Lucene/Solr. Some interesting results here: it seems Lucene is faster at building indexes, but Xapian is faster and possibly more accurate at searching. We can expect some further speed improvements over the next few months as a new, more compact backend to Xapian is released.
By the way, the article mentions Xappy: this is a Python interface to Xapian that is a major part of our Flax enterprise search platform. You can get Xappy here.
Searching images is a difficult problem, and it’s not a feature offered by many commercial search engines. Some will cheat slightly, by indexing the title or filename of the image, or the text surrounding an image embedded on a page, and call this ‘image search’ – but this method doesn’t work very well, especially when you have a standalone image called ‘IMG0000064.jpg’ which is actually a picture of an apple. We’ve seen some good demos of actual image search – Imense is particularly impressive – but none that promise a generic solution that will work with all images.
In the meantime we’ve been developing some image related search technology for one of our clients, and we can now offer image similarity matching as part of Flax – you can read more about this exciting development on the Searching with Xapian blog, written by my colleague Richard Boulton.
Stephen Arnold recently posted some rather impressive performance figures for Autonomy’s IDOL search engine. This kind of data is all very well, but without independent testing and more detail it’s hard to know how these figures apply to the real world.
So here’s an idea. Why not create an openly available collection of test data, a set of searches and a set of conditions, then compare the performance of the various available engines for indexing and searching? Recording the software and hardware used as well, of course. Making the data and conditions public would allow for independent verification.
I’m not sure commercial search vendors would ever agree to this, but it’s a nice idea.
Some more resources for those looking for open source search components: Search Components Online has some great lists, including an exhaustive list of filters based on an article by New Idea Engineering called Where Have All The Filters Gone? Filters in this case are defined as components that extract plain text and metadata from file formats, i.e. Adobe PDF or Microsoft Word.
Another more general list of search engines and technologies can be found at SearchTools.com, although parts of this are a little out of date.
One of the challenges we often come up against is indexing data held in other proprietary or open source systems, such as databases or content management systems. Talend is an open source data integration platform that lets you connect to a huge variety of these systems, from Salesforce to Oracle to SugarCRM. Talend is an offshoot of the Eclipse open source community. We’ll be following the development of Talend with interest.
There’s also the related problem of translating file formats before indexing them. Luckily there are lots of open source converters (as used by Omega, part of Xapian), or if you run on a Microsoft platform there’s IFilters – the latter aren’t open source, but you can easily connect to them from another program using COM. In our experience, the IFilters are better at extracting content from Microsoft-specific formats .
UPDATE: I’ve also recently discovered the Tika project, under the Apache umbrella. Not a lot of formats supported so far, but it’s a start.
Based on some feedback, we’ve made some more technical details about Flax available on our Features page. You can download the PDF here.