Archive for April, 2009

Distributed search and partition functions

For most applications, Xapian/Flax’s search performance will be excellent to acceptable on a single machine of reasonable spec (see here for a discussion of CPU and RAM requirements). However, if the document corpus is unusually large – more than about 20 million items – then one server may not be enough for acceptable speed. Xapian provides a mechanism called remote backends which lets the load be shared over several machines, and thus increases the performance. Using this technique, scalability is effectively limitless (hardware budget allowing!) It is sometimes known as sharding.

To illustrate, let’s take a hypothetical news archive as our example. This collects news stories and blog posts from a wide range of sources, adds them to a Xapian index, and allows users to search the archive. For the sake of argument, we’ll say it accumulates about 20 million items per month, and that it started on December 2008. Users can search the story text, and optionally restrict the search to a date range, news source etc.

Ignoring the fine details, this is what data flow would look like on a single machine:

distros1

The current user is searching for “obama” in the date range 1-31 January 2009. Disk blocks which are relevant to this search are shown as “B”, while irrelevant blocks are shown as “b” (only a tiny sample of blocks is illustrated).

Again, for the sake of argument, let’s say this search has to read 10,000 blocks in order to retrieve the result set, taking a few seconds. This is unacceptably slow, so the archive administrators decide to distribute the search over multiple machines, using the Xapian remote backend. They use the documentation here to set up three search servers (to begin with), and put data for December 2008 on the first, Jaunary 2009 on the second, and February 2009 on the third. This seems like a good plan, as it will be easy to add a new machine each month, and start indexing to a new database.

However, this way of partitioning the data is far from optimal, and in the case of the query mentioned above will not provide any performance gain at all. We can see why in the diagram below (RB boxes are Xapian remote backend servers):

distros2

Remember that the user was searching for “obama” in the date range 1-31 January 2009. Since Server 2 contains all the data for this month, and the other servers contain none, this means Server 2 has to do all the work – 10,000 disk reads as before. The end result is that the search is just as slow, and Servers 1 and 3 are idle for this query.

This sort of problem is likely to occur for any partitioning function which is not orthogonal (completely unrelated) to any variable which a user may use in a query. Say instead that the data is partitioned on news source name (Reuters, CNBC, BBC etc). A user may want to search in just one or two sources, in which case the load will again be unevenly distributed over the servers.

How then to partition the documents? One approach is to assign each a unique numerical ID (if not already assigned), divide this by the number of search servers, and take the remainder (mod function). If the remainder is 0, assign this document to the first server; if 1, to the second, and so on. This is shown in the diagram below:

distros31

Now, each server has an approximately equal number of blocks relevant to the query. Each server will therefore complete the query in a third of the time, and since this is in parallel, the overall search will be three times faster.

Any other orthogonal partitioning function would also be suitable, such as one based on a digest of the document content. However, a numerical ID is often the simplest. One problem with this partitioning style is that adding new machines is not such a straightforward procedure, and therefore it is simplest if the number of search nodes is decided at the beginning. Having said that, it is simple enough to repartition the databases if necessary.

We plan to make all of this automatic in a future release of Flax. In the meantime, don’t hesitate to get in touch with us if you have any questions about this or any other search topic.

Tags: ,

Posted in Technical

April 25th, 2009

2 Comments »

Please don’t compete!

Microsoft have been asking open source companies not to compete on cost, but rather on value, according to ZDNet. Unfortunately the response to this hasn’t exactly been positive, as CNET reports. I doubt many open source vendors will be taking much notice of what Microsoft would like them to do, and suspect they will happily continue to make the point that if customers are looking at buying software & services, taking the cost of software completely out of the equation is almost certain to save them money.

Tags: , ,

Posted in Business, News

April 21st, 2009

No Comments »

Flax stack and pre-built binaries

We’ve updated the Flax website with a page showing the Flax software stack – hopefully this will go some way towards explaining how Xapian, Xappy and parts of Flax all fit together. There’s still lots in development so expect some more news later this month.

As part of this, we’ve created a new page bringing together all the Win32 files for Xapian that we maintain – including some pre-built binaries for those of you who don’t want to compile Xapian yourself. We’re working on creating one-click installable packages for bindings for the various languages – however at present we’ve only finished this for Python. Hopefully some users of the other languages will let us know how best to present the other bindings.

Tags: , ,

Posted in News, Technical

April 7th, 2009

No Comments »

Xapian Search Architecture

This is not strictly a Flax post, but is intended to clarify the Xapian search architecture for people using Xapian directly. It’s not intended for experienced Xapian hackers, neither is it a general introduction to using Xapian (see here instead).

The Xapian API is fairly complex, and there is often confusion about the role of the QueryParser, terms, document values, document data etc. in indexing and searching. It is probably worth pointing out that Xapian has nothing that resembles a “field” – that is a higher level abstraction relevant to Flax components. Xapian just has documents, identified by an integer ID, which can contain:

  • terms (usually words or short phrases, with optional positional information),
  • values (usually short strings, often containing binary data) and 
  • document data (any data, but often some text suitable for display).

These three types of object are independent, though often related at an application level (terms are often derived from the document data). At indexing time (to be covered in a future post) terms may be generated with an instance of TextProcessor, or they may be generated directly. Terms are designed for general text searching – does a particular word appear in a document? Values are used for a variety of other purposes such as range searching (e.g. dates), sort keys, and application-specific code. The document data cannot be used for searching; it is available for the application to use post-search, e.g. for displaying document information in a list of search results.

When documents are added to a Xapian database, the terms, values and document data are stored separately in structures designed for the optimum lookup speed in each case. This is shown in the diagram below:

Xapian Search Architecture

The search process here is shown in a greatly simplified conceptual form. But essentially, queries can match terms and/or values. A typical term query could be something like

llamas AND yaks

The Xapian matcher looks up the terms in the database and starts to collect a list of documents containing the terms. If the query consists only of terms, this list comprises the search results and is delivered to the client code as an MSet.

If, however, the query contains value range components (such as 1st January 2009 – 2nd April 2009), there is a second stage of searching. The matcher looks up the appropriate values for each of the documents in the candidate MSet and compares it to the query range. If it falls outside, the document is rejected, otherwise the document is passed on to the client.

Finally, it is possible to customise the search process by specifying a subclass of Xapian’s MatchDecider class. These have access to any of the document properties, but for reasons of performance usually utilise values. A typical application would be to filter search results according to a user’s security permissions. The match decider would compare the ACL of each candidate document with the user’s permissions, and reject or pass the document accordingly.

Search optimisation

Although this model is oversimplified, it can help to think about it when designing a database schema and search design. In general, term and position lists are much larger on the disk than values, and therefore the performance of term searching is limited by I/O. This is why having enough RAM available for disk caching (10% of the DB size or more) can greatly increase search performance.

By contrast, value range searches and match deciders are more computationally intensive and tend to be CPU-bound. Without any terms to limit the size of the candidate MSet, pure value searches have to grind through the whole set of values, and can be very slow for large databases.

If range searches are required, a big performance win can be gained by adding terms to partition the database. For example, dates may indexed by YYYYMM terms with a suitable prefix. Then a date range search can be augmented by a term search which cuts down the number of documents to process. E.g. 16th January 2009 – 2nd April 2009 would produce the term query:

(XD200901 OR XD200902 OR XD200903 OR XD200904)

This covers more of January and April than necessary, but this will be taken care of in the subsequent range search. And the term query will greatly reduce the number of documents which need to be processed, increasing the search speed.

However, the right level of granularity must be chosen. If we indexed individual days rather than months then the date term query would have 76 terms, and the performance would be at risk of becoming I/O-bound.

All of this is pretty complicated, which is why we’re adding it to Flax. Our aim is for Flax to automatically generate the optimum schema and query design for any set of input documents, with just a few high-level hints from the user. Keep watching here for news.

Tags:

Posted in Technical

April 2nd, 2009

7 Comments »