Author Archive

flax.core 0.1 available

Charlie wrote previously that we try and work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths,  SQL column name etc)
  • flax fieldname
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2′ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Tags: , , , ,

Posted in Technical

June 24th, 2010

No Comments »

Flax Search Service alpha release

The Flax team are pleased to announce the alpha release of Flax Search Service (FSS). FSS combines powerful, high-level indexing and search features with a well-designed Web Services interface. FSS is Open Source software (under the MIT licence) and is available as a free download from Google Code.

Web Services and Service Oriented Architectures (SOA) have become increasingly popular in recent years due to their many advantages. FSS provides a RESTful interface in which databases, documents, and searches are represented as resources identified by URLs. For example, to add a document to a database,the document data is POSTed to the database resource. To search for a word or phrase,the client sends the query as a GET request to the database, which responds with a list of matching documents. Indexing transactions may be handled automatically or explicitly by the client.

For convenience, client libraries are being developed in several languages, including PHP, Python, Java and JavaScript. It would be a simple matter to interface to FSS in any language with support for Web protocols. The FSS distribution also includes example code to get you started, and basic documentation.

FSS alpha supports enough indexing and search functionality to implement basic but useful information retrieval systems. Over the next few months we will be adding support for advanced features like facets and tags, geolocation and image search. It will run on any system with support for Xapian and Python (Windows, Linux and Mac amongst others).

Tags: , ,

Posted in Uncategorized

June 3rd, 2009

No Comments »

Migrating from lemurconsulting.com

We finally decided to move entirely to flax.co.uk. The one page remaining is the news archive.

Tags:

Posted in News

May 6th, 2009

No Comments »

Distributed search and partition functions

For most applications, Xapian/Flax’s search performance will be excellent to acceptable on a single machine of reasonable spec (see here for a discussion of CPU and RAM requirements). However, if the document corpus is unusually large – more than about 20 million items – then one server may not be enough for acceptable speed. Xapian provides a mechanism called remote backends which lets the load be shared over several machines, and thus increases the performance. Using this technique, scalability is effectively limitless (hardware budget allowing!) It is sometimes known as sharding.

To illustrate, let’s take a hypothetical news archive as our example. This collects news stories and blog posts from a wide range of sources, adds them to a Xapian index, and allows users to search the archive. For the sake of argument, we’ll say it accumulates about 20 million items per month, and that it started on December 2008. Users can search the story text, and optionally restrict the search to a date range, news source etc.

Ignoring the fine details, this is what data flow would look like on a single machine:

distros1

The current user is searching for “obama” in the date range 1-31 January 2009. Disk blocks which are relevant to this search are shown as “B”, while irrelevant blocks are shown as “b” (only a tiny sample of blocks is illustrated).

Again, for the sake of argument, let’s say this search has to read 10,000 blocks in order to retrieve the result set, taking a few seconds. This is unacceptably slow, so the archive administrators decide to distribute the search over multiple machines, using the Xapian remote backend. They use the documentation here to set up three search servers (to begin with), and put data for December 2008 on the first, Jaunary 2009 on the second, and February 2009 on the third. This seems like a good plan, as it will be easy to add a new machine each month, and start indexing to a new database.

However, this way of partitioning the data is far from optimal, and in the case of the query mentioned above will not provide any performance gain at all. We can see why in the diagram below (RB boxes are Xapian remote backend servers):

distros2

Remember that the user was searching for “obama” in the date range 1-31 January 2009. Since Server 2 contains all the data for this month, and the other servers contain none, this means Server 2 has to do all the work – 10,000 disk reads as before. The end result is that the search is just as slow, and Servers 1 and 3 are idle for this query.

This sort of problem is likely to occur for any partitioning function which is not orthogonal (completely unrelated) to any variable which a user may use in a query. Say instead that the data is partitioned on news source name (Reuters, CNBC, BBC etc). A user may want to search in just one or two sources, in which case the load will again be unevenly distributed over the servers.

How then to partition the documents? One approach is to assign each a unique numerical ID (if not already assigned), divide this by the number of search servers, and take the remainder (mod function). If the remainder is 0, assign this document to the first server; if 1, to the second, and so on. This is shown in the diagram below:

distros31

Now, each server has an approximately equal number of blocks relevant to the query. Each server will therefore complete the query in a third of the time, and since this is in parallel, the overall search will be three times faster.

Any other orthogonal partitioning function would also be suitable, such as one based on a digest of the document content. However, a numerical ID is often the simplest. One problem with this partitioning style is that adding new machines is not such a straightforward procedure, and therefore it is simplest if the number of search nodes is decided at the beginning. Having said that, it is simple enough to repartition the databases if necessary.

We plan to make all of this automatic in a future release of Flax. In the meantime, don’t hesitate to get in touch with us if you have any questions about this or any other search topic.

Tags: ,

Posted in Technical

April 25th, 2009

No Comments »

Xapian Search Architecture

This is not strictly a Flax post, but is intended to clarify the Xapian search architecture for people using Xapian directly. It’s not intended for experienced Xapian hackers, neither is it a general introduction to using Xapian (see here instead).

The Xapian API is fairly complex, and there is often confusion about the role of the QueryParser, terms, document values, document data etc. in indexing and searching. It is probably worth pointing out that Xapian has nothing that resembles a “field” – that is a higher level abstraction relevant to Flax components. Xapian just has documents, identified by an integer ID, which can contain:

  • terms (usually words or short phrases, with optional positional information),
  • values (usually short strings, often containing binary data) and 
  • document data (any data, but often some text suitable for display).

These three types of object are independent, though often related at an application level (terms are often derived from the document data). At indexing time (to be covered in a future post) terms may be generated with an instance of TextProcessor, or they may be generated directly. Terms are designed for general text searching – does a particular word appear in a document? Values are used for a variety of other purposes such as range searching (e.g. dates), sort keys, and application-specific code. The document data cannot be used for searching; it is available for the application to use post-search, e.g. for displaying document information in a list of search results.

When documents are added to a Xapian database, the terms, values and document data are stored separately in structures designed for the optimum lookup speed in each case. This is shown in the diagram below:

Xapian Search Architecture

The search process here is shown in a greatly simplified conceptual form. But essentially, queries can match terms and/or values. A typical term query could be something like

llamas AND yaks

The Xapian matcher looks up the terms in the database and starts to collect a list of documents containing the terms. If the query consists only of terms, this list comprises the search results and is delivered to the client code as an MSet.

If, however, the query contains value range components (such as 1st January 2009 – 2nd April 2009), there is a second stage of searching. The matcher looks up the appropriate values for each of the documents in the candidate MSet and compares it to the query range. If it falls outside, the document is rejected, otherwise the document is passed on to the client.

Finally, it is possible to customise the search process by specifying a subclass of Xapian’s MatchDecider class. These have access to any of the document properties, but for reasons of performance usually utilise values. A typical application would be to filter search results according to a user’s security permissions. The match decider would compare the ACL of each candidate document with the user’s permissions, and reject or pass the document accordingly.

Search optimisation

Although this model is oversimplified, it can help to think about it when designing a database schema and search design. In general, term and position lists are much larger on the disk than values, and therefore the performance of term searching is limited by I/O. This is why having enough RAM available for disk caching (10% of the DB size or more) can greatly increase search performance.

By contrast, value range searches and match deciders are more computationally intensive and tend to be CPU-bound. Without any terms to limit the size of the candidate MSet, pure value searches have to grind through the whole set of values, and can be very slow for large databases.

If range searches are required, a big performance win can be gained by adding terms to partition the database. For example, dates may indexed by YYYYMM terms with a suitable prefix. Then a date range search can be augmented by a term search which cuts down the number of documents to process. E.g. 16th January 2009 – 2nd April 2009 would produce the term query:

(XD200901 OR XD200902 OR XD200903 OR XD200904)

This covers more of January and April than necessary, but this will be taken care of in the subsequent range search. And the term query will greatly reduce the number of documents which need to be processed, increasing the search speed.

However, the right level of granularity must be chosen. If we indexed individual days rather than months then the date term query would have 76 terms, and the performance would be at risk of becoming I/O-bound.

All of this is pretty complicated, which is why we’re adding it to Flax. Our aim is for Flax to automatically generate the optimum schema and query design for any set of input documents, with just a few high-level hints from the user. Keep watching here for news.

Tags:

Posted in Technical

April 2nd, 2009

6 Comments »