Posts Tagged ‘xapian’

flax.core 0.1 available

Charlie wrote previously that we try and work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    from datetime import datetime

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).
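To give a feel for the bookkeeping that named fields save you, here is a minimal sketch of a table mapping field names to term prefixes and value slot numbers. The `FieldTable` class and its methods are hypothetical names invented for this illustration; this is not how flax.core is implemented, and it needs only the standard library.

```python
class FieldTable:
    """Hypothetical name -> (term prefix, value slot) table.

    This is not flax.core's implementation, only a picture of the
    bookkeeping that named fields take off your hands.
    """

    def __init__(self):
        self._fields = {}

    def add(self, name):
        """Register a field, assigning the next free prefix and slot."""
        if name not in self._fields:
            n = len(self._fields)
            if n >= 26:
                raise ValueError("this sketch only supports 26 fields")
            # Xapian's convention reserves "X" for user-defined prefixes.
            self._fields[name] = ("X" + chr(ord("A") + n), n)
        return self._fields[name]

    def term(self, name, word):
        """Prefixed term for a word in the named field."""
        prefix, _slot = self._fields[name]
        return prefix + word.lower()

    def slot(self, name):
        """Value slot number used for the named field."""
        return self._fields[name][1]


table = FieldTable()
table.add("mytext")
table.add("mydate")
print(table.term("mytext", "Spam"))  # XAspam
print(table.slot("mydate"))          # 1
```

Without something like this, every indexer and every query parser in an application has to agree by hand on which prefix and which slot belong to which field.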

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths, SQL column names etc.)
  • flax fieldname
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
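As a rough illustration of how such a file could be read, here is a small standard-library parser. It assumes, based only on the example above, that unindented lines are external identifiers and indented lines have the form `fieldname: action` or `fieldname: action(arg, ...)`. The function name `parse_actions` and the returned structure are made up for this sketch; the real Action framework may parse the file quite differently.

```python
import re

# Assumed shape of an action line: "fieldname: action" or "fieldname: action(args)".
ACTION_RE = re.compile(
    r"^(?P<field>\w+)\s*:\s*(?P<action>\w+)(?:\((?P<args>[^)]*)\))?$"
)


def parse_actions(text):
    """Return {external_id: [(fieldname, action, [args, ...]), ...]}."""
    result = {}
    current = None
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines only separate entries
        if raw[0].isspace():
            # Indented line: an action belonging to the current identifier.
            if current is None:
                raise ValueError("action before any identifier: %r" % raw)
            match = ACTION_RE.match(raw.strip())
            if match is None:
                raise ValueError("cannot parse action line: %r" % raw)
            arg_text = match.group("args") or ""
            args = [a.strip() for a in arg_text.split(",") if a.strip()]
            result[current].append(
                (match.group("field"), match.group("action"), args)
            )
        else:
            # Unindented line: a new external identifier (an XPath here).
            current = raw.strip()
            result.setdefault(current, [])
    return result


EXAMPLE = """\
.//metadata[@name='Author']/@value
    author: filter(facet)
    author2: index(default)

.//metadata[@name='Year']/@value
    published: numeric
"""

for identifier, actions in parse_actions(EXAMPLE).items():
    print(identifier, actions)
```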

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Posted in Technical

June 24th, 2010

Packaged solutions and customisability, the Python way

With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has a set of bindings to various different languages including Perl, Python, PHP, Java, Ruby, C# and even TCL, which makes integration with systems where a particular language is preferred relatively easy. However for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.

We’re not alone in our preference for Python of course!

Posted in Technical

June 14th, 2010

Xapian 1.2.0 arrives

Xapian 1.2.0, the first of a new ‘stable’ release series, was announced a few weeks ago and we’ve just uploaded pre-built binaries for Windows and associated build files. You can find them on our Xapian downloads page.

This version features a new, faster, more compact database format and enhanced backwards compatibility with existing databases; a built-in replication system (so in a distributed system you only need to propagate the changes to a Xapian database across the network); a “Match Spy” interface to allow information about search results (such as facets) to be gathered efficiently; subclassable “Posting Sources” to allow extremely flexible search customisations; and many more improvements and bug fixes. Nearly all of these improvements have been available previously in the 1.1 ‘development’ series – you can find out more about how development and stable releases differ on the Xapian RoadMap page.

Posted in Technical

May 14th, 2010

Open Source Search Event

We sponsored Open Source Search Cambridge last week, which went very well, with attendees from as far away as Tokyo and New Zealand, a great variety of talks, presentations and networking, and some excellent food!

Shane Evans from mydeco gave a detailed talk on Creating a product search engine, with some interesting details on how query-independent weights are calculated. He was followed by Olly Betts on How Gmane is implemented using Xapian – 72 million messages indexed on a single server! We also had talks from those involved with the Cheshire3 XML search engine and PuppyIR, a project to develop search frameworks for children, and found out more about how Glasses Direct have implemented their search using SOLR.

The afternoon consisted of a number of well-attended seminars on search topics, such as comparisons of the various open source search engines available. The day ended with informal networking in a nearby pub.

Based on the feedback we got, there’s definitely interest in a similar event next year – watch this space.

Update: sounds like Search Solutions 2009 was also a good day.

Posted in events

October 6th, 2009

Open Source Search event in Cambridge on 29th September

We’re sponsoring a one-day event on open source search – details are here, and more will be announced soon. Hope some of you can make it!

Posted in News

July 27th, 2009

Whitepaper on enterprise search

Our technical partners Cognidox have released a whitepaper detailing their view of the enterprise search market, titled “Why you can’t just ‘Google’ for Enterprise Knowledge” – it’s well worth a read. You can download the PDF from their archive.

Posted in News

July 13th, 2009

Xapian compared

Vik Singh has been comparing various open source solutions for search. He only spent a weekend performing the comparison, which is probably not enough time to get any search software performing at its best, and his results reflect this. Xapian was marked down for being slow at indexing (he says 5x slower than SQLite – but then again, SQLite isn’t a search engine, it’s an RDBMS, and really isn’t suitable for search applications) and for producing large index files, much bigger than Lucene’s.

The reason for this is that Xapian stores different information to Lucene. For example, the full term list (un-inverted index) is retained, which makes it possible to do relevance feedback. Also, Lucene handles deletes by maintaining a separate list of deleted documents, which is merged at the next optimise step – which means that the internal statistics are wrong until this point, and that updates can be more complicated, as an updated document needs a new ID.

Neither approach is wrong and both have advantages – Lucene certainly has smaller index files. Some judicious use of the XAPIAN_FLUSH_THRESHOLD parameter, as suggested in some of the comments on the article, would have certainly speeded up Xapian indexing. We can also look forward to the release of the new Xapian ‘Chert’ backend, which will produce indexes at least 50% smaller than the current ‘Flint’ backend. It’s also hard to say how important index sizes are in these days of cheap storage.
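For anyone wanting to try the flush threshold, here is a minimal sketch of setting it from Python. XAPIAN_FLUSH_THRESHOLD is read from the environment, so the sketch sets it before any database is opened, to be safe. The value 100000 is purely illustrative (not a recommendation), and the `bulk_index` helper is a hypothetical name for this example; it needs the Xapian Python bindings to actually run.

```python
import os

# Illustrative value only: batch more document changes in memory
# before Xapian automatically flushes them to disk.
os.environ["XAPIAN_FLUSH_THRESHOLD"] = "100000"


def bulk_index(path, texts):
    """Hypothetical helper: add each text as a document, then commit once."""
    import xapian  # requires the Xapian Python bindings

    db = xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)
    for text in texts:
        doc = xapian.Document()
        doc.set_data(text)
        db.add_document(doc)
    db.commit()  # explicit final commit of anything still pending
```

A larger threshold trades memory for fewer, bigger writes, so the right value depends on document size and available RAM.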

On the search side, Xapian performed comparably to Lucene in terms of relevance and search speed (both were ahead of all the other solutions on these metrics, especially SQLite). There are some other metrics he quoted, such as a ‘support’ figure, given as a score out of 5, which he admits is entirely subjective – you’d have to ask our customers about that one! There’s also no comparison of features, ease of integration and scalability to very large collections.

We’ve talked before about performance metrics. Vik should be applauded for his article and for releasing his test framework as open source; hopefully this can be a foundation for some more in-depth studies.

Python and Flax presentation

My colleague Richard Boulton will be presenting at Europython in Birmingham, U.K. next week, specifically at 15.30 on Tuesday 30th June – an abstract is available. He’ll be talking about Xapian, Xappy and Flax, and showing examples of these in action including one using a Django integration layer.

Update: you can now download the slides for Richard’s talk in OpenOffice format.

Posted in Uncategorized

June 25th, 2009

Flax Search Service alpha release

The Flax team are pleased to announce the alpha release of Flax Search Service (FSS). FSS combines powerful, high-level indexing and search features with a well-designed Web Services interface. FSS is Open Source software (under the MIT licence) and is available as a free download from Google Code.

Web Services and Service Oriented Architectures (SOA) have become increasingly popular in recent years due to their many advantages. FSS provides a RESTful interface in which databases, documents, and searches are represented as resources identified by URLs. For example, to add a document to a database, the document data is POSTed to the database resource. To search for a word or phrase, the client sends the query as a GET request to the database, which responds with a list of matching documents. Indexing transactions may be handled automatically or explicitly by the client.
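To make the request pattern concrete, here is a sketch that builds (but does not send) the two kinds of request using only the Python standard library. Everything specific in it is an assumption made for illustration: the base URL, the `/dbs/<name>` path layout, the `q` query parameter and the JSON document body are not taken from the FSS documentation, so check the distribution's own docs for the real details.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE = "http://localhost:8080"  # hypothetical server address


def add_document_request(db_name, fields):
    """Build a POST of document data to the (assumed) database resource."""
    url = "%s/dbs/%s" % (BASE, quote(db_name))
    body = json.dumps(fields).encode("utf-8")  # JSON body is an assumption
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def search_request(db_name, query):
    """Build a GET on the (assumed) database resource carrying the query."""
    url = "%s/dbs/%s?%s" % (BASE, quote(db_name), urlencode({"q": query}))
    return Request(url, method="GET")


post = add_document_request("news", {"title": "hello", "text": "I like spam"})
get = search_request("news", "spam eggs")
print(post.get_method(), post.full_url)
print(get.get_method(), get.full_url)
```

Passing either object to `urllib.request.urlopen` would send it; the point is only that adding and searching are plain HTTP verbs applied to a database URL.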

For convenience, client libraries are being developed in several languages, including PHP, Python, Java and JavaScript. It would be a simple matter to interface to FSS in any language with support for Web protocols. The FSS distribution also includes example code to get you started, and basic documentation.

FSS alpha supports enough indexing and search functionality to implement basic but useful information retrieval systems. Over the next few months we will be adding support for advanced features like facets and tags, geolocation and image search. It will run on any system with support for Xapian and Python (Windows, Linux and Mac amongst others).

Posted in Uncategorized

June 3rd, 2009

Distributed search and partition functions

For most applications, Xapian/Flax’s search performance will range from acceptable to excellent on a single machine of reasonable spec (see here for a discussion of CPU and RAM requirements). However, if the document corpus is unusually large – more than about 20 million items – then one server may not be enough for acceptable speed. Xapian provides a mechanism called remote backends which lets the load be shared over several machines, and thus increases the performance. Using this technique, sometimes known as sharding, scalability is effectively limitless (hardware budget allowing!).

To illustrate, let’s take a hypothetical news archive as our example. This collects news stories and blog posts from a wide range of sources, adds them to a Xapian index, and allows users to search the archive. For the sake of argument, we’ll say it accumulates about 20 million items per month, and that it started in December 2008. Users can search the story text, and optionally restrict the search to a date range, news source etc.

Ignoring the fine details, this is what data flow would look like on a single machine:

[Diagram: data flow for a search on a single machine]

The current user is searching for “obama” in the date range 1-31 January 2009. Disk blocks which are relevant to this search are shown as “B”, while irrelevant blocks are shown as “b” (only a tiny sample of blocks is illustrated).

Again, for the sake of argument, let’s say this search has to read 10,000 blocks in order to retrieve the result set, taking a few seconds. This is unacceptably slow, so the archive administrators decide to distribute the search over multiple machines, using the Xapian remote backend. They use the documentation here to set up three search servers (to begin with), and put data for December 2008 on the first, January 2009 on the second, and February 2009 on the third. This seems like a good plan, as it will be easy to add a new machine each month, and start indexing to a new database.

However, this way of partitioning the data is far from optimal, and in the case of the query mentioned above will not provide any performance gain at all. We can see why in the diagram below (RB boxes are Xapian remote backend servers):

[Diagram: search distributed over three remote backend servers, with data partitioned by month]

Remember that the user was searching for “obama” in the date range 1-31 January 2009. Since Server 2 contains all the data for this month, and the other servers contain none, this means Server 2 has to do all the work – 10,000 disk reads as before. The end result is that the search is just as slow, and Servers 1 and 3 are idle for this query.

This sort of problem is likely to occur for any partitioning function which is not orthogonal (completely unrelated) to any variable which a user may use in a query. Say instead that the data is partitioned on news source name (Reuters, CNBC, BBC etc). A user may want to search in just one or two sources, in which case the load will again be unevenly distributed over the servers.

How then to partition the documents? One approach is to assign each a unique numerical ID (if not already assigned), divide this by the number of search servers, and take the remainder (mod function). If the remainder is 0, assign this document to the first server; if 1, to the second, and so on. This is shown in the diagram below:

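[Diagram: search distributed over three remote backend servers, with data partitioned by document ID modulo 3]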

Now, each server has an approximately equal number of blocks relevant to the query. Each server will therefore complete its share of the query in roughly a third of the time and, since the servers work in parallel, the overall search will be about three times faster.
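The difference between the two partitioning schemes can be seen in a toy simulation. The sketch below uses a made-up corpus of 90 documents (30 per month from December 2008 to February 2009) and counts how many of the January documents land on each of three servers under each scheme. The function names and the corpus are invented for illustration, and counting documents is only a stand-in for counting disk blocks.

```python
from collections import Counter
from datetime import date


def shard_by_month(doc_date, first=(2008, 12)):
    """Date-based partitioning: one server per month, counted from `first`."""
    return (doc_date.year - first[0]) * 12 + (doc_date.month - first[1])


def shard_by_id(doc_id, n_servers):
    """Modulo partitioning: remainder of the numeric ID."""
    return doc_id % n_servers


# Toy corpus: IDs 0..89, 30 documents in each of Dec 2008, Jan 2009, Feb 2009.
docs = []
for doc_id in range(90):
    month_index = doc_id // 30
    year, month = (2008, 12) if month_index == 0 else (2009, month_index)
    docs.append((doc_id, date(year, month, 1 + doc_id % 28)))

# The user's query is restricted to January 2009.
january = [(i, d) for i, d in docs if (d.year, d.month) == (2009, 1)]

by_month = Counter(shard_by_month(d) for _, d in january)
by_id = Counter(shard_by_id(i, 3) for i, _ in january)

print("partitioned by month:", dict(by_month))  # every January doc on server index 1
print("partitioned by ID:   ", dict(by_id))     # 10 January docs on each server
```

With month partitioning one server holds all 30 relevant documents; with modulo partitioning each of the three holds 10.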

Any other orthogonal partitioning function would also be suitable, such as one based on a digest of the document content. However, a numerical ID is often the simplest. One problem with this partitioning style is that adding new machines is not such a straightforward procedure, and therefore it is simplest if the number of search nodes is decided at the beginning. Having said that, it is simple enough to repartition the databases if necessary.
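Two of the points above can be sketched in a few lines: a digest-based partitioning function, and a rough measure of why adding a server to a modulo scheme is awkward. The choice of SHA-1 and of the first eight digest bytes is arbitrary, and both function names are invented for this example.

```python
import hashlib


def shard_by_digest(content, n_servers):
    """Pick a server from a digest of the document content."""
    digest = hashlib.sha1(content.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then take the remainder.
    return int.from_bytes(digest[:8], "big") % n_servers


def moved_fraction(doc_ids, old_n, new_n):
    """Fraction of documents whose server changes when going from
    old_n to new_n servers under plain ID-modulo partitioning."""
    moved = sum(1 for i in doc_ids if i % old_n != i % new_n)
    return moved / len(doc_ids)


print(shard_by_digest("I don't like spam.", 3))
print(moved_fraction(range(12), 3, 4))  # 0.75
```

In this small case, going from three servers to four moves three quarters of the documents, which is why repartitioning means rebuilding most of the databases rather than just starting a new one.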

We plan to make all of this automatic in a future release of Flax. In the meantime, don’t hesitate to get in touch with us if you have any questions about this or any other search topic.

Posted in Technical

April 25th, 2009
