Posts Tagged ‘xapian’
So you’re writing a search-related application in your favourite language, and you’ve decided to choose an open source search engine to power it. So far, so good – but how are the two going to communicate?
Let’s look at two engines, Xapian and Lucene, and compare how this might be done. Lucene is written in Java, Xapian in C++ – so if you’re using those languages respectively, everything should be relatively simple – just download the source code and get on with it. However, if this isn’t the case, you’re going to have to work out how to interface to the engine.
The Lucene project has been rewritten in several other languages: for C/C++ there’s Lucy (which includes Perl and Ruby bindings), for Python there’s PyLucene, and there’s even a .Net version called, not surprisingly, Lucene.NET. Some of these ‘ports’ of Lucene are ‘looser’ than others (i.e. they may not share the same API or feature set), and they may not be updated as often as Lucene itself. There are also versions in Perl, Ruby, Delphi or even Lisp (scary!) – there’s a full list available. Not all are currently active projects.
Xapian takes a different approach, with only one core project, but a sheaf of bindings to other languages. Currently these bindings cover C#, Java, Perl, PHP, Python, Ruby and Tcl – but interestingly these are auto-generated using the Simplified Wrapper and Interface Generator or SWIG. This means that every time Xapian’s API changes, the bindings can easily be updated to reflect this (it’s actually not quite that simple, but SWIG copes with the vast majority of code that would otherwise have to be manually edited). SWIG actually supports other languages as well (according to the SWIG website, “Common Lisp (CLISP, Allegro CL, CFFI, UFFI), Lua, Modula-3, OCAML, Octave and R. Also several interpreted and compiled Scheme implementations (Guile, MzScheme, Chicken)”) so in theory bindings to these could also be built relatively easily.
There’s also another way to communicate with both engines, using a search server. SOLR is the search server for Lucene, whereas for Xapian there is Flax Search Service. In this case, any language that supports Web Services (you’d be hard pressed to find a modern language that doesn’t) can communicate with the engine, simply passing data over the HTTP protocol.
Charlie wrote previously that we try and work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.
Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.
The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:
import xapian
import flax.core
db = xapian.WritableDatabase('db', xapian.DB_CREATE)
fm = flax.core.Fieldmap()
fm.language = 'en' # stem for English
fm.setfield('mytext', False) # freetext field
fm.setfield('mydate', True) # filter field
fm.save(db)
and this code indexes some text and a datetime:
from datetime import datetime
doc = fm.document()
doc.index('mytext', "I don't like spam.") # freetext field
doc.index('mydate', datetime(2010, 2, 3, 12, 0)) # filter field
fm.add_document(db, doc)
db.flush() # commit the changes to disk
Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).
Indexing can also be performed by the Action framework. In this case, a text file contains a list of:
- external identifiers (such as XPaths or SQL column names)
- flax fieldnames
- indexing actions
For example, an actions file for XML might look like this:
.//metadata[@name='Author']/@value
author: filter(facet)
author2: index(default)
.//metadata[@name='Year']/@value
published: numeric
This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
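To make the structure of such a file concrete, here is a minimal sketch of how it could be read into Python. This is not the flax.core parser: the function name parse_actions, the regular expression and the returned tuple layout are all assumptions made for illustration, and the sketch simply treats any line of the form "name: action" or "name: action(args)" as a field line and everything else as a new external identifier.

```python
import re

# Assumed shape of a field line: "<fieldname>: <action>" with optional "(<args>)".
FIELD_LINE = re.compile(r"^(\w+):\s*(\w+)(?:\((.*)\))?$")


def parse_actions(text):
    """Return a list of (identifier, [(fieldname, action, args), ...]) pairs."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match is None:
            # Anything that is not a field line starts a new external identifier.
            rules.append((line, []))
        elif not rules:
            raise ValueError("field line before any identifier: %r" % line)
        else:
            name, action, args = match.groups()
            arg_list = [a.strip() for a in args.split(",")] if args else []
            rules[-1][1].append((name, action, arg_list))
    return rules


ACTIONS = """\
.//metadata[@name='Author']/@value
author: filter(facet)
author2: index(default)
.//metadata[@name='Year']/@value
published: numeric
"""

rules = parse_actions(ACTIONS)
for identifier, fields in rules:
    print(identifier, fields)
```

Each identifier ends up paired with the fields it feeds, so an indexer could loop over the rules, evaluate the identifier against the source document and pass the result to each listed field.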
The flaxcode repository contains two example flax.core applications here:
applications/flax_core_examples
One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.
With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.
Our Flax system is based on the Xapian core, which has a set of bindings to various different languages including Perl, Python, PHP, Java, Ruby, C# and even TCL, which makes integration with systems where a particular language is preferred relatively easy. However for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.
The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.
We’re not alone in our preference for Python of course!
Xapian 1.2.0, the first of a new ‘stable’ release series, was announced a few weeks ago and we’ve just uploaded pre-built binaries for Windows and associated build files. You can find them on our Xapian downloads page.
This version features a new, faster, more compact database format and enhanced backwards compatibility with existing databases; a built-in replication system (so in a distributed system you only need to propagate the changes to a Xapian database across the network); a “Match Spy” interface to allow information about search results (such as facets) to be gathered efficiently; subclassable “Posting Sources” to allow extremely flexible search customisations; and many more improvements and bug fixes. Nearly all of these improvements have been available previously in the 1.1 ‘development’ series – you can find out more about how development and stable releases differ on the Xapian RoadMap page.
We sponsored Open Source Search Cambridge last week, which went very well, with attendees from as far away as Tokyo and New Zealand, a great variety of talks, presentations and networking and some excellent food!
Shane Evans from mydeco gave a detailed talk on Creating a product search engine, with some interesting details on how query-independent weights are calculated. He was followed by Olly Betts on How Gmane is implemented using Xapian – 72 million messages indexed on a single server! We also had talks from those involved with the Cheshire3 XML search engine and PuppyIR, a project to develop search frameworks for children, and found out more about how Glasses Direct have implemented their search using SOLR.
The afternoon consisted of a number of well-attended seminars on search topics, such as comparisons of the various open source search engines available. The day ended with informal networking in a nearby pub.
Based on the feedback we got, there’s definitely interest in a similar event next year – watch this space.
Update: sounds like Search Solutions 2009 was also a good day.
We’re sponsoring a one-day event on open source search – details here, with more to be announced soon. Hope some of you can make it!
Our technical partners Cognidox have released a whitepaper detailing their view of the enterprise search market, titled “Why you can’t just ‘Google’ for Enterprise Knowledge” – it’s well worth a read. You can download the PDF from their archive.
Vik Singh has been comparing various open source solutions for search. He only spent a weekend performing the comparison, which is probably not enough time to get any search software performing at its best, and his results reflect this. Xapian was marked down for being slow at indexing (he says 5x slower than SQLite – but then again, SQLite isn’t a search engine, it’s an RDBMS, and really isn’t suitable for search applications) and for producing large index files, much bigger than Lucene’s.
The reason for this is that Xapian stores different information to Lucene. For example, the full term list (un-inverted index) is retained, which makes it possible to do relevance feedback. Also, Lucene handles deletes by maintaining a separate list of deleted documents, which is merged at the next optimise step – which means that the internal statistics are wrong until this point, and that updates can be more complicated, as an updated document needs a new ID.
Neither approach is wrong and both have advantages – Lucene certainly has smaller index files. Some judicious use of the XAPIAN_FLUSH_THRESHOLD parameter, as suggested in some of the comments on the article, would have certainly speeded up Xapian indexing. We can also look forward to the release of the new Xapian ‘Chert’ backend, which will produce indexes at least 50% smaller than the current ‘Flint’ backend. It’s also hard to say how important index sizes are in these days of cheap storage.
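As a rough illustration of the XAPIAN_FLUSH_THRESHOLD suggestion, the sketch below sets the environment variable from Python before any writable database is opened. The helper name set_flush_threshold and the value 100000 are assumptions chosen for the example, not recommendations; to the best of our knowledge the default threshold is 10000 documents and the variable is read when the writable database is opened, but check this against the Xapian version you are using.

```python
import os


def set_flush_threshold(documents):
    """Ask Xapian to flush automatically only after this many changed documents."""
    if documents < 1:
        raise ValueError("threshold must be a positive number of documents")
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(documents)
    return os.environ["XAPIAN_FLUSH_THRESHOLD"]


# Set the variable first, then open the xapian.WritableDatabase afterwards,
# so that the new threshold is in place when the database reads it.
threshold = set_flush_threshold(100000)
print("XAPIAN_FLUSH_THRESHOLD =", threshold)
```

A larger threshold means fewer, bigger flushes, which usually trades memory for indexing speed; the right value depends on document size and available RAM.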
On the search side, Xapian performed comparably to Lucene in terms of relevance and search speed (both were ahead of all the other solutions on these metrics, especially SQLite). There are some other metrics he quoted, such as a ‘support’ figure, given as a score out of 5, which he admits is entirely subjective – you’d have to ask our customers about that one! There’s also no comparison of features, ease of integration and scalability to very large collections.
We’ve talked before about performance metrics. Vik should be applauded for his article and for releasing his test framework as open source; hopefully this can be a foundation for some more in-depth studies.
My colleague Richard Boulton will be presenting at Europython in Birmingham, U.K. next week, specifically at 15.30 on Tuesday 30th June – an abstract is available. He’ll be talking about Xapian, Xappy and Flax, and showing examples of these in action including one using a Django integration layer.
Update: you can now download the slides for Richard’s talk in OpenOffice format.
The Flax team are pleased to announce the alpha release of Flax Search Service (FSS). FSS combines powerful, high-level indexing and search features with a well-designed Web Services interface. FSS is Open Source software (under the MIT licence) and is available as a free download from Google Code.
Web Services and Service Oriented Architectures (SOA) have become increasingly popular in recent years due to their many advantages. FSS provides a RESTful interface in which databases, documents, and searches are represented as resources identified by URLs. For example, to add a document to a database, the document data is POSTed to the database resource. To search for a word or phrase, the client sends the query as a GET request to the database, which responds with a list of matching documents. Indexing transactions may be handled automatically or explicitly by the client.
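The request pattern described above can be sketched with nothing more than the Python standard library. Everything specific here is an assumption for illustration rather than the documented FSS interface: the base address, the /dbs/<name> URL layout, the JSON request body and the "query" parameter name are all hypothetical. The sketch only builds the two requests and does not send them.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080"  # hypothetical server address


def database_url(db_name):
    """URL of the database resource (hypothetical layout)."""
    return "%s/dbs/%s" % (BASE_URL, quote(db_name, safe=""))


def add_document_request(db_name, doc):
    """Build a POST of the document data to the database resource."""
    body = json.dumps(doc).encode("utf-8")  # JSON encoding is an assumption
    return Request(database_url(db_name), data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


def search_request(db_name, query):
    """Build a GET request carrying the query to the database resource."""
    url = database_url(db_name) + "?" + urlencode({"query": query})
    return Request(url, method="GET")


post = add_document_request("books", {"title": "I don't like spam."})
get = search_request("books", "like spam")
print(post.get_method(), post.full_url)
print(get.get_method(), get.full_url)
```

Passing either object to urllib.request.urlopen would send it; any language with an HTTP client can follow the same two-request pattern.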
For convenience, client libraries are being developed in several languages, including PHP, Python, Java and JavaScript. It would be a simple matter to interface to FSS in any language with support for Web protocols. The FSS distribution also includes example code to get you started, and basic documentation.
FSS alpha supports enough indexing and search functionality to implement basic but useful information retrieval systems. Over the next few months we will be adding support for advanced features like facets and tags, geolocation and image search. It will run on any system with support for Xapian and Python (Windows, Linux and Mac amongst others).