This October I’ve been invited to speak at Lucene Revolution, a conference on open source search to be held in Boston, USA. I’ll be part of the closing panel on October 8th, together with speakers from Lucid Imagination and Exalead. It looks like a very interesting event, with speakers from IBM, Cisco, LinkedIn and the Smithsonian.
As part of the run-up to the conference Stephen Arnold has interviewed me – we discussed the wider picture of open source search, why a strong community is important and why flexibility can be the key to successful integration.
We’ve recently uploaded a new crawler framework to the Flax code repository. It is designed to be used from Python to build a web crawler for your project. It’s multithreaded and simple to use; here’s a minimal example:
import crawler
crawler.dump = MyContentDumperImplementation()
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))
crawler.start()
Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.
We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.
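As a rough sketch of what a content dumper could look like, the class below stores each URL and its downloaded content in SQLite. The class name, the `dump(url, content)` method and the table layout are assumptions made for illustration only; the dumper interface the crawler actually expects is defined in the Flax repository and may differ.

```python
import sqlite3
import threading


class SQLiteContentDumper:
    """Hypothetical content dumper: keeps URL -> content pairs in SQLite."""

    def __init__(self, path=":memory:"):
        # The crawler is multithreaded, so the connection is shared across
        # threads and every access is guarded by a lock.
        self._lock = threading.Lock()
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content BLOB)"
        )

    def dump(self, url, content):
        """Store (or overwrite) the content downloaded from a URL."""
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)",
                (url, content),
            )
            self._conn.commit()

    def get(self, url):
        """Return the stored content for a URL, or None if it is unknown."""
        with self._lock:
            row = self._conn.execute(
                "SELECT content FROM pages WHERE url = ?", (url,)
            ).fetchone()
        return row[0] if row else None
```

Keying the table on the URL means a page that is crawled twice simply replaces its earlier copy.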
Charlie wrote previously that we try to work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.
Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.
The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:
import xapian
import flax.core
db = xapian.WritableDatabase('db', xapian.DB_CREATE)
fm = flax.core.Fieldmap()
fm.language = 'en' # stem for English
fm.setfield('mytext', False) # freetext field
fm.setfield('mydate', True) # filter field
fm.save(db)
and this code indexes some text and a datetime:
from datetime import datetime
doc = fm.document()
doc.index('mytext', "I don't like spam.")
doc.index('mydate', datetime(2010, 2, 3, 12, 0))
fm.add_document(db, doc)
db.flush()
Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).
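To make the idea of type-driven handling concrete, here is a minimal sketch of dispatching on the Python type of a value. It is not flax.core’s implementation: the function name, the returned pair and the string formats are all assumptions chosen for illustration.

```python
from datetime import datetime


def describe_value(value):
    """Return a (type name, string form) pair for a field value.

    Hypothetical helper: the formats below are illustrative only.
    """
    # bool is a subclass of int, so reject it before the int check.
    if isinstance(value, bool):
        raise TypeError("bool is not a supported field type")
    if isinstance(value, datetime):
        return "datetime", value.strftime("%Y%m%d%H%M%S")
    if isinstance(value, int):
        return "int", str(value)
    if isinstance(value, float):
        return "float", repr(value)
    if isinstance(value, str):
        return "string", value
    raise TypeError("unsupported field type: %s" % type(value).__name__)
```

Because the decision is made per value rather than per fieldname, nothing in a scheme like this stops two documents from putting different types into the same field.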
Indexing can also be performed by the Action framework. In this case, a text file contains a list of:
- external identifiers (such as XPaths, SQL column names, etc.)
- flax fieldname
- indexing actions
For example, an actions file for XML might look like this:
.//metadata[@name='Author']/@value
author: filter(facet)
author2: index(default)
.//metadata[@name='Year']/@value
published: numeric
This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.
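The structure of an actions file like the one above can be read with a few lines of Python. The following is only a sketch under stated assumptions: it treats any line of the form `name: action` or `name: action(args)` as a field line and everything else as an external identifier. The function name and the returned dictionary are hypothetical, and the parser in flax.core may well work differently.

```python
import re

# A field line looks like "fieldname: action" or "fieldname: action(arg, ...)".
FIELD_LINE = re.compile(r"^(\w+):\s*(\w+)(?:\(([^)]*)\))?$")


def parse_actions(text):
    """Parse actions text into {external identifier: [(field, action, args)]}."""
    actions = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match is None:
            # Anything that is not a field line starts a new identifier.
            current = line
            actions[current] = []
        elif current is None:
            raise ValueError("field line before any identifier: %r" % line)
        else:
            field, action, args = match.groups()
            arg_list = [a.strip() for a in args.split(",")] if args else []
            actions[current].append((field, action, arg_list))
    return actions
```

One limitation of this simple approach is that an external identifier which itself looks like `word: word` would be misread as a field line.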
The flaxcode repository contains two example flax.core applications here:
applications/flax_core_examples
One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.
With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.
Our Flax system is based on the Xapian core, which has bindings to various languages including Perl, Python, PHP, Java, Ruby, C# and even TCL. This makes integration relatively easy with systems where a particular language is preferred. However, for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.
The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.
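To illustrate the kind of component re-use described above, here is a minimal sketch that pulls field values out of an XML document using only the standard library’s `xml.etree.ElementTree`. It assumes the same `<metadata name="..." value="..."/>` layout that appears in the actions file example; the function name and return structure are hypothetical, and no Xapian or Flax calls are shown.

```python
import xml.etree.ElementTree as ET


def extract_metadata(xml_text, names):
    """Collect the 'value' attribute of <metadata> elements, grouped by name."""
    root = ET.fromstring(xml_text)
    found = {name: [] for name in names}
    for element in root.iter("metadata"):
        name = element.get("name")
        value = element.get("value")
        # Keep only the metadata names that were asked for.
        if name in found and value is not None:
            found[name].append(value)
    return found
```

The extracted values could then be passed to whatever indexing code the application uses.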
We’re not alone in our preference for Python of course!
Xapian 1.2.0, the first of a new ‘stable’ release series, was announced a few weeks ago and we’ve just uploaded pre-built binaries for Windows and associated build files. You can find them on our Xapian downloads page.
This version features a new, faster, more compact database format and enhanced backwards compatibility with existing databases; a built-in replication system (so in a distributed system you only need to propagate the changes to a Xapian database across the network); a “Match Spy” interface to allow information about search results (such as facets) to be gathered efficiently; subclassable “Posting Sources” to allow extremely flexible search customisations; and many more improvements and bug fixes. Nearly all of these improvements were previously available in the 1.1 ‘development’ series – you can find out more about how development and stable releases differ on the Xapian RoadMap page.
News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.
The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.
We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and Powerpoint, the Open Office equivalent formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.
We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.
Feedback would of course be very welcome.
Last week we heard from various sources that Microsoft had announced it would continue to develop its recently acquired FAST Search technology only on Windows. This had long been feared by some in the sector, and it must be worrying for existing customers.
Platform choice can be a key issue for those looking to implement advanced search, as there may be significant existing in-house expertise and investment in a particular platform. Our Flax solution works just as well on Windows, Linux or Solaris. It’s sad to see such a powerful technology as FAST become so narrow in focus, but it’s not particularly surprising after the Microsoft acquisition.
UPDATE: more coverage on this from The Register
Here are two relatively new networking groups – these are informal gatherings of those who work with enterprise search. I’ve been to the first one and it was very interesting.
London Open Source Social – for those working with open-source enterprise search
Enterprise Search London – more generally for those working in enterprise search
Back at Online 2009 on Thursday, to take part in the closing panel: “Cloud Computing, Open Source and Semantics: Content and Search Predictions”, moderated by Stephen Arnold. We only touched on four of the ten controversial themes Stephen had prepared: we talked a lot about how ‘Google pressure’ will affect the market, how XML isn’t necessarily the universal panacea for representing data, the growth of rich media and the challenges it presents, and finally security. There were some great questions from the floor as well; thanks to all who came, and to the organisers and Stephen for inviting us. I wish we’d had more time!
I didn’t agree with Stephen’s main point that Google will crush us all – I think the battles between Google and Microsoft (and Google and everyone else) are a distraction. While they’re fighting it out the rest of us can get on with developing cutting-edge search technologies. Open source search technology gives us tremendous flexibility, allows us to develop solutions very fast, allows the customer to take ownership of the system that’s being developed and now has comparable performance, scalability and commercial support to the traditional closed source world.
The real question is how this will affect the profitability of existing companies in the search space. I wonder who won’t be around at next year’s Online Information show…