Archive for the ‘Technical’ Category

flax.crawler arrives

We’ve recently uploaded a new crawler framework to the Flax code repository. This is designed for use from Python to build a web crawler for your project. It’s multithreaded and simple to use, here’s a minimal example:

import crawler

crawler.dump = MyContentDumperImplementation()
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))
crawler.start()

Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.

We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.

Tags: , , ,

Posted in Technical

August 2nd, 2010

No Comments »

Log analysis and adaptive search

I attended an interesting talk by Udo Kruschwitz on Adaptive Intranet Search last night as part of the Enterprise Search London Meetup. Udo has built a search engine for the University of Essex and has been investigating how to help users to refine their query using techniques such as suggesting related terms (there’s a similar feature in Xapian called ‘top terms’ – here’s an example). As part of this he’s done a great deal of analysis of query and session logs, and is building up expertise on automatically maintained domain knowledge – moving away from the traditional model of manually maintained networks relating one word or phrase to another. For example, his system is learning automatically that when users type “map” into the search box, they really want to search for “campus map”. The number of documents in his test collection is small and the volume of searches is low; it will be interesting to see how these ideas scale to larger collections and groups of users.

The group was small, informal and seemed to consist mainly of those with expertise in implementing search solutions – no sales or marketing here, just a group of people discussing how best to get the job done.

Tags: , ,

Posted in Technical, events

July 30th, 2010

No Comments »

flax.core 0.1 available

Charlie wrote previously that we try and work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths,  SQL column name etc)
  • flax fieldname
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2′ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines, the other is a minimal web search application in a similar number of lines. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Tags: , , , ,

Posted in Technical

June 24th, 2010

No Comments »

Packaged solutions and customisability, the Python way

With any large scale software installation, there is going to be some customisation and tweaking necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of those you need will be missing and some won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has a set of bindings to various different languages including Perl, Python, PHP, Java, Ruby, C# and even TCL, which makes integration with systems where a particular language is preferred relatively easy. However for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.

We’re not alone in our preference for Python of course!

Tags: , , , ,

Posted in Technical

June 14th, 2010

No Comments »

Xapian 1.2.0 arrives

Xapian 1.2.0, the first of a new ’stable’ release series, was announced a few weeks ago and we’ve just uploaded pre-built binaries for Windows and associated build files. You can find them on our Xapian downloads page.

This version features a new, faster, more compact database format and enhanced backwards compatibility with existing databases; a built-in replication system (so in a distributed system you only need to propagate the changes to a Xapian database across the network); a “Match Spy” interface to allow information about search results (such as facets) to be gathered efficiently; subclassable “Posting Sources” to allow extremely flexible search customisations and many more improvements and bug fixes. Nearly all of these improvements have been available previously in the 1.1 ‘development’ series – you can find out more about how development and stable releases differ on the Xapian RoadMap page.

Tags: , , ,

Posted in Technical

May 14th, 2010

4 Comments »

Some new open source file filters & previewers

We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and Powerpoint, the Open Office equivalent formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.

We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.

Feedback would of course be very welcome.

Tags: , ,

Posted in Technical

March 12th, 2010

No Comments »

When real-time search isn’t

Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.

From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.

A lot of search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must be made across all these indexes of course, with care taken to maintain accurate statistics and thus relevancy ranking.

The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.

In any case there will always be some delay between content being published and being searchable – the trick is to keep this to the minimum, so it appears as ‘real-time’ as possible.

Tags: ,

Posted in News, Technical

November 5th, 2009

2 Comments »

Replacing relational databases with search engines for simple lookups

One of the things we often notice about existing systems based on relational databases (RDB) is that as they scale to millions of items, simple lookup tasks become slow and inefficient. These tasks don’t usually require complicated database operations, so in most cases it is possible to relocate the data from the RDB into a search engine like Flax.

Consider a system where a search engine has already been implemented to search textual product information, but numerical data on each product, such as price, is still being stored in a RDB. Users will often need filters on search results such as ’show me items under £10′ and so a RDB operation similar to ‘SELECT productID FROM products WHERE price<£10‘ will be needed, in addition to the search engine query. Modern search engines like Flax implement range search functions, so that numerical information can be added to documents, and it is thus possible to carry out this operation in the search engine as part of the full-text search for the product information.

We’ve noticed with several clients that it is now possible to move all their data from the original RDB into the search engine. This can obviously lead to cost savings, as only one system must be hosted, maintained and backed up, and scaling out can be far simpler.

Another way to look at this is to consider a search engine as an example of a document-oriented database.

Tags: ,

Posted in Technical

August 27th, 2009

No Comments »

Distributed search and partition functions

For most applications, Xapian/Flax’s search performance will be excellent to acceptable on a single machine of reasonable spec (see here for a discussion of CPU and RAM requirements). However, if the document corpus is unusually large – more than about 20 million items – then one server may not be enough for acceptable speed. Xapian provides a mechanism called remote backends which lets the load be shared over several machines, and thus increases the performance. Using this technique, scalability is effectively limitless (hardware budget allowing!) It is sometimes known as sharding.

To illustrate, let’s take a hypothetical news archive as our example. This collects news stories and blog posts from a wide range of sources, adds them to a Xapian index, and allows users to search the archive. For the sake of argument, we’ll say it accumulates about 20 million items per month, and that it started on December 2008. Users can search the story text, and optionally restrict the search to a date range, news source etc.

Ignoring the fine details, this is what data flow would look like on a single machine:

distros1

The current user is searching for “obama” in the date range 1-31 January 2009. Disk blocks which are relevant to this search are shown as “B”, while irrelevant blocks are shown as “b” (only a tiny sample of blocks is illustrated).

Again, for the sake of argument, let’s say this search has to read 10,000 blocks in order to retrieve the result set, taking a few seconds. This is unacceptably slow, so the archive administrators decide to distribute the search over multiple machines, using the Xapian remote backend. They use the documentation here to set up three search servers (to begin with), and put data for December 2008 on the first, Jaunary 2009 on the second, and February 2009 on the third. This seems like a good plan, as it will be easy to add a new machine each month, and start indexing to a new database.

However, this way of partitioning the data is far from optimal, and in the case of the query mentioned above will not provide any performance gain at all. We can see why in the diagram below (RB boxes are Xapian remote backend servers):

distros2

Remember that the user was searching for “obama” in the date range 1-31 January 2009. Since Server 2 contains all the data for this month, and the other servers contain none, this means Server 2 has to do all the work – 10,000 disk reads as before. The end result is that the search is just as slow, and Servers 1 and 3 are idle for this query.

This sort of problem is likely to occur for any partitioning function which is not orthogonal (completely unrelated) to any variable which a user may use in a query. Say instead that the data is partitioned on news source name (Reuters, CNBC, BBC etc). A user may want to search in just one or two sources, in which case the load will again be unevenly distributed over the servers.

How then to partition the documents? One approach is to assign each a unique numerical ID (if not already assigned), divide this by the number of search servers, and take the remainder (mod function). If the remainder is 0, assign this document to the first server; if 1, to the second, and so on. This is shown in the diagram below:

distros31

Now, each server has an approximately equal number of blocks relevant to the query. Each server will therefore complete the query in a third of the time, and since this is in parallel, the overall search will be three times faster.

Any other orthogonal partitioning function would also be suitable, such as one based on a digest of the document content. However, a numerical ID is often the simplest. One problem with this partitioning style is that adding new machines is not such a straightforward procedure, and therefore it is simplest if the number of search nodes is decided at the beginning. Having said that, it is simple enough to repartition the databases if necessary.

We plan to make all of this automatic in a future release of Flax. In the meantime, don’t hesitate to get in touch with us if you have any questions about this or any other search topic.

Tags: ,

Posted in Technical

April 25th, 2009

No Comments »

Flax stack and pre-built binaries

We’ve updated the Flax website with a page showing the Flax software stack – hopefully this will go some way towards explaining how Xapian, Xappy and parts of Flax all fit together. There’s still lots in development so expect some more news later this month.

As part of this, we’ve created a new page bringing together all the Win32 files for Xapian that we maintain – including some pre-built binaries for those of you who don’t want to compile Xapian yourself. We’re working on creating one-click installable packages for bindings for the various languages – however at present we’ve only finished this for Python. Hopefully some users of the other languages will let us know how best to present the other bindings.

Tags: , ,

Posted in News, Technical

April 7th, 2009

No Comments »