Posts Tagged ‘performance’

Questions to ask your search vendor

#1 – How does it work?
You’ll probably get as many different answers to this as there are vendors – but you may not get the whole truth. Bear in mind that many search engines are built on the same theoretical ideas. An engine might use a vector-space or probabilistic model for ordering results, for example. Most will create an inverted index.
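
As a rough illustration of the inverted index idea, here is a minimal Python sketch. It is not how any particular engine implements it: the function names and the three sample documents are made up for this example, and real engines add term positions, statistics for ranking, compression and much more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all_terms(index, query):
    """Return IDs of documents containing every query term (boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "open source search engine",
    2: "closed source search appliance",
    3: "open data archive",
}
index = build_inverted_index(docs)
print(search_all_terms(index, "open source"))
```

A query only touches the posting lists for its own terms, which is why lookups stay fast as the collection grows.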

#2 – How fast is it?
Every search engine will take a finite amount of time to index a document or produce search results. Some of these processes will be limited by how fast data can be written to or read from disk, some by how fast the processor can do calculations. The key point is whether this time is going to work for you – will your users care if some complicated queries take ten seconds rather than a fraction of a second? Is there a time in the middle of the night when the system can spend a couple of hours building a new index? Watch out for silly answers such as “it’s instantaneous”.

#3 – How does it scale?
Whatever data you have today, you’ll have more tomorrow! How many servers will you need today, and how easy is it to add more in the future as necessary? Will this affect the speed of indexing or searching? Cloud-based solutions can help, especially when the amount of data or queries can be variable.

#4 – How much does it cost?
This is a question with several potential answers: the cost of a software license (of course, with open source code this can be zero), the cost of integration and customisation so the engine fits your requirements and the cost of ongoing support. Beware of a solution that promises much, but only after months of customisation. You should also ask how the cost scales with any growth in the number of source documents or users.

#5 – What happens if the vendor is taken over or disappears?
If the vendor is acquired by another company, or goes out of business, what happens to the software? The new owners may force you to move to their preferred solution, or in the worst case you can be left with no support for an obsolescent product. Ask if the vendor offers escrow. Open source licensing may also be a solution.

The above is not meant to be a complete list – feel free to suggest further questions!

Posted in Uncategorized

November 2nd, 2010

No Comments »

Xapian 1.2.0 arrives

Xapian 1.2.0, the first of a new ‘stable’ release series, was announced a few weeks ago and we’ve just uploaded pre-built binaries for Windows and associated build files. You can find them on our Xapian downloads page.

This version features a new, faster, more compact database format and enhanced backwards compatibility with existing databases; a built-in replication system (so in a distributed system you only need to propagate the changes to a Xapian database across the network); a “Match Spy” interface to allow information about search results (such as facets) to be gathered efficiently; subclassable “Posting Sources” to allow extremely flexible search customisations and many more improvements and bug fixes. Nearly all of these improvements have been available previously in the 1.1 ‘development’ series – you can find out more about how development and stable releases differ on the Xapian RoadMap page.

Posted in Technical

May 14th, 2010

4 Comments »

Online Information 2009, day 3

Back at Online 2009 on Thursday, to take part in the closing panel: “Cloud Computing, Open Source and Semantics: Content and Search Predictions”, moderated by Stephen Arnold. We only touched on four of the ten controversial themes Stephen had prepared: we talked a lot about how ‘Google pressure’ will affect the market, how XML isn’t necessarily the universal panacea for representing data, on the growth of rich media and the challenges it presents and finally on security. Some great questions from the floor as well, thanks to all who came and the organisers and Stephen for inviting us. I wish we’d had more time!

I didn’t agree with Stephen’s main point that Google will crush us all – I think the battles between Google and Microsoft (and Google and everyone else) are a distraction. While they’re fighting it out the rest of us can get on with developing cutting-edge search technologies. Open source search technology gives us tremendous flexibility, allows us to develop solutions very fast, allows the customer to take ownership of the system that’s being developed and now has comparable performance, scalability and commercial support to the traditional closed source world.

The real question is how this will affect the profitability of existing companies in the search space. I wonder who won’t be around at next year’s Online Information show…

Posted in Business, News

December 4th, 2009

No Comments »

Xapian compared

Vik Singh has been comparing various open source solutions for search. He only spent a weekend performing the comparison, which is probably not enough time to get any search software performing at its best, and his results reflect this. Xapian was marked down for being slow at indexing (he says 5x slower than SQLite – but then again, SQLite isn’t a search engine, it’s an RDBMS, and really isn’t suitable for search applications) and for producing large index files, much bigger than Lucene’s.

The reason for this is that Xapian stores different information to Lucene. For example, the full term list (un-inverted index) is retained, which makes it possible to do relevance feedback. Also, Lucene handles deletes by maintaining a separate list of deleted documents, which is merged at the next optimise step – which means that the internal statistics are wrong until this point, and that updates can be more complicated, as an updated document needs a new ID.

Neither approach is wrong and both have advantages – Lucene certainly has smaller index files. Some judicious use of the XAPIAN_FLUSH_THRESHOLD parameter, as suggested in some of the comments on the article, would have certainly speeded up Xapian indexing. We can also look forward to the release of the new Xapian ‘Chert’ backend, which will produce indexes at least 50% smaller than the current ‘Flint’ backend. It’s also hard to say how important index sizes are in these days of cheap storage.

On the search side, Xapian performed comparably to Lucene in terms of relevance and search speed (both were ahead of all the other solutions on these metrics, especially SQLite). There are some other metrics he quoted, such as a ‘support’ figure, given as a score out of 5, which he admits is entirely subjective – you’d have to ask our customers about that one! There’s also no comparison of features, ease of integration and scalability to very large collections.

We’ve talked before about performance metrics. Vik should be applauded for his article and for releasing his test framework as open source. Hopefully this can be a foundation for some more in-depth studies.

Distributed search and partition functions

For most applications, Xapian/Flax’s search performance will be excellent to acceptable on a single machine of reasonable spec (see here for a discussion of CPU and RAM requirements). However, if the document corpus is unusually large – more than about 20 million items – then one server may not be enough for acceptable speed. Xapian provides a mechanism called remote backends which lets the load be shared over several machines, and thus increases the performance. Using this technique, scalability is effectively limitless (hardware budget allowing!). Splitting an index across machines in this way is sometimes known as sharding.

To illustrate, let’s take a hypothetical news archive as our example. This collects news stories and blog posts from a wide range of sources, adds them to a Xapian index, and allows users to search the archive. For the sake of argument, we’ll say it accumulates about 20 million items per month, and that it started in December 2008. Users can search the story text, and optionally restrict the search to a date range, news source etc.

Ignoring the fine details, this is what data flow would look like on a single machine:

[Diagram “distros1”: data flow for a search on a single machine]

The current user is searching for “obama” in the date range 1-31 January 2009. Disk blocks which are relevant to this search are shown as “B”, while irrelevant blocks are shown as “b” (only a tiny sample of blocks is illustrated).

Again, for the sake of argument, let’s say this search has to read 10,000 blocks in order to retrieve the result set, taking a few seconds. This is unacceptably slow, so the archive administrators decide to distribute the search over multiple machines, using the Xapian remote backend. They use the documentation here to set up three search servers (to begin with), and put data for December 2008 on the first, January 2009 on the second, and February 2009 on the third. This seems like a good plan, as it will be easy to add a new machine each month, and start indexing to a new database.

However, this way of partitioning the data is far from optimal, and in the case of the query mentioned above will not provide any performance gain at all. We can see why in the diagram below (RB boxes are Xapian remote backend servers):

[Diagram “distros2”: the same search with data partitioned by month across three remote backend servers]

Remember that the user was searching for “obama” in the date range 1-31 January 2009. Since Server 2 contains all the data for this month, and the other servers contain none, this means Server 2 has to do all the work – 10,000 disk reads as before. The end result is that the search is just as slow, and Servers 1 and 3 are idle for this query.

This sort of problem is likely to occur for any partitioning function which is not orthogonal (completely unrelated) to any variable which a user may use in a query. Say instead that the data is partitioned on news source name (Reuters, CNBC, BBC etc). A user may want to search in just one or two sources, in which case the load will again be unevenly distributed over the servers.
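
To make the imbalance concrete, here is a small Python simulation of the date-partitioned setup. It is only a sketch: the figure of 3,000 stories per month, the ID scheme and the function names are invented for illustration, and it counts matching documents rather than real disk reads.

```python
MONTHS = ["2008-12", "2009-01", "2009-02"]
DOCS_PER_MONTH = 3000  # hypothetical figure, kept small for the example

# (doc_id, month) pairs; IDs are assigned in arrival order.
docs = [
    (month_no * DOCS_PER_MONTH + n, month)
    for month_no, month in enumerate(MONTHS)
    for n in range(DOCS_PER_MONTH)
]

def partition_by_month(doc_id, month):
    """Date-based partitioning: one server per month."""
    return MONTHS.index(month)

def work_per_server(docs, partition, query_month):
    """Count the documents each server must examine for a month-restricted query."""
    load = {server: 0 for server in range(len(MONTHS))}
    for doc_id, month in docs:
        if month == query_month:
            load[partition(doc_id, month)] += 1
    return load

date_load = work_per_server(docs, partition_by_month, "2009-01")
print(date_load)
```

The second server ends up with every matching document while the other two have none, so the query is no faster than on a single machine.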

How then to partition the documents? One approach is to assign each a unique numerical ID (if not already assigned), divide this by the number of search servers, and take the remainder (mod function). If the remainder is 0, assign this document to the first server; if 1, to the second, and so on. This is shown in the diagram below:

[Diagram “distros31”: the same search with documents partitioned by ID modulo the number of servers]

Now, each server has an approximately equal number of blocks relevant to the query. Each server will therefore complete its share of the query in roughly a third of the time, and since the servers work in parallel, the overall search will be about three times faster (ignoring the overhead of merging the results).
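
The modulo scheme described above can be sketched in a few lines of Python. The function name and the range of matching IDs are assumptions made for this example; nothing here calls Xapian itself.

```python
NUM_SERVERS = 3

def server_for(doc_id, num_servers=NUM_SERVERS):
    """Assign a document to a server by the remainder of its numeric ID."""
    return doc_id % num_servers

# Hypothetical IDs of the January 2009 stories that match the query.
matching_ids = range(3000, 6000)

# Count how many matching documents land on each server.
load = [0] * NUM_SERVERS
for doc_id in matching_ids:
    load[server_for(doc_id)] += 1

print(load)
```

Because consecutive IDs rotate through the servers, a query restricted by date (or by source) is spread evenly across all of them.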

Any other orthogonal partitioning function would also be suitable, such as one based on a digest of the document content. However, a numerical ID is often the simplest. One problem with this partitioning style is that adding new machines is not such a straightforward procedure, and therefore it is simplest if the number of search nodes is decided at the beginning. Having said that, it is simple enough to repartition the databases if necessary.
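
Both points can be illustrated with a short Python sketch. The choice of SHA-1 and the function names are assumptions for this example, not something Xapian or Flax prescribes. The second function estimates how much data has to move if a modulo-partitioned system grows from three servers to four.

```python
import hashlib

def server_for_digest(content, num_servers):
    """Pick a server from a digest of the document content."""
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers

def fraction_moved(doc_ids, old_servers, new_servers):
    """Fraction of documents whose server changes when the server count changes."""
    moved = sum(1 for d in doc_ids if d % old_servers != d % new_servers)
    return moved / len(doc_ids)

# Hypothetical collection of 12,000 sequential document IDs.
ids = range(12000)
print(fraction_moved(ids, 3, 4))
```

With simple modulo partitioning, most documents change server when the server count changes, which is why it pays to fix the number of search nodes up front.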

We plan to make all of this automatic in a future release of Flax. In the meantime, don’t hesitate to get in touch with us if you have any questions about this or any other search topic.

Posted in Technical

April 25th, 2009

2 Comments »

More on performance metrics

Anurag Goel recently carried out a comparative test of Xapian/Flax and Lucene/Solr. Some interesting results here: it seems Lucene is faster at building indexes, but Xapian is faster and possibly more accurate at searching. We can expect some further speed improvements over the next few months as a new, more compact backend to Xapian is released.

By the way, the article mentions Xappy: this is a Python interface to Xapian that is a major part of our Flax enterprise search platform. You can get Xappy here.

Posted in Technical

March 13th, 2009

2 Comments »

Performance metrics

Stephen Arnold recently posted some rather impressive performance figures for Autonomy’s IDOL search engine. This kind of data is all very well, but without independent testing and more detail it’s hard to know how these figures apply to the real world.

So here’s an idea. Why not create an openly available collection of test data, a set of searches and a set of conditions, then compare the performance of the various available engines for indexing and searching? Recording the software and hardware used as well, of course. Making the data and conditions public would allow for independent verification.
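
As a very rough sketch of what such a harness might look like, the Python below times indexing and searching and records the conditions of the run. Everything in it is hypothetical: the function names, the fields in the report and the toy "engine" (a linear scan) are stand-ins, and a real test would plug in actual engines and a much larger public data set.

```python
import platform
import time

def run_benchmark(engine_name, index_fn, search_fn, documents, queries):
    """Time indexing and searching, and record the conditions of the run."""
    start = time.perf_counter()
    index = index_fn(documents)
    index_seconds = time.perf_counter() - start

    start = time.perf_counter()
    result_counts = [len(search_fn(index, q)) for q in queries]
    search_seconds = time.perf_counter() - start

    return {
        "engine": engine_name,
        "documents": len(documents),
        "queries": len(queries),
        "index_seconds": index_seconds,
        "search_seconds": search_seconds,
        "result_counts": result_counts,
        # Software and hardware details, so others can reproduce the run.
        "python": platform.python_version(),
        "machine": platform.machine(),
        "system": platform.system(),
    }

# Toy stand-in engine, for illustration only: a linear scan over the text.
def toy_index(documents):
    return list(documents)

def toy_search(index, query):
    return [doc for doc in index if query in doc]

report = run_benchmark(
    "toy-linear-scan", toy_index, toy_search,
    documents=["obama wins election", "markets fall", "obama speech"],
    queries=["obama", "markets", "xapian"],
)
print(report["result_counts"])
```

Publishing the report alongside the data and queries is what would let anyone else repeat the test and check the figures.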

I’m not sure commercial search vendors would ever agree to this, but it’s a nice idea.

Posted in Technical

March 4th, 2009

1 Comment »