Archive for March, 2009

Search requirements and asking the right questions

When we’re contacted by potential clients, we have to gather as much information as possible about how and why they need search technology. This takes the form of either a face-to-face or telephone meeting, with much scribbling in notebooks, or a long exchange of emails. In all cases there are some important questions that must be answered, and I thought it might be useful to list the most common ones here:

How many items do you need to search?

The number of items to search varies widely, from a few thousand to hundreds of millions. This number affects both the eventual size of the searchable index and how fast it can be built, and will thus inform the eventual system design, in both hardware and software terms. It’s usually possible to search from 5 to 50 million items on a single server, but this also depends on the answer to the next question.
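As a rough illustration of how the item count feeds into hardware planning, here is a minimal sketch that turns the 5 to 50 million items-per-server range above into a best-case and worst-case server count. The function names are hypothetical, and the assumption that capacity scales linearly with the number of servers is ours, made purely for illustration; real sizing depends on the items themselves.

```python
def servers_needed(item_count, items_per_server):
    """Round up to whole servers; even a tiny collection needs one."""
    if item_count < 0 or items_per_server <= 0:
        raise ValueError("item_count must be >= 0 and items_per_server > 0")
    # Integer ceiling division avoids floating-point rounding surprises.
    whole_servers = (item_count + items_per_server - 1) // items_per_server
    return max(1, whole_servers)


def server_range(item_count, low_capacity=5_000_000, high_capacity=50_000_000):
    """Return (best_case, worst_case) server counts.

    The default capacities are the 5 to 50 million figure quoted above;
    linear scaling across servers is an assumption, not a measurement.
    """
    best_case = servers_needed(item_count, high_capacity)
    worst_case = servers_needed(item_count, low_capacity)
    return best_case, worst_case


if __name__ == "__main__":
    print(server_range(200_000_000))
```

For 200 million items this gives a range of 4 to 40 servers, which shows why the follow-up questions about item size and complexity matter so much.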

How big/complex are the items to be searched?

This includes both the size of each item and what data it contains: for example, does each item contain a price, or a characteristic such as an author’s name or a colour? The item can be part of a group of items, have user tags applied, or be restricted to a certain group of users. The searchable index we build will have to take account of all this information in the correct way, so we can search it effectively.
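To make this concrete, here is a minimal sketch of how such an item might be described before indexing. The `Item` class, its field names and the `visible_to` helper are all hypothetical, chosen only to mirror the examples above; they are not the schema of any particular search engine.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Item:
    """Hypothetical description of one searchable item."""
    item_id: str
    text: str                                 # the free text to be searched
    price: Optional[float] = None             # numeric value, e.g. for range filters
    author: Optional[str] = None              # a characteristic of the item
    colour: Optional[str] = None              # another characteristic
    group_id: Optional[str] = None            # group of items this one belongs to
    tags: Set[str] = field(default_factory=set)            # user-applied tags
    allowed_groups: Set[str] = field(default_factory=set)  # empty means unrestricted


def visible_to(item, user_groups):
    """True if the item is unrestricted or shares a group with the user."""
    if not item.allowed_groups:
        return True
    return bool(item.allowed_groups & set(user_groups))


if __name__ == "__main__":
    book = Item("b1", "A guide to orchards", price=12.5, author="A. Writer",
                tags={"fruit"}, allowed_groups={"staff"})
    print(visible_to(book, ["staff", "sales"]))
    print(visible_to(book, ["guests"]))
```

Each kind of field calls for different treatment in the index: free text is searched, a price is filtered by range, and the access groups decide whether a result may be shown at all.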

What other systems must the search engine work with?

Sometimes search engines will have to fit into an existing infrastructure – say an intranet or web application framework – and sometimes they will have to extract information from another system, such as a relational database. The engine may also have to take account of existing security systems, which can impact how each search result is delivered. It may have to deliver search results as a web page, or as a report, or as an email. There’s obviously a huge variety of possible systems to interact with, not least the operating system or platform.

What’s your schedule for delivering a search solution?

This is another key point – it can be relatively quick to build a simple search application, but if the system is going to be very large or very complex, or if a staged delivery based on user feedback is required, then it’s important to know what the expectations are. We’ve installed systems in a couple of days, and built more complex ones over years.

Every client will have differing requirements and expectations, so the more information we can gather at the start of the process, the more likely it is that everyone ends up satisfied with the end result.

Posted in Business

March 19th, 2009

More on performance metrics

Anurag Goel recently carried out a comparative test of Xapian/Flax and Lucene/Solr. Some interesting results here: it seems Lucene is faster at building indexes, but Xapian is faster and possibly more accurate at searching. We can expect some further speed improvements over the next few months as a new, more compact backend to Xapian is released.

By the way, the article mentions Xappy: this is a Python interface to Xapian that is a major part of our Flax enterprise search platform. You can get Xappy here.

Posted in Technical

March 13th, 2009

Image searching

Searching images is a difficult problem, and it’s not a feature offered by many commercial search engines. Some will cheat slightly, by indexing the title or filename of the image, or the text surrounding an image embedded on a page, and call this ‘image search’ – but this method doesn’t work very well, especially when you have a standalone image called ‘IMG0000064.jpg’ which is actually a picture of an apple. We’ve seen some good demos of actual image search – Imense is particularly impressive – but none that promise a generic solution that will work with all images.

In the meantime we’ve been developing some image-related search technology for one of our clients, and we can now offer image similarity matching as part of Flax – you can read more about this exciting development on the Searching with Xapian blog, written by my colleague Richard Boulton.

Posted in Technical

March 11th, 2009

Performance metrics

Stephen Arnold recently posted some rather impressive performance figures for Autonomy’s IDOL search engine. This kind of data is all very well, but without independent testing and more detail it’s hard to know how these figures apply to the real world.

So here’s an idea. Why not create an openly available collection of test data, a set of searches and a set of conditions, then compare the performance of the various available engines for indexing and searching? The software and hardware used would be recorded as well, of course. Making the data and conditions public would allow for independent verification.

I’m not sure commercial search vendors would ever agree to this, but it’s a nice idea.
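For what it’s worth, the mechanics of such a comparison are simple enough to sketch. The code below is a minimal illustration, not a real benchmark: `ToyEngine` is a hypothetical stand-in (a tiny in-memory inverted index), and the `index`/`search` interface is an assumption about how each engine under test might be wrapped. The harness times indexing and searching separately and records the software and hardware details alongside the results.

```python
import platform
import time
from collections import defaultdict


class ToyEngine:
    """Hypothetical stand-in engine: a tiny in-memory inverted index."""
    name = "toy"

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, docs):
        # Map each lower-cased term to the ids of documents containing it.
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.postings[term].add(doc_id)

    def search(self, query):
        # Return ids of documents containing every term in the query.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


def run_benchmark(engine, docs, queries):
    """Time indexing and searching, and record the test conditions."""
    start = time.perf_counter()
    engine.index(docs)
    index_seconds = time.perf_counter() - start

    results = {}
    start = time.perf_counter()
    for query in queries:
        results[query] = sorted(engine.search(query))
    search_seconds = time.perf_counter() - start

    return {
        "engine": engine.name,
        "python": platform.python_version(),   # software used
        "machine": platform.machine(),         # hardware used
        "system": platform.system(),
        "documents": len(docs),
        "index_seconds": index_seconds,
        "search_seconds": search_seconds,
        "results": results,                    # kept so accuracy can be compared too
    }


if __name__ == "__main__":
    docs = {"d1": "red apple", "d2": "green apple", "d3": "red car"}
    report = run_benchmark(ToyEngine(), docs, ["apple", "red apple"])
    print(report["results"])
```

Keeping the returned results as well as the timings matters: publishing both would let anyone check accuracy as well as speed, under the same conditions.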

Posted in Technical

March 4th, 2009
