Archive for the ‘Uncategorized’ Category

Online Information 2009, day 1

I visited the Online Information exhibition yesterday at Olympia. My first impression was that the exhibition area was very quiet – and a few of the exhibitors agreed with me. The current financial situation would seem the obvious cause. At previous shows exhibitors have given away all kinds of freebies, from bags, to mini mice, to branded juggling balls… but this year you’d be lucky if you came away with a couple of free pens and a boiled sweet.

I dropped in on the associated conference later, and caught a presentation titled “The Real Time Web: Discovery vs. Search”. Antonio Gulli of Microsoft told us about their new European offices, including one in Soho, that were concentrating on bringing new features to Bing – but the results look very familiar. Is Bing doomed to play catch-up? The only ‘real time’ feature he discussed was indexing Twitter, although apparently they’ll soon be indexing Facebook as well. Surely real time encompasses more than these two platforms?

Stephen Arnold gave us his thoughts on what we should mean by ‘real time’, sensibly pointing out that the financial services world has been using real time systems for many years. He also injected some notes of caution about how difficult it is to trust information spread amongst peers on social networking sites – here’s a recent case; read further down the page for a great quote from Graham Cluley.

Someone from Endeca (I didn’t catch the name, he was replacing the published speaker) showed us lots of slides of various applications of search, but his theme seemed to be less about ‘real time’ and more about how search can replace traditional databases, something I’ve blogged about recently.

We finished with Conrad Wolfram, demonstrating Wolfram Alpha, which isn’t really a search engine but rather a computation engine – it tries to give you a set of answers, rather than a list of possible resources where the answer might be found. Not a lot of ‘real time’ here either.

I’m back on Thursday as part of the closing keynote panel.


Posted in Uncategorized, events

December 2nd, 2009


Xapian compared

Vik Singh has been comparing various open source solutions for search. He only spent a weekend performing the comparison, which is probably not enough time to get any search software performing at its best, and his results reflect this. Xapian was marked down for being slow at indexing (he says 5x slower than SQLite – but then again, SQLite isn’t a search engine, it’s an RDBMS, and really isn’t suitable for search applications) and for producing large index files, much bigger than Lucene’s.

The reason for this is that Xapian stores different information to Lucene. For example, the full term list (un-inverted index) is retained, which makes it possible to do relevance feedback. Also, Lucene handles deletes by maintaining a separate list of deleted documents, which is merged at the next optimise step – which means that the internal statistics are wrong until this point, and that updates can be more complicated, as an updated document needs a new ID.
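To make the point about deletes concrete, here is a toy sketch in Python. It is not Lucene's actual implementation, and the class and method names are invented for illustration: it simply models an index that records deletions in a separate list and only cleans up its postings at a later merge step.

```python
from collections import defaultdict


class TombstoneIndex:
    """Toy inverted index that marks deletions instead of removing postings."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document IDs
        self.deleted = set()              # IDs deleted but not yet merged away
        self.next_id = 0

    def add(self, text):
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.postings[term].add(doc_id)
        return doc_id

    def delete(self, doc_id):
        # Cheap: just remember the ID; the postings are left untouched.
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        # An update is a delete plus an add, so the document gets a new ID.
        self.delete(doc_id)
        return self.add(text)

    def doc_freq(self, term):
        # Read straight from the postings, so deleted documents are still counted.
        return len(self.postings.get(term, ()))

    def search(self, term):
        # Results are filtered against the deleted list at query time.
        return sorted(self.postings.get(term, set()) - self.deleted)

    def optimise(self):
        # Merge step: physically drop deleted documents from every posting list.
        for term in list(self.postings):
            self.postings[term] -= self.deleted
            if not self.postings[term]:
                del self.postings[term]
        self.deleted.clear()


index = TombstoneIndex()
first = index.add("open source search")
index.add("enterprise search")
second = index.update(first, "open source indexing")

stale = index.doc_freq("search")  # still counts the deleted document
index.optimise()
fresh = index.doc_freq("search")  # corrected once the merge has run
```

In this toy model the document frequency for "search" is overstated until `optimise()` runs, and the updated document comes back with a different ID, which are the two effects described above.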

Neither approach is wrong and both have advantages – Lucene certainly has smaller index files. Some judicious use of the XAPIAN_FLUSH_THRESHOLD parameter, as suggested in some of the comments on the article, would have certainly speeded up Xapian indexing. We can also look forward to the release of the new Xapian ‘Chert’ backend, which will produce indexes at least 50% smaller than the current ‘Flint’ backend. It’s also hard to say how important index sizes are in these days of cheap storage.
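As a rough sketch of the tuning mentioned above: XAPIAN_FLUSH_THRESHOLD is an environment variable, so it needs to be set before the writable database is opened. The helper names and the value of 100000 below are our own choices for illustration, not a recommendation, and the second helper assumes the Xapian Python bindings are installed.

```python
import os


def set_flush_threshold(documents):
    """Ask Xapian to buffer this many documents before flushing to disk."""
    # The variable is read when a writable database is opened, so set it first.
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(int(documents))
    return os.environ["XAPIAN_FLUSH_THRESHOLD"]


def open_for_indexing(path, documents=100000):
    """Open (or create) a database with a raised flush threshold."""
    import xapian  # imported here so the helper above works without the bindings

    set_flush_threshold(documents)
    return xapian.WritableDatabase(path, xapian.DB_CREATE_OR_OPEN)


threshold = set_flush_threshold(100000)
```

A larger threshold means fewer, bigger flushes, at the cost of more memory during indexing, so the right value depends on the size of your documents and the RAM available.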

On the search side, Xapian performed comparably to Lucene in terms of relevance and search speed (both were ahead of all the other solutions on these metrics, especially SQLite). He quoted some other metrics, such as a ‘support’ figure, given as a score out of 5, which he admits is entirely subjective – you’d have to ask our customers about that one! There’s also no comparison of features, ease of integration or scalability to very large collections.

We’ve talked before about performance metrics. Vik should be applauded for his article and for releasing his test framework as open source; hopefully this can be a foundation for some more in-depth studies.

Perl client for Flax Search Server

Flax Search Server now has a Perl client, thanks to the guys at Cognidox, who have blogged about why they needed to improve the search facility for their powerful document management system.


Posted in Uncategorized

July 1st, 2009


Python and Flax presentation

My colleague Richard Boulton will be presenting at Europython in Birmingham, U.K. next week, specifically at 15.30 on Tuesday 30th June – an abstract is available. He’ll be talking about Xapian, Xappy and Flax, and showing examples of these in action including one using a Django integration layer.

Update: you can now download the slides for Richard’s talk in OpenOffice format.


Posted in Uncategorized

June 25th, 2009


Flax Search Service alpha release

The Flax team are pleased to announce the alpha release of Flax Search Service (FSS). FSS combines powerful, high-level indexing and search features with a well-designed Web Services interface. FSS is Open Source software (under the MIT licence) and is available as a free download from Google Code.

Web Services and Service Oriented Architectures (SOA) have become increasingly popular in recent years due to their many advantages. FSS provides a RESTful interface in which databases, documents, and searches are represented as resources identified by URLs. For example, to add a document to a database, the document data is POSTed to the database resource. To search for a word or phrase, the client sends the query as a GET request to the database, which responds with a list of matching documents. Indexing transactions may be handled automatically or explicitly by the client.
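To give a flavour of this style of interface, here is a minimal Python sketch that builds the two kinds of request using only the standard library. The server address, the `/dbs/` URL layout, the `q` query parameter and the JSON document format are all assumptions made for illustration; please check the FSS documentation for the real resource paths and formats.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://localhost:8080"  # hypothetical server address


def database_url(db_name):
    # Hypothetical URL layout: each database is a resource under /dbs/
    return f"{BASE_URL}/dbs/{quote(db_name)}"


def add_document_request(db_name, document):
    """Build a POST carrying the document data to the database resource."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        database_url(db_name),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_request(db_name, query):
    """Build a GET that sends the query to the database resource."""
    url = f"{database_url(db_name)}?{urlencode({'q': query})}"
    return Request(url, method="GET")


post = add_document_request("news", {"title": "Flax alpha", "body": "released"})
get = search_request("news", "open source")
# Either request could then be sent with urllib.request.urlopen() against a running server.
```

Because the requests are plain HTTP, the same pattern carries over to any language with an HTTP library, which is what makes this kind of interface easy to integrate.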

For convenience, client libraries are being developed in several languages, including PHP, Python, Java and JavaScript. It would be a simple matter to interface to FSS in any language with support for Web protocols. The FSS distribution also includes example code to get you started, and basic documentation.

FSS alpha supports enough indexing and search functionality to implement basic but useful information retrieval systems. Over the next few months we will be adding support for advanced features like facets and tags, geolocation and image search. It will run on any system with support for Xapian and Python (Windows, Linux and Mac amongst others).


Posted in Uncategorized

June 3rd, 2009


Not so FAST…

Microsoft have announced a roadmap for their enterprise search products: none of this is very surprising. How successful they’ll be at integrating the FAST technology (which comes from a Linux background) with Sharepoint, .NET etc. remains to be seen. More coverage here.

They’ve also released an ‘Express’ (i.e., free but feature limited) version of Microsoft Search Server. We’re going to take a deeper look at this soon.


Posted in Uncategorized

February 12th, 2009


Finding search engine people

I’ve spent some time recently trying to find where people gather and discuss different search engine technologies and approaches. There is a Yahoo group which seems friendly and full of useful content, and a group on LinkedIn, a business networking site. Stephen Arnold’s blog is also a mine of information, with profiles of vendors and some very interesting comments on particular technologies. I’ve also found some more blogs which I’ve added to the blogroll on the right.

As we continue to develop Flax, it’s very interesting to hear about customers’ and developers’ experiences with other engines. If you know of any other places to look, please let me know!

Posted in Uncategorized

February 2nd, 2009


Introducing the Flax Blog

In concert with our new Flax website, we’ve decided to start blogging about development of Xapian and Flax, search technology in general, interesting open source projects and indeed anything else we can think of.

In this first post, I’ll try to explain a little about the motivation behind the Flax project. Here at Lemur Consulting we’ve worked with search engines for decades, starting with Muscat, then building a half-billion-page search for the Webtop project, and more recently working with technologies such as Autonomy IDOL, Ultraseek and Lucene. We know a lot about the features customers need from search tools and how to build them. However, we’re also committed open source enthusiasts – and very few enterprise search engines are open source.

So, we feel the time is right for a complete open source enterprise search product. We’ve called this product Flax, and based it on the Xapian core (because we’re also heavily involved in Xapian, having helped develop it to support the aforementioned Webtop project). Flax is a combination of Xapian, various other programs we’ve developed over the years from spiders to indexers to content extractors, other complementary programs and our combined years of experience in the sector.

If you want to try Flax right now, you can download Flax Basic, a free search tool for Windows. You could also see Flax in action searching millions of interior design items at mydeco or searching tens of millions of UK newspaper stories at NLA Clipsearch. If you want to know more about Flax and how it could help you build a powerful search tool, contact us.

We intend to develop Flax to rival or even surpass commercially available closed source search engines. Even at this early stage, the examples above prove this is a solid, scalable platform with a great future. It’s an exciting project and we’re glad to be able to share our story with you.


Posted in Uncategorized

January 14th, 2009
