Archive for November, 2012

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others can do this far better than I can – but I would urge anyone interested to re-read the documents Oracle released earlier this year. I am, however, going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. This has never been true; computers simply don’t understand ‘meaning’ as we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking. Although most other search technologies use the vector space model, there are a few other examples of the Bayesian approach. Muscat, a company founded a few years earlier and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly, Muscat was a casualty of the dot-com years, but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.
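For the curious, the vector space model mentioned above is easy to sketch: represent the query and each document as term-frequency vectors and rank by the cosine of the angle between them. This is purely a toy illustration, nothing like the production code inside Muscat, IDOL or any other engine:

```python
import math
from collections import Counter

def cosine_similarity(query_terms, doc_terms):
    """Vector space ranking: compare term-frequency vectors by the
    cosine of the angle between them (1.0 = identical direction)."""
    q, d = Counter(query_terms), Counter(doc_terms)
    # Dot product only needs the terms the two vectors share
    dot = sum(q[t] * d[t] for t in set(q) & set(d))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

print(cosine_similarity(["search", "engine"],
                        ["a", "search", "engine", "ranks", "documents"]))
```

Real engines weight these vectors (TF-IDF being the classic scheme) rather than using raw counts, but the geometric intuition is the same.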

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking, and we’ve used it successfully to build systems for the Financial Times, the Newspaper Licensing Agency and Tait Electronics. More recently, Apache Lucene/Solr version 4.0 introduced the idea of ‘pluggable’ ranking models, one option being the Bayesian BM25. It’s important to remember, though, that Bayesian ranking is only one way to approach a search problem and in many cases is simply unnecessary.
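For those who like to see the maths, here’s the standard Okapi BM25 formula (with the usual free parameters k1 and b) rendered as a toy in Python – a sketch of the textbook formula over a tiny tokenised corpus, not Lucene’s actual implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25: sum, over query terms, of a smoothed IDF weight
    multiplied by a saturating, length-normalised term frequency."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N   # average document length
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)           # docs containing term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)   # smoothed IDF
        tf = doc_terms.count(term)                         # frequency in this doc
        norm = k1 * (1 - b + b * len(doc_terms) / avgdl)   # length normalisation
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score

corpus = [["open", "source", "search"],
          ["bayesian", "ranking", "search"],
          ["vector", "space", "model"]]
print(bm25_score(["bayesian", "search"], corpus[1], corpus))
```

A document matching both query terms scores higher than one matching only ‘search’, and rare terms contribute more than common ones – that’s all the ‘magic’ there is.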

It certainly isn’t magic.

Following the money… all the way to open source search

There’s an old saying that to find out what’s really going on, you have to “follow the money”. In the search industry two recent events have pointed the way: firstly, Attivio raised $34 million in new funding. Attivio produce a solution based on their own Active Intelligence Engine (yes, it’s still just a search engine) which itself is based on open source projects such as Apache Lucene. Secondly, this week the new(ish) company formed to offer support for the ElasticSearch open source search engine also raised funding to the tune of $10m.

From these two events we can conclude that the smart money has realised that the enterprise search market is heading in only one direction – towards open source software or solutions mainly based on it (another good example being our partner LucidWorks). News from this week’s ApacheCon in Germany of incredibly busy sessions around Lucene, Solr and ElasticSearch (as well as related and complementary projects such as Stanbol) shows that the technical community agrees. I don’t think this will be the last time we hear of a significant investment by both the financial and technical communities in open source search.

A revolution in open standards in government

Something revolutionary has been happening recently in the UK government with regard to open source software, standards and data. Change has been promised before and some commentators have been (entirely correctly) cynical about the eventual result, but it seems that finally we have some concrete results. Not content with a public policy and procurement toolkit for open source software, the Cabinet Office today released a policy on open standards – and unlike many had feared, they have got it right.

Why do open standards matter? Anyone who has attempted to open a Word document of recent vintage in an older version of the same software will know how painful it can be. In the world of search we often have to be creative in how we extract data from proprietary, badly documented and inconsistent formats (get thee behind me, PDF!) – at Flax we came up with a novel method involving a combination of Microsoft’s IFilters and running OpenOffice as a server (you can download our Flax Filters as open source if you’d like to see how this works). If all else fails, it is sometimes possible to extract strings from the raw binary data. However, we generally don’t have to preserve paragraphs, formatting and other specifics – and that is exactly the kind of fine detail that often matters, especially in the government or legal arena. Certain companies have been downright obstructive in how they define their ‘standards’ (and I use that word extremely loosely in this case). The same companies have been accused by many of trying to influence the Cabinet Office consultation process, introducing the badly defined FRAND concept. However, the consultation has been carefully and correctly run, and the eventual policy is clear and well written.
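That last-resort trick – pulling readable strings out of raw binary data – can be sketched in a few lines. This simply mimics the Unix `strings` utility; it isn’t our Flax Filters code:

```python
import re

def extract_strings(path, min_len=4):
    """Last-resort text recovery: pull runs of printable ASCII
    characters out of an otherwise unreadable binary file."""
    with open(path, "rb") as f:
        data = f.read()
    # A 'string' is any run of at least min_len printable ASCII bytes
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.group().decode("ascii") for m in re.finditer(pattern, data)]
```

You lose all structure this way – which is precisely the point of the paragraph above: recovering the words is easy, recovering the paragraphs and formatting is the hard part.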

It will be very interesting to see how commercial closed source companies react to this policy – but in the meantime those of us in the open source camp should be cheered by the news that finally, after many false starts and setbacks, ‘open’ really does mean, well, ‘open’.


Posted in News

November 2nd, 2012
