Archive for the ‘Uncategorized’ Category

Following the money….all the way to open source search.

There’s an old saying that to find out what’s really going on, you have to “follow the money”. In the search industry two recent events have pointed the way: firstly, Attivio raised $34 million in new funding. Attivio produce a solution based on their own Active Intelligence Engine (yes, it’s still just a search engine) which itself is based on open source projects such as Apache Lucene. Secondly, this week the new(ish) company formed to offer support for the ElasticSearch open source search engine also raised funding to the tune of $10m.

From these two events we can conclude that the smart money has realised that the enterprise search market is heading in only one direction – towards open source software or solutions mainly based on it (another good example being our partner LucidWorks). News from this week’s ApacheCon in Germany of incredibly busy sessions around Lucene, Solr and ElasticSearch (as well as related and complementary projects such as Stanbol) shows that the technical community agrees. I don’t think this will be the last time we hear of a significant investment by both the financial and technical communities in open source search.

The Twelve Days of (Search) Christmas

On the twelfth day of (Search) Christmas my inbox brought to me:

Twelve users searching,
Eleven pages found,
Ten facets shown,
Nine Search Meetups,
Eight entity extractors,
Seven SOLR servers,
Six Xapian patches,
Five Open Source,
Four cloud apps,
Three Lucid partners,
Two big acquisitions,
And a Mike Lynch on board at HP.

Have a great Christmas and New Year from everyone at Flax.

Posted in Uncategorized, events

December 22nd, 2011

Cambridge Search Meetup review – Two different kinds of university search

James Alexander of the Open University talked first on the Access to Video Assets project, a prototype system that looked at preservation, digitisation and access to thousands of TV programmes originally broadcast by the BBC. James’ team have worked out an approach based on open source software – storing programme metadata and video assets in a Fedora Commons repository, indexing and searching using Apache Solr, authentication via Drupal – that is a testament to the flexibility of these packages (some of which are being used in non-traditional ways – for example Drupal is used in a ‘nodeless’ fashion). He showed the search interface, which allows you to find the exact points within a long video where particular words are mentioned and play the video directly in a pop-up window. I’d seen this talk before (here’s a video and slides from Lucene Eurocon) but what I hadn’t grasped is how Solr is used as a mediation layer between the user and what can be some very complex data around the video asset itself (subtitles, rights information, format information, scripts etc.). As he mentioned, search is being used as a gateway technology to effective re-use of this huge archive.

Udo Kruschwitz was next with a brief treatment of his ongoing work on automatically extracting domain knowledge and using this to improve search results (for example see the ‘Suggestions’ on the University of Essex website). He showed us some of the various methods his team have tried for analysing query logs, including Ant Colony Optimisation (modelling ‘trails’ of queries that can be reinforced by repeat visits, or ‘fade’ over time as they are less used). I liked the concept of developing a ‘community’ search profile where individual search profiles are hard to obtain, and how this could be simply subdivided (so for example searchers from inside a university might have a different profile to those outside). The key idea here is that all these techniques are automatic, so the system is continually evolving to give better search suggestions and hints. Udo and his team are soon to release an open source adaptive search framework to be called “Sunny Aberdeen” which we look forward to hearing about.
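As a rough illustration of the ‘trails’ idea, here is a minimal Python sketch. It is not Udo’s actual method or code: the class name, the parameters and the sample queries are all invented for this example, and a real system would learn from genuine query logs. Each time one query is followed by another the trail between them is reinforced, and every trail fades a little at each time step.

```python
class QueryTrails:
    """Toy pheromone model over query refinements (illustrative only)."""

    def __init__(self, evaporation=0.1, deposit=1.0):
        self.evaporation = evaporation  # fraction of weight lost per time step
        self.deposit = deposit          # weight added each time a trail is followed
        self.weights = {}               # query -> {follow-up query: weight}

    def record(self, query, follow_up):
        """Reinforce the trail from a query to the query that followed it."""
        trails = self.weights.setdefault(query, {})
        trails[follow_up] = trails.get(follow_up, 0.0) + self.deposit

    def evaporate(self):
        """Fade every trail a little; call once per time step (e.g. nightly)."""
        for trails in self.weights.values():
            for follow_up in trails:
                trails[follow_up] *= 1.0 - self.evaporation

    def suggest(self, query, limit=3):
        """Return the strongest follow-up queries for a query."""
        trails = self.weights.get(query, {})
        ranked = sorted(trails.items(), key=lambda item: item[1], reverse=True)
        return [follow_up for follow_up, _ in ranked[:limit]]


# Hypothetical log entries: two users refined "library" the same way.
trails = QueryTrails(evaporation=0.5)
trails.record("library", "library opening hours")
trails.record("library", "library opening hours")
trails.record("library", "library catalogue")
trails.evaporate()
print(trails.suggest("library"))
```

Keeping one set of trails per user group (say, inside and outside the university) would give the subdivided ‘community’ profiles mentioned above.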

The evening ended with networking and a pint or two in traditional fashion – thanks to both our speakers and to all who came, from as far afield as Milton Keynes, Essex and Luton. The group now has 70 members and we’re building an active and friendly local community of search enthusiasts.

Outside the search box – when you need more than just a search engine

Core search features are increasingly a commodity: you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP. This all used to be a lot harder than it is now (unfortunately some vendors would have you believe this is still the case, which is reflected in their hefty price tags).
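To show how little is needed for the core idea, here is a toy inverted index in Python. It is only a sketch: the sample documents are made up, and a real engine such as Lucene or Xapian adds ranking, stemming, compression and much more on top of this.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Made-up documents, keyed by id.
documents = {
    1: "Open source search with Solr",
    2: "Closed source enterprise search",
    3: "Solr and Xapian are open source",
}
index = build_index(documents)
print(sorted(search(index, "open source")))  # [1, 3]
```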

However we’re increasingly asked to develop features outside the traditional search stack, to make this standard search a lot more accurate/relevant or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique to extract entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part-of-Speech (POS) tagging tells you which words are nouns, verbs etc. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece (positive, negative or neutral, for example), which is very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you the sense in which a word is being used (did you mean pen for writing or pen for livestock?).
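The pen example can be sketched with a simplified version of the Lesk approach to WSD, which picks the sense whose dictionary gloss shares the most words with the surrounding sentence. The glosses and stopword list below are hand-written for this illustration and are not taken from any real dictionary or product.

```python
import re

# Hypothetical, hand-written glosses for two senses of "pen".
SENSES = {
    "writing instrument": "an instrument for writing or drawing with ink on paper",
    "animal enclosure": "a small enclosure or fenced area for keeping livestock such as sheep or pigs",
}

# A tiny illustrative stopword list; real systems use much longer ones.
STOPWORDS = {"a", "an", "the", "for", "or", "of", "with", "on", "in", "as", "such", "and", "to"}


def content_words(text):
    """Return the set of lower-cased words in the text, minus stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def disambiguate(sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda sense: len(context & content_words(senses[sense])))


print(disambiguate("The farmer moved the sheep into the pen"))
print(disambiguate("She signed the paper with a pen full of blue ink"))
```

When no words overlap at all this sketch simply returns the first sense listed, which is one reason real WSD systems use far richer context than a single sentence.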

There are commercial offerings from companies such as Nstein and Lexalytics that offer some of these features. An increasing number of companies provide their services as APIs, where you pay per use – for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics. The Apache Mahout machine learning library allows for these techniques to be scaled to very large data sets.

These techniques can be used to build systems that don’t just provide powerful and enhanced search, but automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box that is!

Is Enterprise Search dead? No, but it’s changing…

I spent yesterday morning at Ovum’s briefing on Enterprise Search, and they kindly invited me to sit on a discussion panel. One of the more controversial topics raised by analyst Mike Davis was ‘Is Enterprise Search dead?’ which provoked some lively discussion. We also heard from Tyler Tate of Twigkit on Search UX, Exalead on Search Based Applications and Search Technologies on data conditioning and why metadata is so important.

One can’t deny that the search market is going through some huge changes at the moment. Larger vendors are being acquired, which can lead to some major (and not always welcome) changes in product, pricing and service. Smaller vendors are finding it increasingly hard to compete with the plethora of powerful open source solutions (we’ve heard rumours of prices of closed source solutions being dropped radically to attempt to secure new business). There are also some interesting moves towards more comprehensive Business Intelligence and Unified Access solutions, such as Attivio.

I don’t think enterprise search is dying as a market or an offering, simply changing – and hopefully for the better, into an era of more realistic pricing, solutions that actually work (rather than promising ‘magic’) and more openness in terms of the technology and capability.

Networking in a great city for enterprise search

Cambridge, U.K., has a long history of hosting search experts and businesses. Back in the 1980s two firms arose: Cambridge CD Publishing, founded by Martin Porter and John Snyder, grew into Muscat, and Cambridge Neurodynamics became Autonomy. We believe Smartlogic still have a small office here. Stephen Robertson, co-author of the probabilistic theory of information retrieval (which Xapian uses for ranking), is based here at Microsoft Research.

Today, the city is still home to innovative search companies, including True Knowledge, Grapeshot and of course ourselves. We know of many more ‘under the radar’ companies developing search technologies, both to complement existing systems and as completely new approaches to information retrieval, including visual search.

To encourage networking and to help keep the city at the forefront of search developments, we’ve created the Enterprise Search Cambridge Meetup group and our first meeting is on February 16th – all are welcome, whether currently working with search and related technologies or simply interested in the possibilities. Hope to meet you there!

Posted in Uncategorized, events

January 14th, 2011

How not to make the same mistake twice

We’ve been aware for a while now that some FAST customers will be considering migration – and Autonomy have finally caught up.

However, if you migrate from one closed source solution to another, how can you guarantee that the same sort of events that have led to the current situation won’t happen again? With open source, there’s no vendor lock-in, a wide choice of companies to assist you with development and integration, a wealth of different support options and of course no license fees to pay. Migrating from FAST is a common topic at conferences at the moment – read Jan Høydahl’s presentation, or see Michael McIntosh’s video. There are even open source document processing pipeline frameworks to replace the popular FAST one, and we’ve been evaluating some alternative language processing frameworks. Scaling isn’t an issue and in some cases you could significantly reduce your hardware budget.

Posted in Uncategorized

December 6th, 2010

Questions to ask your search vendor

#1 – How does it work?
You’ll probably get as many different answers to this as there are vendors – but you may not get the whole truth. Bear in mind that a lot of search engines share the same underlying theoretical ideas. An engine might use a vector-space or probabilistic model for ordering results, for example. Most will create an inverted index.

#2 – How fast is it?
Every search engine will take a finite amount of time to index a document or produce search results. Some of these processes will be limited by how fast data can be written to or read from disk, some by how fast the processor can do calculations. The key point is whether this time is going to work for you – will your users care if some complicated queries take ten seconds rather than a fraction of a second? Is there a time in the middle of the night when the system can spend a couple of hours building a new index? Watch out for silly answers such as “it’s instantaneous”.

#3 – How does it scale?
Whatever data you have today, you’ll have more tomorrow! How many servers will you need today, and how easy is it to add more in the future as necessary? Will this affect the speed of indexing or searching? Cloud-based solutions can help, especially when the amount of data or queries can be variable.

#4 – How much does it cost?
This is a question with several potential answers: the cost of a software license (of course, with open source code this can be zero), the cost of integration and customisation so the engine fits your requirements and the cost of ongoing support. Beware of a solution that promises much, but only after months of customisation. You should also ask how the cost scales with any growth in the number of source documents or users.

#5 – What happens if the vendor is taken over or disappears?
If the vendor is acquired by another company, or goes out of business, what happens to the software? The new owners may force you to move to their preferred solution, or in the worst case you can be left with no support for an obsolescent product. Ask if the vendor offers escrow. Open source licensing may also be a solution.

The above is not meant to be a complete list – feel free to suggest further questions!
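To make question #1 a little more concrete, here is a sketch of how a probabilistic model might order results, using one common variant of the BM25 weighting scheme. The `k1` and `b` defaults are typical textbook values and the sample documents are invented; this is not the exact formula or code used by any particular engine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with a common BM25 variant."""
    if not documents:
        return []
    tokenized = [tokenize(doc) for doc in documents]
    doc_count = len(tokenized)
    average_length = sum(len(tokens) for tokens in tokenized) / doc_count
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq[term]
            if tf == 0:
                continue
            n = doc_freq[term]
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (doc_count - n + 0.5) / (n + 0.5))
            # Longer-than-average documents are penalised slightly.
            length_norm = 1 - b + b * len(tokens) / average_length
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up documents for illustration.
documents = [
    "open source search engine",
    "closed source search product with a hefty price tag",
    "cooking recipes",
]
scores = bm25_scores("open source search", documents)
for text, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {text}")
```

A vendor who can explain their ranking at roughly this level of detail is giving you a far more useful answer than “it just works”.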

Posted in Uncategorized

November 2nd, 2010

Online Information 2009, day 1

I visited the Online Information exhibition yesterday at Olympia. My first impression was that the exhibition area was very quiet – and a few of the exhibitors agreed with me. The current financial situation would seem the obvious cause. At previous shows exhibitors have given away all kinds of freebies, from bags, to mini mice, to branded juggling balls….but this year you’d be lucky if you came away with a couple of free pens and a boiled sweet.

I dropped in on the associated conference later, and caught a presentation titled “The Real Time Web: Discovery vs. Search”. Antonio Gulli of Microsoft told us about their new European offices, including one in Soho, that were concentrating on bringing new features to Bing – but the results look very familiar. Is Bing doomed to play catch-up? The only ‘real time’ feature he discussed was indexing Twitter, although apparently they’ll soon be indexing Facebook as well. Surely real time encompasses more than these two platforms?

Stephen Arnold gave us his thoughts on what we should mean by ‘real time’, sensibly talking about how the financial services world has been using real time systems for many years. He also injected some notes of caution about how difficult it is to trust information spread amongst peers on social networking sites – here’s a recent case, read further down the page for a great quote from Graham Cluley.

Someone from Endeca (I didn’t catch the name, he was replacing the published speaker) showed us lots of slides of various applications of search, but his theme seemed to be more about how search can replace traditional databases (something I’ve blogged about recently) than about ‘real time’.

We finished with Conrad Wolfram, demonstrating Wolfram Alpha, which isn’t really a search engine but rather a computation engine – it tries to give you a set of answers, rather than a list of possible resources where the answer might be found. Not a lot of ‘real time’ here either.

I’m back on Thursday as part of the closing keynote panel.

Posted in Uncategorized, events

December 2nd, 2009

Xapian compared

Vik Singh has been comparing various open source solutions for search. He only spent a weekend performing the comparison, which is probably not enough time to get any search software performing at its best, and his results reflect this. Xapian was marked down for being slow at indexing (he says 5x slower than SQLite – but then again, SQLite isn’t a search engine, it’s an RDBMS, and really isn’t suitable for search applications) and for producing large index files, much bigger than Lucene’s.

The reason for this is that Xapian stores different information to Lucene. For example, the full term list (un-inverted index) is retained, which makes it possible to do relevance feedback. Also, Lucene handles deletes by maintaining a separate list of deleted documents, which is merged at the next optimise step – which means that the internal statistics are wrong until this point, and that updates can be more complicated, as an updated document needs a new ID.
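The delete behaviour described above can be pictured with a toy index that records ‘tombstones’ rather than removing postings straight away. This is a deliberately simplified sketch with invented names, not how Lucene is actually implemented, but it shows why a raw statistic such as document frequency can be out of date until the merge, and why an update ends up with a new ID.

```python
class TombstoneIndex:
    """Toy inverted index that marks deletions instead of applying them."""

    def __init__(self):
        self.postings = {}    # term -> set of document ids
        self.deleted = set()  # ids marked as deleted but not yet purged
        self.next_id = 0

    def add(self, text):
        """Index a document and return its newly allocated id."""
        doc_id = self.next_id
        self.next_id += 1
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)
        return doc_id

    def delete(self, doc_id):
        """Cheap delete: only record a tombstone."""
        self.deleted.add(doc_id)

    def update(self, doc_id, text):
        """An update is a delete plus an add, so the document gets a new id."""
        self.delete(doc_id)
        return self.add(text)

    def search(self, term):
        """Tombstoned documents are filtered out at query time."""
        return self.postings.get(term, set()) - self.deleted

    def doc_freq(self, term):
        """Raw statistic: still counts tombstoned documents until merge()."""
        return len(self.postings.get(term, set()))

    def merge(self):
        """Purge tombstoned documents, bringing the statistics up to date."""
        for term in list(self.postings):
            self.postings[term] -= self.deleted
            if not self.postings[term]:
                del self.postings[term]
        self.deleted.clear()


index = TombstoneIndex()
first = index.add("xapian flint backend")
index.add("xapian chert backend")
second = index.update(first, "lucene index")  # the old id is tombstoned

hits = index.search("xapian")     # only the live document matches
stale = index.doc_freq("xapian")  # still counts the tombstoned document
index.merge()
fresh = index.doc_freq("xapian")  # correct once the merge has run
print(hits, stale, fresh, first, second)
```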

Neither approach is wrong and both have advantages – Lucene certainly has smaller index files. Some judicious use of the XAPIAN_FLUSH_THRESHOLD parameter, as suggested in some of the comments on the article, would certainly have sped up Xapian indexing. We can also look forward to the release of the new Xapian ‘Chert’ backend, which will produce indexes at least 50% smaller than the current ‘Flint’ backend. It’s also hard to say how important index sizes are in these days of cheap storage.

On the search side, Xapian performed comparably to Lucene in terms of relevance and search speed (both were ahead of all the other solutions on these metrics, especially SQLite). There are some other metrics he quoted, such as a ‘support’ figure, given as a score out of 5, which he admits is entirely subjective – you’d have to ask our customers about that one! There’s also no comparison of features, ease of integration or scalability to very large collections.

We’ve talked before about performance metrics. Vik should be applauded for his article and for releasing his test framework as open source; hopefully this can be a foundation for some more in-depth studies.