Posts Tagged ‘lucene’

Search events for 2013

Here’s a quick roundup of search-related events coming soon:

Next week Lucene/Solr Revolution is to be held in San Diego, with a couple of days of training on April 29th & 30th and the main event on the 1st and 2nd May. This is probably the biggest event dedicated to Apache Lucene/Solr and features a huge array of presentations from Etsy, Wells Fargo, Lucidworks and even Microsoft who are increasingly supporting open source technologies.

Enterprise Search Europe is next on 15th and 16th May with a day of workshops on the 14th, including one from the Flax team. I’m looking forward to the various open source panels and presentations of course, and hearing from people from Ernst & Young, Nielsen Norman Group, Oracle and the University of Manchester. We’re also running a Meetup event on the first evening, open to all, with the usual informal mix of beer, snacks and search!

Some of the Flax team are hoping to attend Berlin Buzzwords on June 3rd & 4th – this conference promises to address “search”, “store” and “scale” – certainly sounds interesting! We know there will be lots of talks on elasticsearch and Lucene/Solr.

There’s more to come in the Autumn of course – more details when we know them. Hope to meet you at one of these great events!

Phony wars: the battle between Solr and Elasticsearch

The most well known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, above and beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:

  • Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years….a point to Elasticsearch!
  • Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!
  • There are more books about Solr than Elasticsearch…a point to Solr!
  • Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!

This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).

Posted in Business, Technical

January 14th, 2013

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking. Most other search technologies use the vector space model, but there are a few other examples of the Bayesian approach. One is Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, which grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly, Muscat was a casualty of the dot-com years, but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem and in many cases is simply unnecessary.

It certainly isn’t magic.
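
For readers curious what a probabilistic ranking function actually looks like, here is a minimal Python sketch of Okapi BM25, the model mentioned above. It is an illustration only, not the Lucene, Xapian or IDOL implementation: the function name, the toy documents and the parameter defaults (k1=1.2, b=0.75, values commonly used) are assumptions, and the IDF shown is one common smoothed variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight; the "1 +" keeps the weight non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalised by document length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, already split into terms.
docs = [
    "open source search engine".split(),
    "closed source search product with search marketing".split(),
    "a recipe for rabbit stew".split(),
]
scores = bm25_scores(["open", "search"], docs)
```

The whole model is a weighted sum of term statistics: how often a term occurs, how rare it is across the collection, and how long the document is. Useful, but plainly arithmetic rather than understanding.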

Following the money….all the way to open source search.

There’s an old saying that to find out what’s really going on, you have to “follow the money”. In the search industry two recent events have pointed the way: firstly, Attivio raised $34 million in new funding. Attivio produce a solution based on their own Active Intelligence Engine (yes, it’s still just a search engine) which itself is based on open source projects such as Apache Lucene. Secondly, this week the new(ish) company formed to offer support for the ElasticSearch open source search engine also raised funding to the tune of $10m.

From these two events we can conclude that the smart money has realised that the enterprise search market is heading in only one direction – towards open source software or solutions mainly based on it (another good example being our partner LucidWorks). News from this week’s ApacheCon in Germany of incredibly busy sessions around Lucene, Solr and ElasticSearch (as well as related and complementary projects such as Stanbol) shows that the technical community agrees. I don’t think this will be the last time we hear of a significant investment by both the financial and technical communities in open source search.

The death of enterprise search is reported, again

There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with PowerPoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.

Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem: “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.” I look forward to a brain-compatible indexing tool! There’s also the mistaken assumption that what works for the wider consumer-focused Web will work for the enterprise – Amazon.com, Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.

Unfortunately, in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything, for example your boss’s salary) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard and few projects get beyond basic indexing of filestores and databases, let alone adding in more people-focused features.

I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this, preserving a consistent metadata model, coping with legacy formats, preserving full security and coping with updates then search should be relatively simple to implement on the resulting central store; however the devil is as ever in the detail.

Posted in News

October 25th, 2012

Apache Lucene & Solr version 4.0 released, a giant leap forward for open source search

This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here are a few highlights:

  • Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache Zookeeper for distributed configuration management.
  • More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
  • A new web administration interface (including Solr Cloud features)
  • New spatial search features including polygon support
  • General performance improvements across the board (for example, fuzzy queries are 1-200 times faster!)
  • Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation; we’ve already been experimenting with storing updatable fields in a NoSQL database
  • Lucene now has pluggable ranking models, so you can for example use BM25 Bayesian ranking, previously only available in search engines such as HP Autonomy and the open source Xapian.

The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!
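
To give a flavour of what a fuzzy query has to do, here is a deliberately naive Python sketch that compares a query against every term in a vocabulary using Levenshtein edit distance. This is not how Lucene implements fuzzy matching; it is the brute-force baseline that any faster approach has to beat. The function names and the term list are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance using the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from `a`
                curr[j - 1] + 1,     # insert a character into `a`
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


def fuzzy_match(query, terms, max_edits=2):
    """Return every term within `max_edits` edits of the query (full scan)."""
    return [t for t in terms if edit_distance(query, t) <= max_edits]


# Hypothetical index vocabulary.
terms = ["lucene", "lucent", "solr", "xapian", "lucerne"]
matches = fuzzy_match("lucene", terms, max_edits=1)
```

The cost of this scan grows with the size of the vocabulary, which is why improvements to fuzzy matching can make such a large difference on a big index.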

Eleven years of open source search

It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.

When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.

So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.

Media monitoring with open source search – 20 times faster than before!

We’re happy to announce we’ve just finished a successful project for a division of the Australian Associated Press to replace a closed source search engine with a considerably more powerful open source solution. You can read the press release here.

As our client had a large investment in stored searches (which represent a client’s interests), all defined in the query language of their previous search engine, we first had to build a modified version of Apache Lucene that replicated exactly this syntax. I’ve previously blogged about how we did this. However, this wasn’t the only challenge: search engines are designed to be good at applying a few queries to a very large document collection, not necessarily at applying tens of thousands of stored queries to every single new document. For media monitoring applications this kind of performance is essential as there may be hundreds of thousands of news articles to monitor every day. The system we’ve built is capable of applying tens of thousands of stored queries every second.

With the rapid increase in the volume of content that media monitoring companies have to check for their clients – today’s news isn’t just in print, but online, in social media and indeed multimedia – it may be that open source software is the only way to build monitoring systems that are economically scalable, while remaining accurate and flexible enough to deliver the right results to clients.
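
The "many stored queries, one new document" problem described above can be sketched in a few lines of Python. This is an illustration of one general technique (indexing the queries so that most of them are never evaluated), not a description of the system built for this project. The class, the method names and the restriction to simple all-terms-required queries are assumptions made to keep the example short.

```python
from collections import defaultdict


class QueryMonitor:
    """Toy monitor: each stored query is a non-empty set of terms that must all occur."""

    def __init__(self):
        self.queries = {}                # query id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries filed under it

    def register(self, query_id, required_terms):
        terms = {t.lower() for t in required_terms}
        self.queries[query_id] = terms
        # File the query under just one of its required terms. If that term is
        # missing from a document, the query cannot match and is never checked.
        self.by_term[min(terms)].add(query_id)

    def match(self, document_text):
        doc_terms = set(document_text.lower().split())
        # Step 1: cheap pre-filter, collecting only queries filed under a term we saw.
        candidates = set()
        for term in doc_terms:
            candidates |= self.by_term.get(term, set())
        # Step 2: full check, but only for the surviving candidates.
        return sorted(q for q in candidates if self.queries[q] <= doc_terms)


monitor = QueryMonitor()
monitor.register("acme-news", ["acme", "merger"])
monitor.register("rugby", ["rugby"])
matched = monitor.match("Acme announces merger with rival")
```

With tens of thousands of stored queries, the pre-filter means each incoming article only triggers a full evaluation of the handful of queries that could plausibly match it.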

Posted in News

July 25th, 2012

An open day on open source search from Sirius & Flax

We spent Friday at the riverside offices of Sirius Corporation, our support partners, for the first and hopefully not the last of their Open Days on open source enterprise search. We were lucky to have Mike Davis, a very well known and highly experienced analyst, to open the talks – despite suffering from flu he gave an engaging talk on why open source enterprise search software should be your first port of call, and how you should only consider closed source options when you need particular features they provide.

We then gave a quick Introduction to Open Source Search, detailing the various packages available (from Apache Lucene/Solr to Xapian and Sphinx) and showing a quick Solr-powered demo we’d built to search some pages from the BBC Music website. Using the programmer’s first choice for an example query (the ever reliable ‘foo*’) we discovered the wonderfully named Original Rabbit Foot Spasm Band – which interestingly you can’t find via the BBC’s own site search engine due to lack of wildcard support.

Andrew Savory, Sirius’ CTO and Apache Foundation member, then gave a presentation on what an Apache project actually is and how best to engage with an open source community – very useful for those considering open source for the first time. The morning finished with a delicious barbeque on the riverbank provided by Sirius. We thought the event went very well and we’d love to confirm the rumour that this will become a regular event. Thanks to all at Sirius for organising and hosting the day and we look forward to returning.

Updating individual fields in Lucene with a Redis-backed codec

A customer of ours has a potential search application which requires (largely for reasons of performance) the ability to update specific individual fields of Apache Lucene documents. This is not the first time that someone has asked for this functionality. However, until now, it has been impossible to change field values in a Lucene document without re-indexing the entire document. This was due to the write-once design of Lucene index segment files, which would necessitate re-writing the entire file if a single value changed.

However, the introduction of pluggable codecs in Lucene 4.0 means that the concrete representation of index segments has been abstracted away from search functionality, and can be specified by the codec designer. The motivation for this was to make it possible to experiment with new compression schemes and other innovations; however, it may also make it possible to overcome the current limitation of whole-document-only updates.

Andrzej Bialecki has proposed a “stacked update” design on top of the Lucene index format, in which changed fields are represented by “diff” documents which “overlay” the values of an existing document. If the “diff” document does not contain a certain field, then the value is taken from the original, overlaid document. This design is currently a work in progress.
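
The overlay idea can be illustrated in Python with `collections.ChainMap`, which looks a key up in each mapping in turn. This is only a conceptual sketch of "diff documents overlaying an original"; it says nothing about how the proposed Lucene design stores or merges them, and the field names are invented.

```python
from collections import ChainMap

# The original, fully indexed document (hypothetical fields).
original = {"id": "doc-1", "title": "Open source search", "tags": "lucene"}

# A "diff" document carrying only the field that changed.
diff = {"tags": "lucene solr"}

# Lookups try the diff first and fall back to the original.
view = ChainMap(diff, original)

updated_tags = view["tags"]      # taken from the diff
unchanged_title = view["title"]  # falls through to the original

# A later diff simply stacks on top of the earlier ones.
newer_view = ChainMap({"title": "Open source search, revisited"}, diff, original)
```

The original document is never modified, which is what makes this style of update compatible with write-once storage.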

Approaching the challenge independently, we have started to experiment with an alternative design, which makes a clear distinction between updatable and non-updatable fields. This is arguably a limitation, but one which may not be important in many practical applications (e.g. adding user tags to documents in a corpus). Non-updatable fields are stored using the standard Lucene codec, while updatable fields are stored externally by a codec that uses Redis, an open-source, flexible, fast key-value store. Updates to these fields could then be made directly in the Redis store using the JRedis library.
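
As a rough picture of this split, the Python sketch below keeps fixed fields in a write-once structure and routes updatable fields to a separate key-value store. It is not the proof-of-concept codec: a plain dict stands in for Redis so the example runs on its own, and the class name, method names and the `doc_id:field` key scheme are all assumptions.

```python
class SplitFieldStore:
    """Toy store: fixed fields are write-once, updatable fields live in a key-value store."""

    def __init__(self, updatable_fields, kv_store=None):
        self.updatable_fields = set(updatable_fields)
        self.segment = {}  # doc id -> fixed fields (stands in for the index segment)
        # A plain dict stands in for Redis here (assumption, to stay self-contained).
        self.kv = kv_store if kv_store is not None else {}

    def add(self, doc_id, fields):
        self.segment[doc_id] = {
            k: v for k, v in fields.items() if k not in self.updatable_fields
        }
        for k, v in fields.items():
            if k in self.updatable_fields:
                self.kv[f"{doc_id}:{k}"] = v

    def update(self, doc_id, field, value):
        if field not in self.updatable_fields:
            raise ValueError(f"{field!r} is not updatable; re-index the document instead")
        # Only the external store is touched; the segment data is left alone.
        self.kv[f"{doc_id}:{field}"] = value

    def get(self, doc_id):
        doc = dict(self.segment[doc_id])
        for field in self.updatable_fields:
            key = f"{doc_id}:{field}"
            if key in self.kv:
                doc[field] = self.kv[key]
        return doc


store = SplitFieldStore(updatable_fields=["user_tags"])
store.add(1, {"title": "Rabbit Foot Spasm Band", "user_tags": "jazz"})
store.update(1, "user_tags", "jazz swing")
```

The trade-off is visible in `update`: changing a tag is a single key write, but any field not declared updatable up front still needs a full re-index.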

We have written a minimal, 2-day proof of concept, which can be checked out with:

svn checkout http://flaxcode.googlecode.com/svn/trunk/LuceneRedisCodec

There is still a significant amount of work to be done to make this approach robust and performant (e.g. when Lucene merges segments, the Redis document IDs will have to be remapped). At this stage we would welcome any comments and suggestions about our approach from anyone who is interested in this area of functionality.
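
The remapping problem mentioned above arises because document IDs are local to a segment and are renumbered when segments merge. The Python sketch below shows the bookkeeping in miniature. It is a sketch under stated assumptions, not code from the proof of concept: the `segment:doc_id:field` key scheme, the segment names and both function names are hypothetical, and a dict again stands in for Redis.

```python
def merge_doc_id_map(segments):
    """Map (segment name, old doc id) to the doc id in the merged segment.

    `segments` is a list of (name, doc_count, deleted_ids) tuples, in merge order.
    """
    mapping = {}
    next_id = 0
    for name, doc_count, deleted in segments:
        for old_id in range(doc_count):
            if old_id in deleted:
                continue  # deleted documents do not survive the merge
            mapping[(name, old_id)] = next_id
            next_id += 1
    return mapping


def remap_keys(kv, mapping, merged_name="merged"):
    """Rewrite 'segment:doc_id:field' keys to point at the merged segment."""
    remapped = {}
    for key, value in kv.items():
        segment, old_id, field = key.split(":", 2)
        new_id = mapping.get((segment, int(old_id)))
        if new_id is not None:  # values for deleted documents are dropped
            remapped[f"{merged_name}:{new_id}:{field}"] = value
    return remapped


segments = [("s0", 3, {1}), ("s1", 2, set())]
mapping = merge_doc_id_map(segments)
kv = {"s0:2:tags": "a", "s1:0:tags": "b", "s0:1:tags": "gone"}
remapped = remap_keys(kv, mapping)
```

In a real codec this rewrite would have to happen as part of the merge itself, so that searches never see index data and external values that disagree about which document is which.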

Posted in Technical

June 22nd, 2012