The best-known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:
- Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years…a point to Elasticsearch!
- Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!
- There are more books about Solr than Elasticsearch…a point to Solr!
- Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!
This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).
This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here are a few highlights:
- Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache Zookeeper for distributed configuration management.
- More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
- A new web administration interface (including Solr Cloud features)
- New spatial search features including polygon support
- General performance improvements across the board (for example, fuzzy queries are up to 200 times faster!)
- Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation; we’ve already been experimenting with storing updatable fields in a NoSQL database
- Lucene now has pluggable ranking models, so you can for example use BM25 probabilistic ranking, a style of ranking previously only available in search engines such as HP Autonomy and the open source Xapian.
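For readers who haven’t met BM25 before, here is a minimal sketch of the scoring idea in plain Python. It is not Lucene’s or Xapian’s implementation: the function name `bm25_scores` is our own invention for illustration, the documents are assumed to be pre-tokenised lists of words, and `k1=1.2` and `b=0.75` are simply commonly quoted default values.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms (illustrative only)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # A common idf variant that stays non-negative for very frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates (k1) and is normalised by document length (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["solr", "search", "engine"],
    ["xapian", "search"],
    ["cooking", "recipes"],
]
scores = bm25_scores(["solr", "search"], docs)
```

With this toy collection the first document, which contains both query terms, scores highest, and the third, which contains neither, scores zero.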
The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!
Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).
Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview for example). The GSA is also not a particularly cheap option as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure license fees for reasonably sized collections of a few million documents – and that’s for two years, after which time you have to buy it again. Not surprisingly some people have migrated to other solutions.
However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.
It will be interesting to see if this version of the GSA is the last…
It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.
When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.
So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.
There’s been a recent flurry of activity from search vendors (and those larger companies that have been buying them) around the theme of Big Data, which has become the fashionable marketing term for a sheaf of technologies including search, machine learning, Map Reduce and for scalability in general. If anyone impertinently asks why company X bought company Y the answer seems to be ‘because they have capability in Big Data and our customers will need this’.
Search companies like ours have been working with large datasets since the beginning – back in 1999/2000 the founders of Flax led a team to build a half-billion-page Web search engine, which as I recall ran on a cluster of 30 or so servers. Since then we’ve worked with other collections of tens or hundreds of millions of items. Even a relatively small company can have a few million files on their intranet, if you count all those emails, customer records and PowerPoint presentations. So yes, you could say we can do Big Data – we certainly know how to design and build systems that scale.
However it makes me nervous when a set of technologies that could (in theory) be used together are simply lumped together for marketing purposes as the Next Big Thing. The devil is as always in the detail (and the integration) and it’s important to remember that just because you can fit all your data into a system doesn’t mean that system will help you make any sense of it. A recent term for unstructured data (which of course we search developers have been working with for decades) is Dark Data, which implies that it is mysterious and hidden – but that doesn’t mean it has any actual value. Those considering a Big Data project should be aware that in any computer system GIGO (garbage in, garbage out) is still an issue.
Amazon have just launched a cloud-based search service, which promises a ‘fully managed search service in the cloud’ – and it certainly looks impressive, with auto-scaling built in. You simply create a service, upload documents as JSON or XML and then perform searches. For cases where you need to search publicly available data this offers a great way to avoid having to install and integrate any search software – of course it won’t be so popular if you’re worried about where your data actually is, or other complications such as the Patriot Act.
As you might expect, some people are already offering services based around CloudSearch (we’d be happy to do the same - just ask!) and there’s a demo of searching Wikipedia available. I’m not sure who SmackBot is but I’m slightly wary of reading any Wikipedia articles it’s had something to do with…
Of course searching Wikipedia is nothing new and I sometimes wish for a different choice of source material for search demos.
One thing that seems clear is that with the rise of cloud-based search options (here’s another from our partners Lucid Imagination, based on Apache Lucene/Solr) the cost and complication of ‘simple’ search projects could fall dramatically, applying further pressure to those companies selling closed source search engines for frankly unrealistic prices. Amazon’s offering, with their huge experience in cloud-based services, has the potential to be a game changer for the search market.
Media monitoring is not a traditional search application: for a start, instead of searching a large number of documents with a single query, a media monitoring application must search every incoming news story with potentially thousands of queries, searching for words and terms relevant to client requirements. This can be difficult to scale, especially when accuracy must be maintained – a client won’t be happy if their media monitors miss relevant stories or send them news that isn’t relevant.
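To make the ‘inverted’ nature of the problem concrete, here is a small Python sketch of matching one incoming story against a set of stored queries. It is purely illustrative and is not the system described below: `StoredQuery`, `build_term_index` and `match_story` are hypothetical names, the query model (a set of required terms plus a set of excluded terms) is a simplifying assumption, and real client queries are usually far richer.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StoredQuery:
    client: str
    all_of: frozenset                  # every one of these terms must appear (non-empty)
    none_of: frozenset = frozenset()   # none of these terms may appear

def tokenize(text):
    """Lower-case the text and strip simple punctuation from each word."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}

def build_term_index(queries):
    """File each query under one of its required terms, so that a story only
    has to be checked against queries that share a word with it."""
    index = {}
    for q in queries:
        key = min(q.all_of)  # any required term would do as the pre-filter key
        index.setdefault(key, []).append(q)
    return index

def match_story(text, term_index):
    """Return the clients whose stored queries match this story."""
    words = tokenize(text)
    hits = set()
    for w in words:
        for q in term_index.get(w, []):
            if q.all_of <= words and not (q.none_of & words):
                hits.add(q.client)
    return sorted(hits)

queries = [
    StoredQuery("acme", frozenset({"acme", "merger"})),
    StoredQuery("fruitco", frozenset({"apple"}), frozenset({"iphone"})),
]
term_index = build_term_index(queries)
```

Indexing the queries rather than the documents is one way to avoid running every stored query against every story; the excluded terms are a crude stand-in for the kind of rule that keeps false positives down.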
We’ve been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant ‘false positives’, the new system is now 95% correct).
The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. The new system scales easily and cost effectively.
As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.
#1 – How does it work?
You’ll probably get as many different answers to this as there are vendors – but you may not get the whole truth. Bear in mind that a lot of search engines share the same theoretical ideas. An engine might use a vector-space or probabilistic model for ordering results, for example. Most will create an inverted index, which records for each term the documents that contain it.
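If the term ‘inverted index’ is unfamiliar, the following toy Python sketch shows the basic idea: a map from each term to the set of documents containing it. The function names are ours and the whitespace tokenisation is deliberately naive; a real engine adds positions, term frequencies, compression and a great deal more.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, query):
    """Return the ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

docs = {
    1: "open source search engine",
    2: "closed source search appliance",
    3: "open data",
}
index = build_inverted_index(docs)
```

A query is then answered by looking up each term and intersecting the resulting sets, rather than by scanning every document.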
#2 – How fast is it?
Every search engine will take a finite amount of time to index a document or produce search results. Some of these processes will be limited by how fast data can be written to or read from disk, some by how fast the processor can do calculations. The key point is whether this time is going to work for you – will your users care if some complicated queries take ten seconds rather than a fraction of a second? Is there a time in the middle of the night when the system can spend a couple of hours building a new index? Watch out for silly answers such as “it’s instantaneous”.
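One way to get past “it’s instantaneous” is to measure it yourself with your own queries. Here is a rough Python sketch of a timing harness: `measure_latency` is a hypothetical helper of ours, and `search_fn` stands in for whatever call actually sends a query to the engine you are evaluating.

```python
import statistics
import time

def measure_latency(search_fn, queries, repeats=3):
    """Time each query several times and summarise the results in seconds."""
    timings = []
    for q in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(q)
            timings.append(time.perf_counter() - start)
    return {
        "runs": len(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }

# Stand-in for a real search call, just so the sketch runs on its own.
result = measure_latency(lambda q: sum(range(1000)), ["solr", "xapian"], repeats=2)
```

Looking at the slowest queries as well as the median is usually more revealing than a single average, since it is the ten-second outliers your users will notice.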
#3 – How does it scale?
Whatever data you have today, you’ll have more tomorrow! How many servers will you need today, and how easy is it to add more in the future as necessary? Will this affect the speed of indexing or searching? Cloud-based solutions can help, especially when the amount of data or queries can be variable.
#4 – How much does it cost?
This is a question with several potential answers: the cost of a software license (of course, with open source code this can be zero), the cost of integration and customisation so the engine fits your requirements and the cost of ongoing support. Beware of a solution that promises much, but only after months of customisation. You should also ask how the cost scales with any growth in the number of source documents or users.
#5 – What happens if the vendor is taken over or disappears?
If the vendor is acquired by another company, or goes out of business, what happens to the software? The new owners may force you to move to their preferred solution, or in the worst case you can be left with no support for an obsolescent product. Ask if the vendor offers escrow. Open source licensing may also be a solution.
The above is not meant to be a complete list – feel free to suggest further questions!