One of the best things about the increased use of open source search technology is that features that were previously unattainable for clients with small budgets are now within reach. Our client Bride and Groom Direct, a UK-based business selling wedding gifts and stationery, asked us if we could help improve the search features on their website and in particular the auto-suggest – and they asked us to take a look at the website of US mega-retailer Sears.com for inspiration. They particularly liked the way that while you type, Sears’ website doesn’t just show you suggested words but also clickable picture previews of products you might be looking for.
Using Apache Solr and in under two days we built them a similar feature for their website: since we didn’t have direct access to their development servers we provided both Solr configuration files and a simple JQuery/Javascript demo of the features they needed (it’s about 170 lines of code). Their own developers then integrated these changes based on our notes. I think it’s safe to say that Bride and Groom Direct are a rather smaller business than Sears, but with open source they can have access to equally good search facilities. They’ve been kind enough to let us feature them on our Clients page and as you can see, they’re happy with the results.
The most well known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, above and beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:
- Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years….a point to Elasticsearch!
- Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!
- There are more books about Solr than Elasticsearch…a point to Solr!
- Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!
This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).
2012 has been a fascinating and stormy year for those of us in the search business. We’ve seen a raft of further acquisitions of commercial closed source search companies by bigger players, some convinced that what used to be called Enterprise Search is now a solution to Big Data (like Stephen Arnold we wonder what will succeed Big Data as the next marketing term – I love his phrase “In a quest for revenue, the vendors will wrap basic ideas in a cloud of unknowing”). One acquisition hasn’t gone so smoothly: Autonomy, bought by HP for a price that no-one in the search business thought was remotely sensible, has been accused of being oversold vapourware: this is a story that will continue to develop in 2013. If you want a great overview of the current market read Martin White’s latest research note.
Here in the slightly calmer waters of open source search, we’ve seen a huge rise in enquiries from often blue-chip companies, no longer needing persuasion that open source is a serious contender for even the largest search and content projects. Often these companies have considered large commercial solutions but are put off by both the price and high-pressure marketing tactics – in a world of reduced budgets you simply can’t sell magic beans for a pile of gold. We’ve also seen increased interest in related technologies such as machine learning and automatic categorisation – search really isn’t just about search any more.
At Flax we’re busier than we have ever been and we’re expected the trend to continue. We’re looking forward to running more Cambridge Search Meetups, visiting and helping organise conferences such as Enterprise Search Europe and Lucene Revolution, building our network of carefully chosen partners and of course working on exciting and cutting-edge development projects.
As the storms in our sector continue to rage overhead we’ll simply be getting on with what we do best, building effective search.
I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).
Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model there are a few other examples of this approach: Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.
Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.
Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem and in many cases, simply unnecessary.
It certainly isn’t magic.
It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.
When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.
So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.
We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company who take a pro-active approach to document management – instead of taking the default Sharepoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon AWS Cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with a sub-second response. Their staff can now find letters, contracts, emails and designs quickly via a web interface.
C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.
News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.
The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.
Flax Search Server now has a Perl client, thanks to the guys at Cognidox, who have blogged about why they needed to improve the search facility for their powerful document management system.
When we’re contacted by potential clients, we have to gather as much information as possible about how and why they need search technology. This either takes the form of a physical or telephone meeting and much scribbling in notebooks, or a long exchange of emails. In all cases there are some important questions that must be answered, and I thought it might be useful to list the most common ones here:
How many items do you need to search?
The number of items to search varies widely, from a few thousand to hundreds of millions. This number impacts both the eventual size of a searchable index and how fast it can be built, and will thus inform the eventual system design, both in hardware and software terms. It’s usually possible to search from 5 to 50 million items on a single server – but this also depends on the answer to the next questions:
How big/complex are the items to be searched?
This includes both the size of each item and what data it contains: for example does each item contain a price, or a characteristic like an author’s name, or colour. The item can be part of a group of items, have user tags applied, or be restricted to a certain group of users. The searchable index we build will have to take account of all this information in the correct way, so we can search it effectively.
What other systems must the search engine work with?
Sometimes search engines will have to fit into an existing infrastructure – say an intranet or web application framework – and sometimes they will have to extract information from another system, such as a relational database. The engine may also have to take account of existing security systems, which can impact how each search result is delivered. It may have to deliver search results as a web page, or as a report, or as an email. There’s obviously a huge variety of possible systems to interact with, not least the operating system or platform.
What’s your schedule for delivering a search solution?
This is another key point – it can be relatively quick to build a simple search application, but if the system is going to be very large or very complex, or if a staged delivery based on user feedback is required, then it’s important to know what the expectations are. We’ve installed systems in a couple of days, and built more complex ones over years.
In all cases it’s important to realise that every client will have differing requirements and expectations, and to be sure that everyone ends up satisfied with the end result, the more information we can gather at the start of the process, the better.