Archive for the ‘Technical’ Category

Building high-end search features at low cost with Apache Solr

One of the best things about the increased use of open source search technology is that features that were previously unattainable for clients with small budgets are now within reach. Our client Bride and Groom Direct, a UK-based business selling wedding gifts and stationery, asked us if we could help improve the search features on their website and in particular the auto-suggest – and they asked us to take a look at the website of US mega-retailer Sears.com for inspiration. They particularly liked the way that while you type, Sears’ website doesn’t just show you suggested words but also clickable picture previews of products you might be looking for.

Using Apache Solr, we built them a similar feature for their website in under two days. Since we didn’t have direct access to their development servers, we provided both Solr configuration files and a simple jQuery/JavaScript demo of the features they needed (it’s about 170 lines of code). Their own developers then integrated these changes based on our notes. I think it’s safe to say that Bride and Groom Direct are a rather smaller business than Sears, but with open source they can have access to equally good search facilities. They’ve been kind enough to let us feature them on our Clients page and as you can see, they’re happy with the results.
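
As a rough illustration of the query side of such a feature (this is not the code we delivered, which was Solr configuration plus jQuery), here is a minimal Python sketch. It assumes a hypothetical field `name_autosuggest` analysed for prefix matching and hypothetical stored fields `name`, `url` and `thumbnail_url`; the request parameters and the JSON response layout are standard Solr.

```python
from urllib.parse import urlencode


def build_suggest_url(select_url, typed_text, rows=5):
    """Build a Solr select URL for the text the user has typed so far."""
    params = {
        "q": typed_text,
        "defType": "edismax",
        "qf": "name_autosuggest",        # hypothetical prefix-analysed field
        "fl": "name,url,thumbnail_url",  # hypothetical stored fields
        "rows": rows,                    # only a handful of previews are needed
        "wt": "json",
    }
    return select_url + "?" + urlencode(params)


def extract_previews(solr_response):
    """Turn a parsed Solr JSON response into label/link/image previews."""
    docs = solr_response["response"]["docs"]
    return [
        {"label": d["name"], "link": d["url"], "image": d.get("thumbnail_url")}
        for d in docs
    ]
```

A front end would call something like `build_suggest_url` on each keystroke (ideally with a short delay) and render the previews as clickable thumbnails.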

Posted in Technical

March 1st, 2013

Cambridge Search Meetup – a night of crawling and scraping

Last night was the busiest ever Cambridge Search Meetup, with two excellent talks and a lot of discussion and networking. First was Harry Waye of Arachnys, who provide access to data on emerging markets that no-one else has, using a variety of custom crawling technology and heavy use of tools such as Google Translate. If you want to trawl the Greek corporate registry or find financial news from Kazakhstan, a standard Google search is little help: Harry talked about how Arachnys have experimented with Google Custom Search Engine and the ‘headless browser’ PhantomJS to crawl sites.

Our second talk was from Shane Evans, who I first met when he led software development for our client Mydeco. While there he first worked on the development of an open source Python crawling framework, Scrapy: Shane showed how easy it is to get a Scrapy web spider running in a few lines of code, and how extensible and customisable Scrapy is for a huge variety of crawling and scraping situations. There’s even a fully hosted version at Scrapinghub with graphical tools for setting up web crawling and page scraping. We’re big fans of Scrapy at Flax and we’ve used it in a number of projects, so it was good to see an overview of why Scrapy exists and how it can be used.

Thanks to both our speakers, who travelled from out of town, as did several other attendees. We’re pleased to say this was our 15th Meetup and we now have 100 members. We’re already planning further events: one will be on the evening of the first day of the Enterprise Search Europe conference.

Posted in Technical, events

February 22nd, 2013

Phony wars: the battle between Solr and Elasticsearch

The best-known open source search engine, Apache Lucene/Solr, has a rival in Elasticsearch, also based on Apache Lucene. Or maybe it doesn’t. I’m not convinced that there’s an actual battle going on here, above and beyond the fact that the commercial companies formed to support each technology (Lucidworks and Elasticsearch [the company]) are obviously competitors. Let’s look at the evidence:

  • Elasticsearch contains (by some measures) 64 years of effort, Solr only 55 years…a point to Elasticsearch!
  • Elasticsearch commits are 31% down on last year, Solr commits are 85% up…a point to Solr!
  • There are more books about Solr than Elasticsearch…a point to Solr!
  • Elasticsearch, sorry elasticsearch, has a cool lower case logo and fancy website…a point to Elasticsearch!

This is of course before we get to any actual technical differences in terms of performance, scalability, ease-of-use etc. which are probably a lot more important than the list above. There are vocal critics and supporters of each project on Twitter and other media, but the great thing in our view is that there is a choice of two such excellent search technologies, both open source, so for real world applications one can try both at little cost and choose whichever is most appropriate (there are even proven migration routes between the two – we’ve helped one client with this process).

Posted in Business, Technical

January 14th, 2013

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model there are a few other examples of this approach: Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem, and in many cases it is simply unnecessary.

It certainly isn’t magic.
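
To show how un-magical this kind of ranking is, here is a minimal Python sketch of the BM25 scoring function mentioned above. It is a textbook-style illustration, not code from IDOL, Xapian or Lucene; the function name and the toy inputs are hypothetical, and `k1=1.2`, `b=0.75` are simply commonly used defaults.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document (a list of terms) against a list of query terms.

    doc_freqs maps each term to the number of documents containing it.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        df = doc_freqs.get(term, 0)
        # Rare terms get a higher weight than common ones.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturates, and long documents are penalised.
        length_norm = 1 - b + b * doc_len / avg_doc_len
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score
```

Everything here is counting and arithmetic: how often a term appears in a document, how rare it is across the collection, and how long the document is.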

Apache Lucene & Solr version 4.0 released, a giant leap forward for open source search

This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here are a few highlights:

  • Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache Zookeeper for distributed configuration management.
  • More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
  • A new web administration interface (including Solr Cloud features)
  • New spatial search features including polygon support
  • General performance improvements across the board (for example, fuzzy queries are 1-200 times faster!)
  • Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation. We’ve already been experimenting with storing updatable fields in a NoSQL database
  • Lucene now has pluggable ranking models, so you can for example use BM25 Bayesian ranking, previously only available in search engines such as HP Autonomy and the open source Xapian.

The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!

Tuning and improving elasticsearch for the Government Digital Service

The exciting GOV.UK project is getting close to its first release date of October 17th and the team asked us to help with some search tuning as they migrate from Apache Solr to elasticsearch. Although elasticsearch has some great features there are still some areas where it lags behind Solr, such as the lack of spelling suggestion and proximity boost features. Alan from Flax spent a couple of days working with the GDS team and has blogged about how proximity boosting in particular can be implemented – at least for terms that are relatively close to each other rather than being separated by a page or so.

If you’re interested in more details of how we fixed this and a few other elasticsearch issues, you may want to take a look at the code we worked on – one of the best things about working with the GOV.UK team is that it was already up as open source software within a day (yes, you read that right – code paid for by the taxpayer is open source, as it should be!). We’re looking forward to launch day!

Update: changed ‘proximity search’ to ‘proximity boost’ – thanks Alan!
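
For readers unfamiliar with the idea, one common way to express a proximity boost in the elasticsearch query DSL is sketched below in Python. This is not necessarily how the GOV.UK code does it: the helper name, the field name and the `slop` and `boost` values are illustrative assumptions. The required `match` clause decides which documents are returned, while the optional `match_phrase` clause with `slop` raises the score of documents where the terms occur close together.

```python
import json


def proximity_boosted_query(field, text, slop=10, boost=2.0):
    """Build a query body that matches on terms and boosts nearby terms."""
    return {
        "query": {
            "bool": {
                # Documents must match the terms somewhere in the field.
                "must": [{"match": {field: text}}],
                # Documents score higher if the terms are within `slop`
                # positions of each other; this clause is optional.
                "should": [
                    {
                        "match_phrase": {
                            field: {"query": text, "slop": slop, "boost": boost}
                        }
                    }
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(proximity_boosted_query("description", "council tax"), indent=2))
```

Because the phrase clause is only a `should`, documents where the terms are far apart are still returned, just ranked lower.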

Posted in Technical

October 1st, 2012

Updating individual fields in Lucene with a Redis-backed codec

A customer of ours has a potential search application which requires (largely for reasons of performance) the ability to update specific individual fields of Apache Lucene documents. This is not the first time that someone has asked for this functionality. However, until now, it has been impossible to change field values in a Lucene document without re-indexing the entire document. This was due to the write-once design of Lucene index segment files, which would necessitate re-writing the entire file if a single value changes.

However, the introduction of pluggable codecs in Lucene 4.0 means that the concrete representation of index segments has been abstracted away from search functionality, and can be specified by the codec designer. The motivation for this was to make it possible to experiment with new compression schemes and other innovations; however, it may also make it possible to overcome the current limitation of whole-document-only updates.

Andrzej Bialecki has proposed a “stacked update” design on top of the Lucene index format, in which changed fields are represented by “diff” documents which “overlay” the values of an existing document. If the “diff” document does not contain a certain field, then the value is taken from the original, overlaid document. This design is currently a work in progress.
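
The lookup behaviour of such an overlay can be illustrated with a few lines of Python. This is only a sketch of the semantics described above, using plain dictionaries and hypothetical field names; it says nothing about how the proposal is implemented inside Lucene.

```python
from collections import ChainMap


def overlay(original, *diffs):
    """Return a read-only style view where later diffs win over earlier ones.

    Fields missing from every diff fall through to the original document.
    """
    # ChainMap searches its maps left to right, so put the newest diff first.
    return ChainMap(*reversed(diffs), original)


original = {"id": "doc1", "title": "Budget 2012", "tags": "news"}
first_diff = {"tags": "news economy"}
second_diff = {"tags": "news economy tax"}

view = overlay(original, first_diff, second_diff)
print(view["tags"])   # value from the most recent diff
print(view["title"])  # value from the original, untouched document
```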

Approaching the challenge independently, we have started to experiment with an alternative design, which makes a clear distinction between updatable and non-updatable fields. This is arguably a limitation, but one which may not be important in many practical applications (e.g. adding user tags to documents in a corpus). Non-updatable fields are stored using the standard Lucene codec, while updatable fields are stored externally by a codec that uses Redis, an open-source, flexible, fast key-value store. Updates to these fields could then be made directly in the Redis store using the JRedis library.
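
The split between the two kinds of field can be sketched in Python as follows. This is a toy model and not our codec: two in-memory dictionaries stand in for the Lucene segment and the Redis store, and the class name, method names and key layout are all hypothetical.

```python
class SplitFieldStore:
    """Toy model: write-once fields in one store, updatable fields in another."""

    def __init__(self, updatable_fields):
        self.updatable_fields = frozenset(updatable_fields)
        self._segment = {}    # stand-in for write-once Lucene segment data
        self._key_value = {}  # stand-in for an external store such as Redis

    def add_document(self, doc_id, fields):
        if doc_id in self._segment:
            raise ValueError("documents are written once; re-index to replace")
        self._segment[doc_id] = {
            name: value for name, value in fields.items()
            if name not in self.updatable_fields
        }
        for name in self.updatable_fields & fields.keys():
            self._key_value[(doc_id, name)] = fields[name]

    def update_field(self, doc_id, name, value):
        """Change one updatable field without touching the segment data."""
        if name not in self.updatable_fields:
            raise ValueError(f"field {name!r} is not updatable")
        if doc_id not in self._segment:
            raise KeyError(doc_id)
        self._key_value[(doc_id, name)] = value

    def get_document(self, doc_id):
        doc = dict(self._segment[doc_id])
        for name in self.updatable_fields:
            if (doc_id, name) in self._key_value:
                doc[name] = self._key_value[(doc_id, name)]
        return doc
```

For example, with `store = SplitFieldStore({"tags"})`, a call such as `store.update_field(1, "tags", "wedding gifts")` changes only the key-value entry and leaves the indexed body of document 1 untouched.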

We have written a minimal, 2-day proof of concept, which can be checked out with:

svn checkout http://flaxcode.googlecode.com/svn/trunk/LuceneRedisCodec

There is still a significant amount of work to be done to make this approach robust and performant (e.g. when Lucene merges segments, the Redis document IDs will have to be remapped). At this stage we would welcome any comments and suggestions about our approach from anyone who is interested in this area of functionality.

Posted in Technical

June 22nd, 2012

Better search for e-petitions – handling misspelled content with a Solr phonetic filter

We recently overhauled the search functionality for the UK government’s e-petitions site, run by the Government Digital Service, a new team within the Cabinet Office. Search has an important function on the site; users are forced to search for existing petitions which cover their area of concern before creating a new one. This cuts down on the number of near-duplicate petitions, and makes petitions more effective.

The website is implemented in Ruby on Rails, using the Sunspot Solr client library. There are currently only 22,000 petitions, of no more than a few kilobytes each – easily enough to fit into the cache of a standard server. Despite this, the previous configuration was performing badly, and maxing out 8 CPU cores on a virtual machine under a load of a few hundred queries per second. Retrieval was also poor, with no results at all found for queries like “EU”.

The first thing we did was to install Solr 3.6 (the previous version was the rather elderly 1.4) running in Jetty on Ubuntu. Then we looked at the schema and search implementation. The former was using the standard Sunspot field mappings, which is fine for many applications but in this case was not allowing flexibility of weighting. Searches used the standard query parser to parse a hand-constructed query string with different field weightings and frequent use of the fuzzy match operator (e.g. “leasehold~0.8”). This seemed to be the most likely cause of poor performance under load.

Fuzzy matching had been used because of the frequent misspellings in petition text entered by users (e.g. “marraige” instead of “marriage”). Solr spelling correction on the query is not appropriate here, as correctly-spelled queries may not find misspelled content. But since fuzzy matching was performing badly on a relatively small index, we needed a new approach.

What we came up with was two levels of fields: the first being normalised with lowercasing and KStem but otherwise matching exactly, the second using a PhoneticFilterFactory to perform a Double Metaphone encoding on terms. We hoped that the misspellings in the corpus would transform to the same terms under this filter (e.g. “marriage” and “marraige” both yielding “MJ”). The exact fields should provide precision; the phonetic fields, retrieval. Fields were populated using the copyField directive, without changing the client indexing code. We configured an eDisMax query handler to provide a simple interface and removed the custom query string construction from the client code.
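
The principle behind the two levels can be shown with a small, self-contained Python sketch. Note the substitution: it uses the much simpler Soundex algorithm as a stand-in for Double Metaphone (which Solr provides and we did not reimplement), and the `search` function and sample petition titles are hypothetical. Exact matches are ranked ahead of matches found only through the phonetic key.

```python
# Soundex digit for each consonant; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return the four-character Soundex key for a word ('' if no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "hw":
            continue  # h and w do not separate repeated codes
        code = CODES.get(letter, "")
        if code and code != previous:
            key += code
        previous = code  # a vowel resets this, allowing a repeated code
    return (key + "000")[:4]


def search(query, titles):
    """Exact word matches first, then titles that match only phonetically."""
    exact, phonetic = [], []
    query_word, query_key = query.lower(), soundex(query)
    for title in titles:
        words = title.lower().split()
        if query_word in words:
            exact.append(title)
        elif query_key and query_key in {soundex(w) for w in words}:
            phonetic.append(title)
    return exact + phonetic


petitions = ["Legalise same-sex marraige", "Reform marriage law", "Lower fuel duty"]
print(search("marriage", petitions))
```

In Solr itself the same effect comes from querying both field levels at once with different weights, rather than from application code like this.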

In practice, this worked very well – the new server can handle search loads at least 5 times greater than the previous one, and the CPUs are never maxed out (despite the server having only 4 cores compared with the previous 8). Ranking and retrieval are also greatly improved, and searches for “EU” return relevant petitions!

Phonetic algorithms are never going to catch all misspellings, and had Solr 4.0 been released at this time (with its very fast fuzzy engine) then it would have been the obvious approach to try. However, for now the search is much better, for less than 2 days of effort.

Posted in News, Technical

May 24th, 2012

An open source replacement for the dtSearch closed source search engine

We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to make as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).

First, we developed a new Lucene Analyzer that speaks the same syntax as dtSearch, allowing us to index text input. On the search side we have a Lucene QueryParser that shares this syntax. To make it easier to use we’ve wrapped the whole lot in a modified Solr server. As we needed some features of very recent Lucene code, our modifications are based on a patch to Lucene trunk (and so the source code isn’t for the faint hearted – if you need it let us know, but we’re not currently providing it for download).

We’re not sure if there’s anyone else out there who needs an open source alternative to dtSearch – but in case there is we’ve provided a downloadable WAR file with the latest Solr executables in our downloads area, including a brief README file.

More generally, what this project demonstrates is that even if you have significant investment in your existing search infrastructure it is entirely possible to move to an open source alternative, which may be faster and will almost certainly be more economically scalable. Does anyone else have a search engine they’d like to replace?

Amazon CloudSearch – a game changer?

Amazon have just launched a cloud-based search service, which promises a ‘fully managed search service in the cloud’ – and it certainly looks impressive, with auto-scaling built in. You simply create a service, upload documents as JSON or XML and then perform searches. For cases where you need to search publicly available data this offers a great way to avoid having to install and integrate any search software – of course it won’t be so popular if you’re worried about where your data actually is, or other complications such as the Patriot Act.

As you might expect, some people are already offering services based around CloudSearch (we’d be happy to do the same – just ask!) and there’s a demo of searching Wikipedia available. I’m not sure who SmackBot is but I’m slightly wary of reading any Wikipedia articles it’s had something to do with…

Of course searching Wikipedia is nothing new and I sometimes wish for a different choice of source material for search demos.

One thing that seems clear is that with the rise of cloud-based search options (here’s another from our partners Lucid Imagination, based on Apache Lucene/Solr) the cost and complication of ‘simple’ search projects could fall dramatically, applying further pressure to those companies selling closed source search engines for frankly unrealistic prices. Amazon’s offering, with their huge experience in cloud-based services, has the potential to be a game changer for the search market.

Posted in News, Technical

April 12th, 2012
