Posts Tagged ‘cloud’

Lucene/Solr London User Group – Alfresco & Datastax

We had another London user group Meetup last week, hosted by who also provided some tasty pizza – eaten under the ‘Love Mondays’ sign from their adverts, which now lives in their boardroom! A few new faces this time and a couple of great talks from two companies who have incorporated Solr into their platforms.

First up was Andy Hind, a founding developer of document management company Alfresco, who told us all about how they originally based their search capability on Lucene 2.4, then moved to Solr 4.4 and most recently version 4.9.1. Using Solr they have implemented often complex security requirements (originally using a PostFilter as Erik Hatcher describes and more recently in the query itself), structured queries (using Phrase and SpanQueries) and their own domain specific query language (DSL) – they can support SQL-like, Lucene and Google-like queries by passing them through parsers based on ANTLR to be served either by the search engine or whatever relational database Alfresco is using. The move to a recent version of Solr has allowed the most recent release of Alfresco to support various modern search features (facets, spelling suggestions etc.) but Andy did mention that so far they are not using SolrCloud for scaling, preferring to manage this themselves.

Next up was Sergio Bossa of Datastax, talking about how their Datastax Enterprise (DSE) product incorporates Solr searching within an Apache Cassandra cluster. Sergio has previously spoken at our Cambridge search meetup on a very similar subject, so I won’t repeat myself here, but the key point is that Solr lives directly on top of the Cassandra cluster, so you don’t have to worry about it at all – search features are directly available from the Cassandra APIs. Like Alfresco, this is an alternative to SolrCloud (assuming you also need a NoSQL database of course!).

Thanks again to Alex Rice for hosting the Meetup, to both our speakers and to all who came – we’ll return soon! In the meantime you may want to check out a few events coming later this year: Berlin Buzzwords, ApacheCon Europe and Lucene/Solr Revolution.

Tags: , , , ,

Posted in Technical, events

February 16th, 2015

No Comments »

Solr Superclusters for improved federated search

As part of our BioSolr project, we’ve been discussing how best to create a federated search over several Apache Solr instances. In this case various research institutions across the world are annotating data objects representing proteins and it would be useful to search not just the original protein data, but what others have added to the body of knowledge. If an institution wants to use the annotations, the usual approach is to download the extra data regularly and add it into a local Solr index.

Luckily Solr is widely used in the bioinformatics community so we have commonality in the query API. The question is would it be possible to use some of the distributed querying capabilities of SolrCloud to search not just the shards of a single index, but a group of Solr/SolrCloud indices – a supercluster.

This is a bit like a standard federated search, where queries are farmed out to various disparate search engines and the results then combined and displayed in some fashion. However, since we are sharing a single technology, powerful features such as result grouping would be possible.

For this to work at all, there would need to be some agreed standard between the various Solr systems: a globally unique record identifier for example (possibly implemented with a prefix unique to each institution). Any data that was required for result grouping would have to share a schema across the entire supercluster – let’s call this the primary schema – but basic searching and faceting could still be carried out over data with a differing, secondary schema. Solr dynamic fields might be useful for this secondary schema.

Luckily, research institutions are used to working as part of a consortium, and one of the conditions for joining would be agreeing to some common standards. A single Solr query API would then be available to all members of the consortium, to search not just their own data but everything available from their partners, without the slow and error-prone process of copying the data for local indexing.

We’re currently evaluating the feasibility of this idea and would welcome input from others – let us know what you think in the comments!

Tags: , , ,

Posted in Technical

January 20th, 2015

1 Comment »

More than an API – the real third wave of search technology

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative or market leading – it’s simply an API to a cloud hosted search engine and some associated services. Amazon Cloudsearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard a HP representative even suggest that IDOL OnDemand is ‘free software’ which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores and folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items, thousands of separate indexes : the scale of some of these systems is incredible and only economically possible where license fees aren’t a factor. Across our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.

London Elasticsearch User Group – September Meetup

Last night I joined a good-sized crowd at a venue on Hoxton Square for some talks on Elasticsearch – this Meetup group is very popular and always attracts a good proportion of people new to the world of search, as well as some familiar faces. I started with a quick announcement of our own Elasticsearch hackday in a few weeks time.

First of the speakers was Richard Pijnenburg with a surprisingly brief talk on Puppet and Elasticsearch – brief, because integrating the two is apparently very simple, requiring only a few lines of Puppet code. Some questions from the floor sparked a discussion of combining Puppet and Vagrant for setting up Elasticsearch instances: apparently very soon we’ll see a complete demo instance of Elasticsearch built using these technologies and including some example data, which will be very useful for those wanting to get started with the engine (here’s some more on this combination).

Next was Amit Talhan, ably assisted by Geza Kerekes, both from AlignAlytics who have been using Elasticsearch both as a data store, reporting store and more recently for analysing data from a survey of all the retail outlets in Nigeria. Generating a wealth of data across up to 1000 fields, including geolocation data harvested every five seconds, this survey could have been difficult if not impossible to handle using a traditional SQL database, but many of their colleagues were very used to SQL syntax and methods for analyzing data. Amit and Geza explained how they have used Elasticsearch and in particular aggregations to provide functionality such as checking for bad reporting by surveyors and unexpectedly high density areas (such as markets, where there may be 200 retail outlets in a few square metres). One challenge seems to have been how to explain to colleagues from the data analysis community that Elasticsearch can provide some, but not all of the functionality of a traditional database, but that alternative ways of indexing and querying data can be used to solve the same problems. Interestingly, performance testing by AlignAlytics proved that BigStep, a provider of ‘bare metal’ cloud hosting, could provide much better performance than their own dedicated servers.

Next was Mark Harwood with another of his fascinating investigations into how Elasticsearch can be used for analysis of user behaviour, showing how after a bad personal experience buying a new battery that turned out to be second-hand, he identified vendors with suspiciously positive reviews. He also discussed how behaviour-based term suggesters might be built using Elasticsearch’s significant_terms aggregration. His demonstration did remind me slightly of Xapian’s relevance feedback feature. I heard several people later say that they wished they had time for some of the fun projects Mark seems to work on!

The event finished with some lively discussion and some free pizza courtesy of Elasticsearch (the company). Thanks to Yann Cluchey as ever for organising the event and I look forward to seeing a few of the attendees in Cambridge soon – we’re only an hour or so by train from Cambridge plus a ten minute walk to the venue, so it should be an easy trip!

Why GCloud search is badly broken & how to fix it

The GCloud initiative and the associated CloudStore are a great idea – hoping to level the field of UK government IT supply, take advantage of flexible and agile delivery of software and services and help SMEs like ourselves compete against the large System Integrators (SIs) that dominate this market. GCloud sales have now reached £154m although this is still a fraction of what the UK government spends on IT. We’re on GCloud 5 ourselves by the way so I have a vested interest in helping potential customers find us, and we’ve helped with government systems before.

Unfortunately the Cloudstore itself has a search facility that is badly broken. There are several obvious issues: many of the entries created by the larger suppliers have been keyword stuffed – here’s a particularly egregious example from Atos which seems to include most of the terms used in software in the last few years. I found this using the search terms ‘enterprise search’ which produces very few relevant looking results. The online guidance for CloudStore search suggests putting double quotes around my terms (sadly I think few users will think of this) which improves things a little but there are still a lot of irrelevant results – an online conferencing system is fifth for example.

Fortunately all is not lost and in the next iteration of GCloud we are promised major improvements to the search engine. I’m hoping this will include phrase boosting. However, if the big SIs and others are allowed to create the sort of bad-quality content I have shown above, no search engine in the world will be able to sort the wheat from the chaff. It is essential that CloudStore entries are subject to some kind of curation and that keyword stuffing is banned and/or heavily penalised, otherwise SMEs like ourselves will still find it very hard to compete with the big SIs.

Update: it seems there is a new system under construction, and the search works a lot better. Let’s hope it comes out of alpha soon and can be used by purchasers!

Tags: , , ,

Posted in Business, Technical

June 26th, 2014

No Comments »

Lucene Revolution 2013, Dublin: day 1

Four of the Flax team are in Dublin this week for Lucene Revolution, almost certainly the largest event centred on open source search and specifically Lucene. There are probably a couple of hundred Lucene enthusiasts here and the event is being held at the Aviva Stadium on Landsdowne Road: look out the windows and you can see the pitch! Here are some personal reflections: a number of the talks I attended today have a connection to our own work in media monitoring which we’re talking about tomorrow.

Doug Turnbull’s Test Driven Relevancy was interesting, discussing OSC’s Quepid tool that allows content owners and search experts to work together to tweak and tune Solr’s options to present the right results for a query. I wondered whether this tool might eventually be used to develop a Learning to Rank option for Solr, as Lucene 4 now supports a pluggable scoring model.

I enjoyed Real-Time Inverted Search in the Cloud Using Lucene and Storm during which Joshua Conlin told us about running hundreds of thousands of stored queries in a distrubuted architecture. Storm in particular sounds worth investigating further. There is currently no attempt to reduce or ‘prune’ the set of queries before applying them: Joshua quoted speeds of 4000 queries/sec across their cluster of 8 instances: impressive numbers, but our own monitoring applications are working at 20 times that speed by working out which queries not to apply.

I broke out at this point to catch up with some contacts, including the redoubtable Iain Fletcher of Search Technologies – always a pleasure. After a sandwich lunch I went along to hear Andrzej Bialecki of Lucidworks talk about Sidecar Indexes, a method for allowing rapid updates to Lucene fields. This reminded me of our own experiments in this area using Lucene’s pluggable codecs.

Next was more from the Opensource Connections team, as John Berryman talked about their work to update a patent search application that uses a very old search syntax, BRS. This sounds very much the work we’ve done to translate one search engine syntax into another for various media monitoring companies – so far we can handle dtSearch and we’re currently finishing off support for HP/Autonomy Verity’s VQL (PDF).

This latter issue has got me thinking that perhaps it might be possible to collaboratively develop an open source search engine query language – various parsers could be developed to turn other search syntaxes into this language, and search engines like Lucene (or anything else) could then be extended to implement support for it. This would potentially allow much easier migration between search engine technologies. I’m discussing the concept with various folks at the event this week so do please get in touch if you are interested!

Back tomorrow with a further update on this exciting conference – tonight we’re all off to the Temple Bar area of Dublin for food and drink, generously provided by Lucidworks who should also be thanked for organising the Revolution.

Tags: , , , , , ,

Posted in Technical, events

November 6th, 2013


Meetups, genomes and hack days: Grant Ingersoll visits the UK

Lucene/Solr commiter, Mahout co-creator, LucidWorks co-founder and general all-round search expert Grant Ingersoll visited us last week on his way to the SIGIR conference in Dublin. We visited the European Bioinformatics Institute on the Wellcome Trust Genome Campus to hear about some fascinating projects using Lucene/Solr to index genomes, phenomes and proteins and for Grant to give a talk on recent developments in both Lucene/Solr and Mahout – it was gratifying that over 50 people turned up to listen and at least 30 of these indicated they were using the technology.

After a brief rest it was then time to travel to London so Grant could talk at the Enterprise Search London Meetup on both recent developments in Lucene/Solr and what he dubbed ‘Search engine (ab)use’ – some crazy use cases of Lucene/Solr including for very fast key/value storage. Some great statistics including how Twitter make new tweets searchable in around 50 microseconds using only 8-10 indexing servers.

Next it was back to Cambridge for our own Lucene/Solr hack day in a great new co-working space. Attendees ranged from those who had never used Lucene/Solr to those with significant search expertise, and some had come from as far away as Germany – after a brief introduction we split into several groups each mentored by a member of the Flax team. Two groups (one comprised entirely of those who had never used Lucene) worked on a dataset of tweets from UK members of parliament and a healthy sense of competition developed between them – you can see some of the code they developed at in our Github account including an entity extractor webservice. Another group, led by Grant, created a SolrCloud cluster, with around 1-2 million documents split into 2 shards – running on ten laptops over a wireless connection! Impressively this was set up in less than ten minutes. Others worked on their own applications including an index of proteins and there was even some work on the Lucene/Solr code itself.

We’re hoping to put the results of some of these projects live very soon, so you can see just what can be built in a single day using this powerful open source software. Thanks to all who came, our hosts at Cambridge Business Lounge and of course Grant for his considerable energy and invaluable expertise. If nothing else, we’ve introduced a lot more people to open source search and sparked some ideas, and we ended off the week with beer in a sunny pub garden which is always nice!

Search Solutions 2012 – a review

Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.

Milad Shokouhi of Microsoft Research started us off showing us how he’s worked on query trend analysis for Bing: he showed us how some queries are regular, some spike and go and some spike and remain – and how these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues which is worth a look.

Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” usually do the same things, such as regular user testing and employing a specialist internal search team.

Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interopability with other vocabularies – some great points on the need for thesauri to break down language barriers, help retrieval in enterprise situations where techniques such as PageRank aren’t so useful and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.

Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.

The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.

Cambridge Search Meetup – Search for publication success and low-cost apps

After a short break the Cambridge Search Meetup returned last night with our usual mix of presentations, questions, networking, beer and snacks. We had a few issues with the projector and cables (one of these is on the shopping list for next time) so thanks to both presenters and audience for their patience!

First up was Liang Shen with a description of Journal Selector, a system for helping those publishing academic papers to find the correct journals to approach. The system allows one to copy and paste a chunk of a paper to a website and find which journals best match the subject matter, based on what they have published in the past. Running on the Amazon EC2 cloud the service indexes journals from feeds, HTML webpages and other sources, processes and stores this data in Amazon’s Hadoop-compatible database, indexes it with Apache Solr and then presents the results via the Drupal CMS. The results are impressive, allowing users to see exactly on what basis the system has recommended a journal to approach. You can see the presentation slides here.

Next was Rich Marr, who bravely offered to live-code a demonstration of his low-cost prototyping methodology for startups needing both NoSQL data storage and search across this data. In only 20 lines or so of code he showed us how to use Node.js to build a simple server that could accept messages (over Telnet, although HTTP or even IMAP would be as easy), store them in a CouchDB database and index them for searching (using a different message) with Elasticsearch. Rich’s demo prompted a lively discussion of how commoditized and componentized search technology is becoming, with open source components that allow one to build a prototype search engine in minutes.

Thanks to both our speakers – and the Meetups continue, with Rich Marr’s own London Open Source Search Social meeting on Tuesday 23rd October, and in Cambridge the Data Insights Meetup where I’ll be talking on November 1st.

Apache Lucene & Solr version 4.0 released, a giant leap forward for open source search

This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here’s a few highlights:

  • Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache Zookeeper for distributed configuration management.
  • More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
  • A new web administration interface (including Solr Cloud features)
  • New spatial search features including polygon support
  • General performance improvements across the board (for example, fuzzy queries are 1-200 times faster!)
  • Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation, we’ve already been experimenting with storing updatable fields in a NoSQL database
  • Lucene now has pluggable ranking models, so you can for example use BM25 Bayesian ranking, previously only available in search engines such as HP Autonomy and the open source Xapian.

The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!