Posts Tagged ‘analyst’

Cambridge Search Meetup – Elasticsearch Hackday

Last Friday we hosted a hackday featuring Elasticsearch in Cambridge, following a similar event last year focused on Apache Lucene/Solr. Around 20 people attended from organisations working in sectors including analytics, digital music, bioinformatics and e-commerce, and all the Flax team were there as well.

We started with a brief presentation on Elasticsearch and asked around the room for any data collections we might be able to use. Lee from Elasticsearch (the company) had brought collections of UK crime data and the complete works of Shakespeare; we also had several million rows of digital music metadata, Wikipedia edit data for all UK MPs (to follow last year’s theme!) and several years of data describing Premier League football. Unlike our Solr hackday where each team worked on the same general task, this time we split into four different teams who worked on all of the above except the Wikipedia edits. We’d also been provided with a very high-performance Elasticsearch cluster by BigStep for our use, which meant it was very quick to index the above data and start working with it.

By lunchtime (the food was sponsored by Elasticsearch, who also provided stickers, plush ELKs and lollypops – thanks guys!) we had some very basic information about the various datasets – such as which scene in which Shakespeare play has the most characters on stage (the answer is 21 in Richard III), and which football teams seemed to gain the most advantage from playing at home. Note that we had already moved beyond basic search functionality to use Elasticsearch as an analytic platform, answering particular questions, using features such as aggregations.
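For readers who haven't used aggregations, here is a minimal sketch of the kind of request that could answer the home advantage question. It is not the code any team wrote on the day: the field names `home_team` and `home_goal_difference` are hypothetical, and the request body is built as a plain Python dictionary so that it can be passed to whichever Elasticsearch client you prefer.

```python
import json


def home_advantage_request(top_n=5):
    """Build a search body ranking teams by average goal difference at home."""
    return {
        "size": 0,  # we only want the aggregation results, not the matches
        "aggs": {
            "by_home_team": {
                "terms": {
                    "field": "home_team",  # hypothetical field name
                    "size": top_n,
                    # sort the team buckets by the sub-aggregation below
                    "order": {"avg_goal_diff": "desc"},
                },
                "aggs": {
                    "avg_goal_diff": {
                        # hypothetical field: home goals minus away goals
                        "avg": {"field": "home_goal_difference"}
                    }
                },
            }
        },
    }


def summarise(response):
    """Turn an aggregation response into (team, average goal difference) pairs."""
    buckets = response["aggregations"]["by_home_team"]["buckets"]
    return [(b["key"], b["avg_goal_diff"]["value"]) for b in buckets]


if __name__ == "__main__":
    print(json.dumps(home_advantage_request(), indent=2))
```

The point of the sketch is that no documents are returned at all (`size` is 0): the engine does the counting and averaging, which is what makes it feel more like an analytics platform than a search box.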

We continued during the afternoon to develop the various applications and finished with a ‘show and tell’. Some of the teams had managed to develop user interfaces for Elasticsearch, the most polished being a clickable Google Map that would show you which types of crime were significantly above and below the national average for the area you selected – unsurprisingly in Cambridge, stolen bicycles were very common! By the end of the day, everyone had gained experience of Elasticsearch, some for the first time. We finished the day, as is traditional, with a swift pint and further networking.

Thanks to Cambridge Business Lounge (a highly recommended co-working space) for the venue, BigStep for hosting and Elasticsearch for sponsoring lunch and providing the swag, and of course to all who attended. We’ll return with a further Cambridge Search Meetup soon!

London Elasticsearch User Group – September Meetup

Last night I joined a good-sized crowd at a venue on Hoxton Square for some talks on Elasticsearch – this Meetup group is very popular and always attracts a good proportion of people new to the world of search, as well as some familiar faces. I started with a quick announcement of our own Elasticsearch hackday in a few weeks’ time.

First of the speakers was Richard Pijnenburg with a surprisingly brief talk on Puppet and Elasticsearch – brief, because integrating the two is apparently very simple, requiring only a few lines of Puppet code. Some questions from the floor sparked a discussion of combining Puppet and Vagrant for setting up Elasticsearch instances: apparently very soon we’ll see a complete demo instance of Elasticsearch built using these technologies and including some example data, which will be very useful for those wanting to get started with the engine (here’s some more on this combination).

Next was Amit Talhan, ably assisted by Geza Kerekes, both from AlignAlytics, who have been using Elasticsearch as a data store, as a reporting store and, more recently, for analysing data from a survey of all the retail outlets in Nigeria. Generating a wealth of data across up to 1000 fields, including geolocation data harvested every five seconds, this survey could have been difficult if not impossible to handle using a traditional SQL database, but many of their colleagues were very used to SQL syntax and methods for analyzing data. Amit and Geza explained how they have used Elasticsearch and in particular aggregations to provide functionality such as checking for bad reporting by surveyors and unexpectedly high density areas (such as markets, where there may be 200 retail outlets in a few square metres). One challenge seems to have been how to explain to colleagues from the data analysis community that Elasticsearch can provide some, but not all, of the functionality of a traditional database, but that alternative ways of indexing and querying data can be used to solve the same problems. Interestingly, performance testing by AlignAlytics proved that BigStep, a provider of ‘bare metal’ cloud hosting, could provide much better performance than their own dedicated servers.
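As an illustration of how an aggregation might flag unexpectedly dense areas, here is a minimal sketch using Elasticsearch's `geohash_grid` aggregation, which groups geo-points into grid cells and counts the documents in each. This is our own guess at an approach, not the AlignAlytics implementation: the `location` field name, the precision and the threshold of 200 outlets are all assumptions.

```python
def density_request(precision=8):
    """Build a search body that counts outlets per geohash grid cell."""
    return {
        "size": 0,  # only the per-cell counts are needed
        "aggs": {
            "cells": {
                "geohash_grid": {
                    "field": "location",     # hypothetical geo_point field
                    "precision": precision,  # longer geohash = smaller cell
                }
            }
        },
    }


def crowded_cells(response, threshold=200):
    """Return (geohash, count) for every cell at or above the threshold."""
    buckets = response["aggregations"]["cells"]["buckets"]
    return [
        (b["key"], b["doc_count"])
        for b in buckets
        if b["doc_count"] >= threshold
    ]
```

Cells that come back from `crowded_cells` would then be candidates for a closer look: they might be a genuine market, or a surveyor reporting badly.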

Next was Mark Harwood with another of his fascinating investigations into how Elasticsearch can be used for analysis of user behaviour, showing how after a bad personal experience buying a new battery that turned out to be second-hand, he identified vendors with suspiciously positive reviews. He also discussed how behaviour-based term suggesters might be built using Elasticsearch’s significant_terms aggregation. His demonstration did remind me slightly of Xapian’s relevance feedback feature. I heard several people later say that they wished they had time for some of the fun projects Mark seems to work on!
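To give a flavour of what a significant_terms request looks like, here is a minimal sketch. It is not Mark's actual analysis: the index layout and the field names `vendor_id` and `reviewer_id` are hypothetical. The idea is that the query selects one vendor's reviews as the foreground set, and the aggregation then reports reviewers who appear in that set far more often than they do in the index as a whole.

```python
def unusual_reviewers_request(vendor_id, size=10):
    """Build a search body listing reviewers over-represented for one vendor."""
    return {
        "size": 0,
        # foreground set: reviews of a single vendor (hypothetical field)
        "query": {"term": {"vendor_id": vendor_id}},
        "aggs": {
            "unusual_reviewers": {
                "significant_terms": {
                    "field": "reviewer_id",  # hypothetical field name
                    "size": size,
                }
            }
        },
    }


def reviewer_scores(response):
    """Return (reviewer, score) pairs, highest significance score first."""
    buckets = response["aggregations"]["unusual_reviewers"]["buckets"]
    pairs = [(b["key"], b["score"]) for b in buckets]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

A reviewer with a high score is not proof of anything on its own, but it is a useful starting point for spotting reviews that cluster suspiciously around one vendor.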

The event finished with some lively discussion and some free pizza courtesy of Elasticsearch (the company). Thanks to Yann Cluchey as ever for organising the event and I look forward to seeing a few of the attendees in Cambridge soon – we’re only an hour or so by train from London plus a ten-minute walk to the venue, so it should be an easy trip!

Analysts getting a bad press – how can they do better?

It seems to be a bad summer for analyst companies in several sectors: here’s Forrester getting a kicking from Digital Clarity Group about their Wave report on Digital Experience Delivery Platforms (my first challenge was understanding what on earth those are, but I think it’s a new shiny name for web content management), Nuix putting the boot into Gartner about their eDiscovery Magic Quadrant, and Stephen Few jumping up and down in hobnail boots on both analyst firms about Business Intelligence (insert your own joke here), complete with a not particularly enlightening reply from Forrester themselves.

Miles Kehoe has already taken a look at Gartner’s Magic Quadrant report on our own Enterprise Search sector. I’ve written before on how I don’t think open source solutions are particularly well treated by the large analyst firms, as they often focus on vendors only. The world has somewhat changed though and five of the seventeen vendors mentioned are using a base of open source technology, so at least some of this major part of the market is covered.

However, the problem remains that the MQ ignores a great deal of the enterprise search sector: it doesn’t cover SharePoint with its FAST-derived search facility, Oracle’s Endeca (which apparently is now no longer available as a standalone product, a surprise to me), Funnelback (which is again incorrectly labelled as open source – it’s the Squiz CMS software that’s open source, not the search engine they bought) or the rising star of Elasticsearch. If you were new to the sector you might conclude that none of these options are available to you. Gartner itself says “This Magic Quadrant introduces search managers and information architects in end-user organizations to the range of enterprise search vendors they can choose from” – but this range is severely and artificially restricted.

Let’s hope that the analyst firms take note of some of this bad press – perhaps it’s time to change approach, be more open about biases and methodologies, and stop producing hugely oversimplified diagrams to characterise complex and deep business sectors.

Posted in Business

July 30th, 2014

As Hadoop gains, does Lucene benefit?

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in an HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There have also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

Why we won’t pay to play at conferences

One unedifying result of having been asked to speak on open source search at various events and conferences over the last few years is the discovery that not all events are equal – some genuinely wish to create a programme of interesting talks of value to the audience, and some simply wish to sell as much sponsorship as possible to those who would like to present. Some of the larger analyst firms are guilty of this behaviour – their Summits and Forums are often packed with talks by big-budget solution providers (and their industry sector reports similarly reflect the fact that if you pay, you play). At Flax we don’t have much budget for sponsorship so we’re often excluded, even though the talks we give are seldom if ever pushing any particular solution – a benefit of the open source model is that even if you hear about it from us you can still go and download and use the software yourself without paying us or anyone else a penny.

Luckily there are events that don’t work like this – the excellent Search Solutions day run in late Autumn by the British Computer Society and of course Enterprise Search Europe (disclaimer: I’m on the programme committee for the latter). My view is this means we get a higher quality set of talks, presenters who know and can discuss their subject rather than just reading out the company-approved Powerpoint deck, and attendees can see a wider range of views and options.

The death of enterprise search is reported, again

There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with Powerpoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.

Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem – “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.” I look forward to a brain-compatible indexing tool! There’s also the mistaken assumption that what works for the wider consumer-focused Web will work for the enterprise – Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.

Unfortunately, in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything, for example your boss’s salary) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard and few projects get beyond basic indexing of filestores and databases, let alone adding in more people-focused features.

I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this, preserving a consistent metadata model, coping with legacy formats, preserving full security and coping with updates then search should be relatively simple to implement on the resulting central store; however the devil is as ever in the detail.

Posted in News

October 25th, 2012

An open day on open source search from Sirius & Flax

We spent Friday at the riverside offices of Sirius Corporation, our support partners, for the first and hopefully not the last of their Open Days on open source enterprise search. We were lucky to have Mike Davis, a very well known and highly experienced analyst, to open the talks – despite suffering from flu he gave an engaging talk on why open source enterprise search software should be your first port of call, and how you should only consider closed source options when you need particular features they provide.

We then gave a quick Introduction to Open Source Search, detailing the various packages available (from Apache Lucene/Solr to Xapian and Sphinx) and showing a quick Solr-powered demo we’d built to search some pages from the BBC Music website. Using the programmer’s first choice for an example query (the ever reliable ‘foo*’) we discovered the wonderfully named Original Rabbit Foot Spasm Band – which interestingly you can’t find via the BBC’s own site search engine due to lack of wildcard support.

Andrew Savory, Sirius’ CTO and Apache Foundation member, then gave a presentation on what an Apache project actually is and how best to engage with an open source community – very useful for those considering open source for the first time. The morning finished with a delicious barbeque on the riverbank provided by Sirius. We thought the event went very well and we’d love to confirm the rumour that this will become a regular event. Thanks to all at Sirius for organising and hosting the day and we look forward to returning.

Enterprise Search Europe 2012 – Big Data, search surveys and some FUD from Google

I visited Enterprise Search Europe for the first day only last week, and caught a number of the presentations as well as giving one of my own (which I won’t discuss here but you’ll hear more about over the next few weeks). First up was Paul Doscher of Lucid Imagination with a lively presentation discussing whether search is either dead or now a commodity, or whether search on Hadoop is the new killer app for the emerging world of Big Data. We then had Kristian Norling from Findwise with some initial results from their survey on enterprise search – some interesting numbers here such as ‘18.5% of users are mostly/very satisfied with search’ and only ‘6% have a search strategy although 46% are planning one’ – we hear that Kristian is hoping to make the survey an annual one, which will be a great resource for anyone in the industry.

Matt Mullen, fuelled by diet cola, gave an introduction to search with a key point – that enterprise search usually performs a role within a workflow or task – a fact often ignored. Runar Buvik of Searchdaimon talked about a great resource he has developed comparing search engines, which can give some often amusing contrasts between different technologies, with some insisting there are no results for a particular query while others find thousands. I also enjoyed Emma Bayne and Donald Phillips’ polished presentation on the search facilities at the National Archives – interestingly, although Autonomy is currently powering their search, they are considering open source alternatives.

The day concluded with a presentation from Matt Eichner of Google, who turned up with their own film crew. You can read much of what he said at Computer World. I’m afraid I didn’t enjoy this presentation very much – it talked down to the audience and contained a lot of FUD around open source (surprising when Google uses and supports so much of it) – complete with sympathy-garnering pictures of babies in incubators and silly analogies about how one should prefer to fly in the airplane that cost the most. I hadn’t realised until his talk that the Google Search Appliance appears to be made of cheese!

It was great to network and catch up, and I hope next year to be able to attend the whole event. Thanks to all the organisers, especially Martin White of Intranet Focus.

Big Data – It’s not always big and it’s not always clever

There’s been a recent flurry of activity from search vendors (and those larger companies that have been buying them) around the theme of Big Data, which has become the fashionable marketing term for a sheaf of technologies including search, machine learning, MapReduce and scalability in general. If anyone impertinently asks why company X bought company Y the answer seems to be ‘because they have capability in Big Data and our customers will need this’.

Search companies like ours have been working with large datasets since the beginning – back in 1999/2000 the founders of Flax led a team to build a half-billion-page Web search engine, which as I recall ran on a cluster of 30 or so servers. Since then we’ve worked with other collections of tens or hundreds of millions of items. Even a relatively small company can have a few million files on their intranet, if you count all those emails, customer records and Powerpoint presentations. So yes, you could say we can do Big Data – we certainly know how to design and build systems that scale.

However, it makes me nervous when a set of technologies that could (in theory) be used together are simply lumped together for marketing purposes as the Next Big Thing. The devil is as always in the detail (and the integration) and it’s important to remember that just because you can fit all your data into a system doesn’t mean that system will help you make any sense of it. A recent term for unstructured data (which of course we search developers have been working with for decades) is Dark Data, which implies that it is mysterious and hidden – but that doesn’t mean it has any actual value. Those considering a Big Data project should be aware that in any computer system GIGO (garbage in, garbage out) is still an issue.

Posted in Business

May 11th, 2012

The Fall and rise of search in a world of Big Data – part 2

The theme of Big Data continued at the next conference I attended, the first Enterprise Search Europe held in London. There was a good mix of presentations ranging from the academic to the practical, my favourite probably being Martin Belam and colleague’s talk about using Solr to dynamically generate content for the new Guardian Books site. I was lucky enough to be able to talk about the real business benefits of open source search along with one of our customers, Stephen Wicks, CTO of Gorkana Group, which drew some interesting questions. We also ran a combined Meetup on the Monday evening, combining Enterprise Search Cambridge with Enterprise Search London.

There did seem to be a rather negative spin on search from many presenters – saying that search technology is misunderstood, more costly than expected, rarely works and hasn’t seen much recent innovation. Some of this is true – but I see this as an opportunity rather than a problem. There is more focus on the world of search now than before due to some high-profile acquisitions; people are questioning the value and capability of search technology. Those of us working at the cutting edge, delivering real working solutions, should perhaps take this opportunity to say that yes, it can be done, at a sensible cost, and it can deliver real business benefit. Perhaps as we move further into the world of Big Data we’ll realise the true value of effective search.

Posted in events

October 31st, 2011
