analyst – Flax

A lack of cognition and some fresh FUD from Forrester

Charlie Hull — Wed, 14 Jun 2017 09:05:36 +0000

Last night the estimable Martin White, intranet and enterprise search expert and author of many books on the subject, flagged up two surprising articles from Forrester who have declared that Cognitive Search (we’ll define this using their own terms in a little while) is ‘overshadowing’ the ‘outmoded’ Enterprise Search, with a final dig at how much better commercial options are compared to open source.

Let’s start with the definition, helpfully provided in another post from Forrester. Apparently ‘Cognitive search solutions are different because they: Scale to handle a multitude of data sources and types’. Every enterprise search engine promises to index a multiplicity of content both structured and unstructured, so I can’t see why this is anything new. Next we have ‘Employ artificial intelligence technologies….natural language processing (NLP) and machine learning’. Again, NLP has been a feature of closed and open source enterprise search systems for years, be it for entity extraction, sentiment analysis or sentence parsing. Machine learning is a rising star but not always easy to apply to search problems. However I’m not convinced either of these are really ‘artificial intelligence’. Astonishingly, the last point is that Cognitive solutions ‘Enable developers to build search applications…provide SDKs, APIs, and/or visual design tools’. Every search engine needs user applications on top and has APIs of some kind, so this makes little sense to me.

Returning to the first article, we hear that indexing is ‘old fashioned’ (try building a search application without indexing – I’d love to know you’d manage that!) but luckily a group of closed-source search vendors have managed to ‘out-innovate’ the open source folks. We have the usual hackneyed ‘XX% of knowledge workers can’t find what they need’ phrases plus a sprinkling of ‘wouldn’t it be nice if everything worked like Siri or Amazon or Google’ (yes, it would, but comparing systems built on multi-billion-page Web indexes by Internet giants to enterprise search over at most a few million, non-curated, non-hyperlinked business documents is just silly – these are entirely different sets of problems). Again, we have mentions of basic NLP techniques like they’re something new and amazing.

The article mentions a group of closed source vendors who appear in Forrester’s Wave report, which like Gartner’s Magic Quadrant attempts to boil down what is in reality a very complex field into some overly simplistic graphics. Finishing with a quick dig at two open source companies (Elastic, who don’t really sell an enterprise search engine anyway, and Lucidworks whose Fusion 3 product really is a serious contender in this field, integrating Apache Spark for machine learning) it ignores the fact that open source search is developing at a furious rate – and there are machine learning features that actually work in practise being built and used by companies such as Bloomberg – and because they’re open source, these are available for anyone else to use.

To be honest It’s very difficult, if not impossible, to out-innovate thousands of developers across the world working in a collaborative manner. What we see in articles like the above is not analysis but marketing – a promise that shiny magic AI robots will solve your search problems, even if you don’t have a clear specification, an effective search team, clean and up-to-date content and all the many other things that are necessary to make search work well (to research this further read Martin’s books or the one I’m co-authoring at present – out later this year!). One should also bear in mind that marketing has to be paid for – and I’m pretty sure that the various closed-source vendors now providing downloads of Forrester’s report (because of course, they’re mentioned positively in it) don’t get to do so for free.

UPDATE: Martin has written three blog posts in response to both Gartner and Forrester’s recent reports which I urge you (and them) to read if you really want to know how new (or not) Cognitive Search is.

The post A lack of cognition and some fresh FUD from Forrester appeared first on Flax.

A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access

Charlie Hull — Tue, 17 Feb 2015 11:25:26 +0000

Stephen Arnold, whose blog I enjoy due to its unabashed cynicism about overenthusiastic marketing of search technology, was kind enough to send me a copy of his recent report on CyberOSINT & Next Generation Information Access (NGIA), the latter being a term he has recently coined. OSINT itself refers to intelligence gathered from open, publically available sources, not anything to do with software licenses – so yes, this is all about the NSA, CIA and others, who as you might expect are keen on anything that can filter out the interesting from the noise. Let’s leave the definition (and the moral questionability) of ‘publically available’ aside for now – even if you disagree with its motives, this is a use case which can inform anyone with search requirements of the state of the art and what the future holds.

The report starts off with a foreword by Robert David Steele, who has had a varied and interesting career and lately has become a cheerleader for the other kind of open source – software – as a foundation for intelligence gathering. His view is that the tools used by the intelligence agencies ‘are also not good enough’ and ‘We have a very long way to go’. Although he writes that ‘the systems described in this volume have something to offer’ he later concludes that ‘This monograph is a starting point for those who might wish to demand a “full spectrum” solution, one that is 100% open source, and thus affordable, interoperable, and scalable.’ So for those of us in the open source sector, we could consider Arnold’s report as a good indicator of what to shoot for, a snapshot of the state of the art in search.

Arnold then starts the report with some explanation of the NGIA concept. This is largely a list of the common failings of traditional search platforms (basic keyword search, oft-confusing syntax, separate silos of information, lack of multimedia features and personalization) and how they might be addressed (natural language search, automatic querying, federated search, analytics). I am unconvinced this is as big a step as Arnold suggests though: it seems rather to imply that all past search systems were badly set up and configured and somehow a NGIA system will magically pull everything together for you and tell you the answer to questions you hadn’t even asked yet.

Disappointingly the exemplar chosen in the next chapter is Autonomy IDOL: regular readers will not be surprised by my feelings about this technology. Arnold suggests the creation of the Autonomy software was influenced by cracking World War II codes, rock music and artificial intelligence, which is in my mind adding egg to an already very eggy pudding, and not in step with what I know about the background of Cambridge Neurodynamics (Autonomy’s progenitor, created very soon after – and across the corridor from – Muscat, another Cambridge Bayesian search technology firm where Flax’s founders cut their teeth on search). In particular, Autonomy’s Kenjin tool – which automatically suggested related documents – is identified as a NGIA feature, although at the time I remember it being reminiscent of features we had built a year earlier at Muscat – we even applied for a patent. Arnold does note that ‘[Autonomy founder, Mike] Lynch and his colleagues clamped down on information about the inner workings of its smart software.’ and ‘The Autonomy approach locks down the IDOL components.’ – this was a magic black box of course, with a magically increasing price tag as well. The price tag rose to ridiculous dimensions (even after an equally ridiculous writedown) when Hewlett Packard bought the company.

The report continues with analysis of various other potential NGIA contenders, including Google-funded timeline analysis specialists Recorded Future and BAE Detica – interestingly one of the search specialists from this British company has now gone on to work at Elasticsearch.

The report concludes with a look at the future, correctly identifying advanced analytics as one key future trend. However this conclusion also echoes the foreword, with ‘The cost of proprietary licensing, maintenance, and training is now killing the marketplace. Open source alternatives will emerge, and among these may be a 900 pound gorilla that is free, interoperable and scalable.’. Although I have my issues with some of the examples chosen, the report will be very useful I’m sure to those in the intelligence sector, who like many are still looking for search that works.

The post A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access appeared first on Flax.

Searching for opportunities in Real-Time Analytics

Charlie Hull — Mon, 02 Feb 2015 17:18:22 +0000

I spent a day last week at a new event from UNICOM, a conference on Real-Time Analytics. Mike Ferguson chaired the event and was kind enough to spend time with me over lunch exploring how search software might fit into the mix, something that has been on my mind since hearing about the Unified Log concept a few weeks ago.

Real-Time Analytics is a field where sometimes vast amounts of data in motion is gathered, filtered, cleaned and analysed to trigger various actions to benefit a business: building on earlier capabilities in Business Intelligence, the endgame is a business that adapts automatically to changing conditions in real-time – for example, automating the purchasing of extra stock based on changing behaviour of customers. The analysis part of this chain is driven by complex models, often based on sets of training data. Complex Event Processing or CEP is an older term for this kind of process (if you’re already suffering from buzzword overflow, Martin Kleppman has put some of these terms in context for those more familiar with web paradigms). Tools mentioned included Amazon Kinesis and from the Apache stable Cassandra, Hadoop, Kafka, Yarn, Storm and Spark. I particularly enjoyed Michael Cutler‘s presentation on Tumra’s Spark-based system.

One of the central problems identified was due to the rapid growth of data (including from the fabled Internet of Things) it will shortly be impossible to store every data point produced – so we must somehow sort the wheat from the chaff. Options for the analysis part include SQL-like query languages and more complex machine learning algorithms. I found myself wondering if search technology, using a set of stored queries, could be used somehow to reduce the flow of this continuous stream of data, using something like this prototype implementation based on Apache Samza. One could use this approach to transform unstructured data (say, a stream of text-based customer comments) into more structured data for later timeline analysis, split streams of events into several parts for separate processing or just to watch for sets of particularly interesting and complex events. Although search platforms such as Elasticsearch are already being integrated into the various Real-Time Analytics frameworks, these seem to be being used for offline processing rather than acting directly on the stream itself.

One potential advantage is that it might be a lot easier for analysts to generate a stored search than to learn SQL or the complexities of machine learning – just spend some time with a collection of past events and refine your search terms, facets and filters until your results are useful, and save the query you have generated.

This was a very interesting introduction to a relatively new field and thanks to UNICOM for the invitation. We’re going to continue to explore the possibilities!

The post Searching for opportunities in Real-Time Analytics appeared first on Flax.

Cambridge Search Meetup – Elasticsearch Hackday

Charlie Hull — Fri, 03 Oct 2014 12:32:00 +0000

Last Friday we hosted a hackday featuring Elasticsearch in Cambridge, following a similar event last year focused on Apache Lucene/Solr. Around 20 people attended from organisations working in sectors including analytics, digital music, bioinformatics and e-commerce, and all the Flax team were there as well.

We started with a brief presentation on Elasticsearch and asked around the room for any data collections we might be able to use. Lee from Elasticsearch (the company) had brought collections of UK crime data and the complete works of Shakespeare; we also had several million rows of digital music metadata, Wikipedia edit data for all UK MPs (to follow last year’s theme!) and several years of data describing Premier League football. Unlike our Solr hackday where each team worked on the same general task, this time we split into four different teams who worked on all of the above except the Wikipedia edits. We’d also been provided with a very high-performance Elasticsearch cluster by BigStep for our use, which meant it was very quick to index the above data and start working with it.

By lunchtime (the food was sponsored by Elasticsearch, who also provided stickers, plush ELKs and lollypops – thanks guys!) we had some very basic information about the various datasets – such as which scene in which Shakespeare play has the most characters on stage (the answer is 21 in Richard III), and which football teams seemed to gain the most advantage from playing at home. Note that we had already moved beyond basic search functionality to use Elasticsearch as an analytic platform, answering particular questions, using features such as aggregations.

We continued during the afternoon to develop the various applications and finished with a ‘show and tell’. Some of the teams had managed to develop user interfaces for Elasticsearch, the most polished being a clickable Google Map that would show you which types of crime were significantly above and below the national average for the area you selected – unsurprisingly in Cambridge, stolen bicycles were very common! By the end of the day, everyone had gained experience of Elasticsearch, some for the first time. We finished the day, as is traditional, with a swift pint and further networking.

Thanks to Cambridge Business Lounge (a highly recommended co-working space) for the venue, BigStep for hosting and Elasticsearch for sponsoring lunch and providing the swag, and of course to all who attended. We’ll return with a further Cambridge Search Meetup soon!

The post Cambridge Search Meetup – Elasticsearch Hackday appeared first on Flax.

London Elasticsearch User Group – September Meetup

Charlie Hull — Thu, 04 Sep 2014 09:43:29 +0000

Last night I joined a good-sized crowd at a venue on Hoxton Square for some talks on Elasticsearch – this Meetup group is very popular and always attracts a good proportion of people new to the world of search, as well as some familiar faces. I started with a quick announcement of our own Elasticsearch hackday in a few weeks time.

First of the speakers was Richard Pijnenburg with a surprisingly brief talk on Puppet and Elasticsearch – brief, because integrating the two is apparently very simple, requiring only a few lines of Puppet code. Some questions from the floor sparked a discussion of combining Puppet and Vagrant for setting up Elasticsearch instances: apparently very soon we’ll see a complete demo instance of Elasticsearch built using these technologies and including some example data, which will be very useful for those wanting to get started with the engine (here’s some more on this combination).

Next was Amit Talhan, ably assisted by Geza Kerekes, both from AlignAlytics who have been using Elasticsearch both as a data store, reporting store and more recently for analysing data from a survey of all the retail outlets in Nigeria. Generating a wealth of data across up to 1000 fields, including geolocation data harvested every five seconds, this survey could have been difficult if not impossible to handle using a traditional SQL database, but many of their colleagues were very used to SQL syntax and methods for analyzing data. Amit and Geza explained how they have used Elasticsearch and in particular aggregations to provide functionality such as checking for bad reporting by surveyors and unexpectedly high density areas (such as markets, where there may be 200 retail outlets in a few square metres). One challenge seems to have been how to explain to colleagues from the data analysis community that Elasticsearch can provide some, but not all of the functionality of a traditional database, but that alternative ways of indexing and querying data can be used to solve the same problems. Interestingly, performance testing by AlignAlytics proved that BigStep, a provider of ‘bare metal’ cloud hosting, could provide much better performance than their own dedicated servers.

Next was Mark Harwood with another of his fascinating investigations into how Elasticsearch can be used for analysis of user behaviour, showing how after a bad personal experience buying a new battery that turned out to be second-hand, he identified Amazon.com vendors with suspiciously positive reviews. He also discussed how behaviour-based term suggesters might be built using Elasticsearch’s significant_terms aggregration. His demonstration did remind me slightly of Xapian’s relevance feedback feature. I heard several people later say that they wished they had time for some of the fun projects Mark seems to work on!

The event finished with some lively discussion and some free pizza courtesy of Elasticsearch (the company). Thanks to Yann Cluchey as ever for organising the event and I look forward to seeing a few of the attendees in Cambridge soon – we’re only an hour or so by train from Cambridge plus a ten minute walk to the venue, so it should be an easy trip!

The post London Elasticsearch User Group – September Meetup appeared first on Flax.

Analysts getting a bad press – how can they do better?

Charlie Hull — Wed, 30 Jul 2014 15:32:24 +0000

It seems to be a bad summer for analyst companies in several sectors: here’s Forrester getting a kicking from Digital Clarity Group about their Wave report on Digital Experience Delivery Platforms (my first challenge was understanding what on earth those are, but I think it’s a new shiny name for web content management), Nuix putting the boot into Gartner about their eDiscovery Magic Quadrant, and Stephen Few jumping up and down in hobnail boots on both analyst firms about Business Intelligence (insert your own joke here), complete with a not particularly enlightening reply from Forrester themselves.

Miles Kehoe has already taken a look at Gartner’s Magic Quadrant report on our own Enterprise Search sector. I’ve written before on how I don’t think open source solutions are particularly well treated by the large analyst firms, as they often focus on vendors only. The world has somewhat changed though and five of the seventeen vendors mentioned are using a base of open source technology, so at least some of this major part of the market is covered.

However the problem remains that the MQ ignores a great deal of the enterprise search sector: it doesn’t cover Sharepoint with its FAST-derived search facility, Oracle’s Endeca (which apparently is now no longer available as a standalone product, a surprise to me), Funnelback (which is again incorrectly labelled as open source – it’s the Squiz CMS software that’s open source, not the search engine they bought) or the rising star of Elasticsearch. If you were new to the sector you might conclude that none of these options are available to you. Gartner itself says “This Magic Quadrant introduces search managers and information architects in end-user organizations to the range of enterprise search vendors they can choose from” – but this range is severely and artificially restricted.

Let’s hope that the analyst firms take note of some of this bad press – perhaps it’s time to change approach, be more open about biases and methodologies, and stop producing hugely oversimplified diagrams to characterise complex and deep business sectors.

The post Analysts getting a bad press – how can they do better? appeared first on Flax.

As Hadoop gains, does Lucene benefit?

Charlie Hull — Thu, 27 Mar 2014 17:21:11 +0000

The last few weeks have seen a rush of investment in companies that offer Hadoop-powered Big Data platforms – the most recent being Intel’s investment in Cloudera, but Hortonworks has also snorted up $100m.

Gartner correctly explains that Hadoop isn’t just one project, but an ecosystem comprising an increasing number of open source projects (and some closed source distributions and add-ons). Once you’ve got your Big Data in a HDFS-shaped pile, there are many ways to make sense of it – and one of those is a search engine, so there’s been a lot of work recently trying to add Lucene-powered search engines such as Apache Solr and Elasticsearch into the mix. There’s also been some interesting partnerships.

I’m thus wondering whether this could signal a significant boost to the development of these search projects: there are already Lucene/Solr committers working at Hadoop-flavoured companies who have been working on distributed search and other improvements to scalability. Let’s hope some of the investment cash goes to search!

The post As Hadoop gains, does Lucene benefit? appeared first on Flax.

Why we won’t pay to play at conferences

Charlie Hull — Thu, 21 Mar 2013 14:59:28 +0000

One unedifying result of having been asked to speak on open source search at various events and conferences over the last few years is the discovery that not all events are equal – some genuinely wish to create a programme of interesting talks of value to the audience, and some simply wish to sell as much sponsorship as possible to those who would like to present. Some of the larger analyst firms are guilty of this behaviour – their Summits and Forums are often packed with talks by big-budget solution providers (and their industry sector reports similarly reflect the fact that if you pay, you play). At Flax we don’t have much budget for sponsorship so we’re often excluded, even though the talks we give are seldom if ever pushing any particular solution – a benefit of the open source model is that even if you hear about it from us you can still go and download and use the software yourself without paying us or anyone else a penny.

Luckily there are events that don’t work like this – the excellent Search Solutions day run in late Autumn by the British Computer Society and of course Enterprise Search Europe (disclaimer: I’m on the programme committee for the latter). My view is this means we get a higher quality set of talks, presenters who know and can discuss their subject rather than just reading out the company-approved Powerpoint deck, and attendees can see a wider range of views and options.

The post Why we won’t pay to play at conferences appeared first on Flax.

The death of enterprise search is reported, again

Charlie Hull — Thu, 25 Oct 2012 08:39:42 +0000

There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with Powerpoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.

Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem – “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.”. I look forward to a brain-compatible indexing tool! There’s also the misassumption that what works for the wider consumer-focused Web will work for the enterprise – Amazon.com, Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.

Unfortunately in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything, for example your bosses’ salary) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard and few projects get beyond basic indexing of filestores and databases, let along adding in more people-focused features.

I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this, preserving a consistent metadata model, coping with legacy formats, preserving full security and coping with updates then search should be relatively simple to implement on the resulting central store; however the devil is as ever in the detail.

The post The death of enterprise search is reported, again appeared first on Flax.

An open day on open source search from Sirius & Flax

Charlie Hull — Mon, 23 Jul 2012 15:26:39 +0000

We spent Friday at the riverside offices of Sirius Corporation, our support partners, for the first and hopefully not the last of their Open Days on open source enterprise search. We were lucky to have Mike Davis, a very well known and highly experienced analyst to open the talks – despite suffering from flu he gave an engaging talk on why open source enterprise search software should be your first port of call, and how you should only consider closed source options when you need particular features they provide.

We then gave a quick Introduction to Open Source Search, detailing the various packages available (from Apache Lucene/Solr to Xapian and Sphinx) and showing a quick Solr-powered demo we’d built to search some pages from the BBC Music website. Using the programmer’s first choice for an example query (the ever reliable ‘foo*’) we discovered the wonderfully named Original Rabbit Foot Spasm Band – which interestingly you can’t find via the BBC’s own site search engine due to lack of wildcard support.

Andrew Savory, Sirius’ CTO and Apache Foundation member, then gave a presentation on what an Apache project actually is and how best to engage with an open source community – very useful for those considering open source for the first time. The morning finished with a delicious barbeque on the riverbank provided by Sirius. We thought the event went very well and we’d love to confirm the rumour that this will become a regular event. Thanks to all at Sirius for organising and hosting the day and we look forward to returning.

The post An open day on open source search from Sirius & Flax appeared first on Flax.