Analysts getting a bad press – how can they do better?

It seems to be a bad summer for analyst companies in several sectors. Here’s Forrester getting a kicking from Digital Clarity Group about their Wave report on Digital Experience Delivery Platforms (my first challenge was understanding what on earth those are, but I think it’s a shiny new name for web content management), Nuix putting the boot into Gartner about their eDiscovery Magic Quadrant, and Stephen Few jumping up and down in hobnail boots on both analyst firms about Business Intelligence (insert your own joke here), complete with a not particularly enlightening reply from Forrester themselves.

Miles Kehoe has already taken a look at Gartner’s Magic Quadrant report on our own Enterprise Search sector. I’ve written before on how I don’t think open source solutions are particularly well treated by the large analyst firms, as they often focus only on commercial vendors. The world has changed somewhat, though, and five of the seventeen vendors mentioned build on a base of open source technology, so at least some of this major part of the market is covered.

However the problem remains that the MQ ignores a great deal of the enterprise search sector: it doesn’t cover SharePoint with its FAST-derived search facility, Oracle’s Endeca (which apparently is no longer available as a standalone product, a surprise to me), Funnelback (which is again incorrectly labelled as open source – it’s the Squiz CMS software that’s open source, not the search engine they bought) or the rising star of Elasticsearch. If you were new to the sector you might conclude that none of these options are available to you. Gartner itself says “This Magic Quadrant introduces search managers and information architects in end-user organizations to the range of enterprise search vendors they can choose from” – but this range is severely and artificially restricted.

Let’s hope that the analyst firms take note of some of this bad press – perhaps it’s time to change approach, be more open about biases and methodologies, and stop producing hugely oversimplified diagrams to characterise complex and deep business sectors.


Posted in Business

July 30th, 2014


Cambridge Search Meetup – Knowledge Discovery & Wayfinding

We were lucky enough to have two speakers from Cambridge text mining company Linguamatics at last night’s Meetup. Robin Newton kicked us off with an amusing and idiosyncratic view of the uses and misuses of search – starting with the observation that when you have text search software, every problem can look like one that search might solve. He gave an example of his recent search for a new job: matching his skills on paper with a potential employer’s needs is one thing, but he also wants to be sure the employer ‘isn’t a crook’! With reference to Tyler Tate’s talks on Information Wayfinding, which in turn quote urban planner Kevin Lynch, Robin told us how he felt that search ‘journeys’ weren’t always the most efficient way to discover an answer: his assertion was that finding a person who could tell you was more useful. Since even in the most efficient and well-run organisation not all information is held in documents, one might agree that finding an ‘expert’ is the best way to get the answers one needs. He finished with a welcome message that informal networking in pubs and cafes (much like our Meetup) helps share a lot of very useful information – and this is how he eventually decided that Linguamatics was going to be a great place to work.

Next was CTO and co-founder of Linguamatics, Dr David Milward, who described his company’s capability in text mining, Natural Language Processing (NLP) and search. He described the challenges of extracting ‘concepts’ from text – how words and acronyms with multiple potential meanings are difficult to parse automatically without contextual knowledge. Linguamatics’ approach has been described as ‘Agile NLP’ and allows the quick development of new patterns for concept extraction. A powerful example he gave showed how, by specifying a relationship between two entities (in this case one company acquiring another), structured data can be extracted from unstructured text. Other examples focused on the medical and bioscience field (a particular interest of ours at present due to the upcoming BioSolr project) and showed how their software can cluster facts and find connections between disparate pieces of data (‘which X relates to Y via Z’). This process can also be used to generate new facets for searching from free text, including for numeric ranges, and these can even be tailored for different user groups. It’s clear that Linguamatics are experts in this area and David’s talk was of great interest to many in the room, including several from the European Bioinformatics Institute.

We finished with the usual chat, networking and drinks. Thanks to both our speakers – and do let me know if you have a suggestion for a presentation at a future event!

Why GCloud search is badly broken & how to fix it

The GCloud initiative and the associated CloudStore are a great idea – aiming to level the playing field of UK government IT supply, take advantage of flexible and agile delivery of software and services, and help SMEs like ourselves compete against the large System Integrators (SIs) that dominate this market. GCloud sales have now reached £154m, although this is still a fraction of what the UK government spends on IT. We’re on GCloud 5 ourselves, by the way, so I have a vested interest in helping potential customers find us, and we’ve helped with government systems before.

Unfortunately the CloudStore itself has a search facility that is badly broken. There are several obvious issues: many of the entries created by the larger suppliers have been keyword stuffed – here’s a particularly egregious example from Atos which seems to include most of the terms used in software in the last few years. I found this using the search terms ‘enterprise search’, which produced very few relevant-looking results. The online guidance for CloudStore search suggests putting double quotes around my terms (sadly I think few users will think of this), which improves things a little, but there are still a lot of irrelevant results – an online conferencing system is fifth, for example.

Fortunately all is not lost, and in the next iteration of GCloud we are promised major improvements to the search engine. I’m hoping this will include phrase boosting. However, if the big SIs and others are allowed to create the sort of bad-quality content I have shown above, no search engine in the world will be able to sort the wheat from the chaff. It is essential that CloudStore entries are subject to some kind of curation and that keyword stuffing is banned and/or heavily penalised; otherwise SMEs like ourselves will still find it very hard to compete with the big SIs.

Update: it seems there is a new system under construction, and the search works a lot better. Let’s hope it comes out of alpha soon and can be used by purchasers!


Posted in Business, Technical

June 26th, 2014


BioSolr – building better search for bioinformatics

The entire Flax technical team spent the day at the European Bioinformatics Institute yesterday discussing an exciting new project we’ll begin this coming September, BioSolr. Funded by the BBSRC, this collaboration between Flax and the EBI aims “to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software”. Here we are with Dr. Sameer Valenkar and Gautier Koscielny of the EBI.

The EBI, located on the Wellcome Trust Genome Campus near Cambridge, maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, and is already using Apache Lucene/Solr extensively – for example in the Protein Data Bank in Europe, which indexes over 100,000 items derived from experimental research, and that is just one of the many complex collections they provide. The BioSolr project will run for a full year, during which members of the Flax team will work directly with the EBI team to run workshops, demonstrate and document best practices in search application design, create, improve and extend open source software, and learn a lot about the specialist search requirements of bioinformatics. This is a fantastic opportunity for us to push the boundaries of what is possible with Solr and associated software, to work with some incredibly rich data and to do all of this in the open to encourage collaboration from the wider software and biology communities.

We’ll be creating various open resources (software repositories, Wikis, blogs) to support the project later this year – do let us know if you would like to be involved and we will keep you informed.

Searching for IP addresses in text with Elasticsearch

We recently implemented a search solution for a customer using Elasticsearch. Most of their requirements were fairly standard; however, they also wanted to be able to search for IP addresses embedded in the document text, using a flexible and precise search syntax. For example, given the following document fragment:

    ... the API can be accessed at 167.87.3.201 on port 8700 ...

the following searches should all find the document:

  167.87.3.201
  *.87.3.201
  *.87.*.201
  167.[80-100].3.*
  etc.

While it would have been possible to implement the multiple wildcard requirement with Elasticsearch/Lucene regular expression queries, there is no simple way to handle the numeric range requirement without constructing some fairly complex regexps. Furthermore, regular expression queries can be slow to run (depending on the complexity of the expression and the size of the index), and this application had a large index.
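To give a flavour of the problem, here is a rough sketch (in Python, building the query body only; untested, and assuming the IP addresses survive analysis as single terms in a hypothetical “text” field) of what just the range part of 167.[80-100].3.* forces you to write as a Lucene regexp:

  import json

  # The numeric range [80-100] has to be spelled out digit by digit in a
  # Lucene regexp: 80-89, 90-99 or exactly 100.
  range_80_100 = "(8[0-9]|9[0-9]|100)"

  # Hypothetical regexp query for the user expression 167.[80-100].3.*
  # (the "text" field name is an assumption; \\. escapes the literal dots)
  query = {
      "query": {
          "regexp": {
              "text": "167\\." + range_80_100 + "\\.3\\.[0-9]{1,3}"
          }
      }
  }

  print(json.dumps(query, indent=2))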

The obvious thing to do here is to parse each IP address into its four component numbers and index them into separate numeric fields, e.g.:

  {
    "ip1": 167,
    "ip2": 87,
    "ip3": 3,
    "ip4": 201,
    "text": "the API can be ..."
  }

Then, user queries such as “167.[80-100].3.*” can be parsed into an Elasticsearch query:

  {
    "query": {
      "bool": {
        "must": [
          { "term": { "ip1": 167 }},
          { "range": { "ip2": { "from": 80, "to": 100 }}},
          { "term": { "ip3": 3 }}
        ]
      }}}

(please note that these queries are for illustrative purposes only, and are untested).

Unfortunately, this approach fails when there is more than one IP address per document (as there generally was in this case), since if multiple values exist for the ipN fields the relationship between the components of each address is lost. For example, a document containing:

    ... servers at 167.133.88.1 and 176.90.3.10 are load balanced ...

would spuriously match the user query above, despite the fact that neither IP address matches the query exactly. One possibility would be to use dynamic fields to index each address to a different set of fields:

  {
    "ip1_1": 167,
    "ip2_1": 133,
    "ip3_1": 88,
    "ip4_1": 1,
    "ip1_2": 176,
    "ip2_2": 90,
    "ip3_2": 3,
    "ip4_2": 10,
  }

However, queries would have to cover all possible IP fields with repeated OR subqueries, which would quickly become ugly and unmanageable.
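To make that concrete, here is a rough, untested sketch (as a Python dict, with field names following the example above) of what the query for 167.[80-100].3.* already looks like with only two address slots per document – and you would need another branch for every additional slot you wanted to allow:

  # Hypothetical query against the dynamic-field scheme: one "should" branch
  # per possible address slot, each repeating the same term/range clauses.
  query = {
      "query": {
          "bool": {
              "should": [
                  {"bool": {"must": [
                      {"term": {"ip1_1": 167}},
                      {"range": {"ip2_1": {"from": 80, "to": 100}}},
                      {"term": {"ip3_1": 3}},
                  ]}},
                  {"bool": {"must": [
                      {"term": {"ip1_2": 167}},
                      {"range": {"ip2_2": {"from": 80, "to": 100}}},
                      {"term": {"ip3_2": 3}},
                  ]}},
              ],
              "minimum_should_match": 1,
          }
      }
  }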

Luckily, Elasticsearch nested documents provide exactly the mechanism we need to preserve the IP address structure within the main document (Solr does too, though this post does not go into the details). This is most easily explained with a JSON example with two IP addresses:

  {
    "text": "Lorem ipsum dolor sit amet, ei impetus persecuti eam...",
    "ipaddr" : [
      {
        "ip1": 167,
        "ip2": 133,
        "ip3": 88,
        "ip4": 1
      },
      {
        "ip1": 176,
        "ip2": 90,
        "ip3": 3,
        "ip4": 10
      }
    ]
  }

This requires a declaration of the ipaddr type as “nested” in the index mapping:

  ...
  "mappings": {
    "document": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "standard"
        },
        "ipaddr" : {
          "type" : "nested"
        },
        ...
      }}}

The child documents are created by the indexer script, which uses a regular expression to find all IP addresses in the document content and parses them into separate numbers (a sketch of this appears below). IP addresses can then be searched for using the nested query type, e.g.:

  {
    "nested" : {
      "path" : "ipaddr",
      "query" : {
        "bool": {
            "must": [
              { "term": { "ip1": 167 }},
              { "range": { "ip2": { "from": 80, "to": 100 }}},
              { "term": { "ip3": 3 }}
            ]}}}}

This query selects parent documents containing at least one ipaddr child document which matches the query. Internally, children are stored as separate documents from parents, but the join is done transparently and extremely fast.
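As an illustration of the indexing side (a minimal sketch, not the actual indexer script – the regular expression, the index and type names and the use of the elasticsearch Python client are all assumptions):

  import re
  from elasticsearch import Elasticsearch

  # Simple dotted-quad pattern; deliberately loose (no octet range validation)
  IP_PATTERN = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")

  def build_document(text):
      """Return a document with one nested ipaddr child per IP address found."""
      ipaddrs = []
      for match in IP_PATTERN.finditer(text):
          ip1, ip2, ip3, ip4 = (int(octet) for octet in match.groups())
          ipaddrs.append({"ip1": ip1, "ip2": ip2, "ip3": ip3, "ip4": ip4})
      return {"text": text, "ipaddr": ipaddrs}

  es = Elasticsearch()  # assumes a local node
  doc = build_document("servers at 167.133.88.1 and 176.90.3.10 are load balanced")
  es.index(index="documents", doc_type="document", body=doc)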

Nested queries can, of course, be combined with text queries etc. The application we built for the client (in AngularJS and Python/Flask) parses user queries to extract IP query expressions and builds combined text, boolean and nested queries to implement the required search logic.
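As a rough sketch of the query-parsing side (again untested, and not the client’s actual code), one way to turn an expression like 167.[80-100].3.* into the nested query shown above might look like this – note that depending on the Elasticsearch version you may need the full field path (e.g. ipaddr.ip1) inside a nested query:

  import re

  RANGE_PATTERN = re.compile(r"\[(\d+)-(\d+)\]$")

  def clause_for(field, token):
      """Map one octet expression to a term or range clause; None means wildcard."""
      if token == "*":
          return None
      range_match = RANGE_PATTERN.match(token)
      if range_match:
          low, high = range_match.groups()
          return {"range": {field: {"from": int(low), "to": int(high)}}}
      return {"term": {field: int(token)}}

  def ip_query(expression):
      """Build a nested query from an expression such as '167.[80-100].3.*'."""
      must = []
      for position, token in enumerate(expression.split("."), start=1):
          clause = clause_for("ip%d" % position, token)
          if clause is not None:
              must.append(clause)
      return {"nested": {"path": "ipaddr", "query": {"bool": {"must": must}}}}

  print(ip_query("167.[80-100].3.*"))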

One slight problem with this approach is that IP addresses are not included in any highlighted summaries generated by Elasticsearch as part of search results. This is because the match is made against the nested numeric fields rather than the text itself, so the highlighter does not know where in the text the matching IP address is. There is no simple way around this, so to generate highlighted search summaries we used our own standalone highlighter component, extending it to ‘understand’ the IP query syntax. This code is Apache 2 licensed and is free to download and use.

To sum up, this post outlines how we used Elasticsearch’s nested document type to implement a flexible and fast IP address search syntax. Of course, the same approach could be used to search any other type of structured entity in document text, such as social security numbers, ISBNs etc.

How not to predict the future of search

I’ve just seen an article titled Enterprise Search: 14 Industry Experts Predict the Future of Search which presents a list of somewhat contradictory opinions. I’m afraid I have some serious issues with the experts chosen and the undeniably blinkered views some of them have presented.

Firstly, if you’re going to ask a set of experts to write about Enterprise Search, don’t choose an expert in SEO as part of your list. SEO is not Enterprise Search; in fact, a lot of the time it isn’t anything at all (except snake oil) – it’s a way of attempting to game the algorithms of web search engines. Secondly, at least make some attempt to prevent your experts from simply listing the capabilities of their own companies in their answers: in fact one ‘expert’ was actually a set of PR-friendly answers from a company rather than a person, including a list of articles about their own software. The expert from Microsoft rather predictably failed to notice the impact of open source on the search market, before going on to put a positive spin on the raft of acquisitions of search companies over the last few years (and it’s certainly not all good, as a recent writedown has proved). Apparently the acquisition of specialist search companies by corporate behemoths will drive innovation – that is, unless that specialist knowledge vanishes into the behemoth’s Big Data strategy, never to be seen again. Woe betide the past customers who have to get used to a brand new pricing, availability and support plan as well.

Luckily it wasn’t all bad – there were some sensible viewpoints on the need for better interaction with the user, the rise of semantic analysis and how the growth of open source is driving out inefficiency in the market – but the article is absolutely peppered with buzzwords (Big Data being the most prevalent, of course) and contains some odd clichés: “I think a generation of people believes the computer should respond like HAL 9000”… didn’t HAL 9000 kill most of the crew and attempt to lock the survivor outside the airlock?

I’m pretty sure this isn’t a feature we want to replicate in an Enterprise Search system.


Posted in News

May 15th, 2014


Cambridge Search Meetup – Cassandra & Solr

A sunny evening last night for the latest Cambridge Search Meetup, which featured a couple of talks from DataStax on the highly scalable NoSQL database Apache Cassandra and how it is integrated with Apache Lucene/Solr. Jeremy Hanna started us off with a brief history of the Facebook-incubated Cassandra, a fully distributed, highly reliable system used by many including Netflix and Spotify, with some customers running thousands of nodes across multiple data centres. Cassandra has its own SQL-like language, CQL3, and some basic collections such as Lists and Maps, but due to its fully distributed nature it lacks some traditional features such as JOINs. DataStax themselves are now responsible for most of the ongoing work on Cassandra and offer the usual array of training, support, management services and tools. One common application mentioned was high-speed and reliable recording of sensor data, increasingly important now with the rise of the Internet of Things.

After a short break for drinks and snacks (which this time were kindly sponsored by DataStax), Sergio Bossa told us how Solr is integrated with Cassandra, also running in a distributed fashion. Interestingly, this integration doesn’t use the same ZooKeeper system as SolrCloud (the standard way to run clusters of Solr servers) but relies instead on Cassandra’s own internal scaling systems, passing data about using ‘gossip’ between nodes. ZooKeeper is not always the easiest thing to get running, so an alternative is very interesting! Data can be added to the system over HTTP or the aforementioned CQL3 and, after being entered into Cassandra’s tables, is subsequently indexed by Solr. Queries can then be made over HTTP as usual. Some work is still necessary to prevent duplication of effort (at present one needs to create data structures in Cassandra and subsequently in Solr).

It was pleasing to see that so much care has been taken with this integration process, and also that DataStax offer their DataStax Enterprise Search stack not only free for non-production use, but free to startups. Thanks to Jeremy, Sergio and all who came along, and we’ll be back with another Search Meetup soon.

Enterprise Search Europe 2014 day 2 – futures, text mining and images

Staying over in London due to the aforementioned tube strike proved to be a good idea, and a large fried breakfast an even better one, so I arrived at the second day of the conference right on time, ready for the keynote by Jeff Fried of BA Insight and Professor Elaine Toms from Sheffield University, who hadn’t met before the event but spoke in turn on the Future of Search. Jeff’s expert and challenging view included some depressing statistics (only 4-5% of search projects succeed completely) and a description of an all-too-familiar ‘Search Immaturity Cycle’ – buy search technology, build application, discover it’s failing, attempt to work out why, and then give up and try a new search technology. The positive side of his argument was that real progress has been made – search has a much lower TCO (in part due to the rise of open source), is more widely used and is far easier to administer and run. He also mentioned some groundbreaking projects attempting to ‘understand the world’ that should inspire better enterprise search – IBM’s Watson and Wolfram Alpha.

After a brief but friendly argument with Jeff about sport(s), Elaine took over with the view from academia – describing the various academic disciplines linked to search and how they sometimes fail to link up, and how we have attempted, with limited success, to transplant a highly structured way of dealing with data onto the essentially unstructured real world, without taking proper notice of the wide variety of contexts (i.e. the myriad influences acting on people in the working environment that affect their information needs). She told us that we should work towards ‘providing the right information at the point of decision making’ and must identify the work task we are trying to assist, developing small and single-purpose applications based on search. Jeff, returning to the stage, told us that search itself will disappear as the pure functionality becomes ubiquitous and invisible. I’m not sure I agree about intelligent assistants though – I thought we’d killed that idea a long time ago (and Autonomy never had much luck with their Kenjin application – I was working on something similar at the time).

Next I popped in to hear Michael Upshall talk about the various text mining methods available and how they were investigated for CABI, including an interesting project, Plantwise, which allows farmers to find out which pest might affect their crops. I missed the next talk as I had some work to catch up on, but returned to hear Dr. Haiming Liu list various multimedia search resources, some better than others: as she said, there’s a large ‘semantic gap’ with most of these services and they work best in constrained domains. The final presentation of the day for me was from Martin Dotter and Olaf Peters about a large-scale project to develop content processing for Airbus’ enterprise search engine – again, the scale was very impressive, with over 80,000 users and 4000 business applications in Airbus’ IT landscape. They described how they had developed a detailed process for gathering data from all the various content repositories and owners, resulting in a 44 million document index.

I had to leave before the last panel unfortunately so missed Jeff and Elaine’s re-take on the future of search. This year’s event was in my view the best since Enterprise Search Europe began: some great talks, informative and friendly networking and flawless organisation. Thanks to everyone involved and see you next time! Remember most of the slides are available here.

Enterprise Search Europe 2014 day 1 – Decisions, research and a Meetup quiz

This year’s Enterprise Search Europe was held near Victoria train station in London and unfortunately coincided with a two-day strike on the London Underground – worrying for the organisers, but apart from a few notable absences it didn’t seem to affect the attendance too much. We started with a keynote from Dale Roberts, whose book on Decision Sourcing inspired a talk about a ‘rational decision making model’. When examining traditional relational database applications, Dale said ‘if you peer at it long enough you can see the rows and columns’, and his point was that modern consumer social networking applications don’t exhibit this old pattern – so this is where search application designers should look for inspiration. His co-presenter Rooven Pakkiri said that Enterprise Search should attempt to ‘release the information from inside our heads’, which of course social networking might help with, connecting you with colleagues. I’m not sure that one can easily take lessons learnt from consumer applications and apply them to business use, and some later speakers agreed with me, but this was a high-energy and thought-provoking start.

Next I chaired the Open Source track, where we started with Cedric Ulmer of France Labs, who talked about a search application they built for a consultancy business with around 40 employees. Using Apache Solr, Apache ManifoldCF and their own Datafari open source framework, they turned this project around very quickly – interestingly, the end clients needed no training to use the new system, which implies a very well designed UI. Our second talk, from Ronald Hobbs of Reed Business International, described a project on a much larger scale: 100 million documents, 72 business units and up to 190 queries per second. This was originally served by the FAST ESP engine but they moved to an Apache Solr system, replacing the FAST processing pipeline with Search Technologies’ Aspire project. His five steps for an effective migration (Prepare, Get the right tools, Get the right team, Migrate in chunks, Clean up) I can only agree with from our own experience of such projects, including one from FAST ESP to Solr. I was amused by his description of the Apache ZooKeeper project as ‘a bipolar manic depressive’, although it seemed this was eventually overcome with a successful deployment on Amazon EC2. Next was Galina Hinova of Intrafind on an aftersales search application for MAN Truck and Bus – again at serious scale (MAN have around 1 billion vehicles in existence with 100-150 documents related to each). Interestingly, the Euro6 regulations for emissions and standardized EU terms for automobile parts were direct drivers of the project, with Apache Lucene as the base technology. No longer is open source search just for small-scale projects, it seems!

After a short break, during which I chatted to John Newton, founder of Documentum and Alfresco, and his team, we returned to hear Dan Jackson give a description of how UCL had improved their website search – with a chaotic mix of low-quality content and an ‘awful’ content management system, the challenges were myriad, but with the help of experts such as our associate Tony Russell-Rose they have made significant improvements. Next was what was to prove a very popular talk from Nick Brown of AstraZeneca on a huge, well-funded project to build applications to support research and development – again, this was at large scale, with 75 million documents (including ‘all the patents and all the research papers’). The key here was their creation of many well-targeted ‘apps’ to enable particular uses of the Sinequa search engine they chose for the back end, including mobile apps to help find others in the company (or external to it) who are also working on a particular drug or disease. This presentation showed just what can be achieved if companies really understand the potential of search technology – knowledge sharing and discovery of previously unknown information.

After a short drinks reception we retired to a nearby pub for the combined Cambridge and London Search Meetup – I’d prepared a short quiz (feel free to have a go!) which was won by Tony Russell-Rose’s team. Networking and chatting continued long into the evening, with some people from the wider UK search community also attending.

To be continued! You can see most of the slides here.

ISKO UK – Taming the News Beast

I spent yesterday afternoon at UCL for ISKO UK’s event on Taming the News Beast – I’m not sure if we found out how to tame it but we certainly heard how to festoon it with metadata and lock it up in a nice secure ontology. There were around 90 people attending from news, content, technology and academic organisations, including quite a few young journalism students visiting London from Missouri.

The first talk was by Matt Shearer of BBC News Labs, who described how they are working on automatically extracting entities from video/audio content (including verbatim transcripts, contributors using face/voice recognition, objects using audio/image recognition, topics, actions and non-verbal events such as clapping). Their prototype ‘Juicer’ extractor currently works with around 680,000 source items and applies 5.7 million tags – which represents around 9 man-years of work for a manual tagger. They are using Stanford NLP and DBpedia heavily, as well as an internal BBC project, ‘Mango’ – I hope that some of the software they are developing is eventually open sourced, as after all this is a publicly-funded broadcaster. His colleague Jeremy Tarling was next and described a News Storyline concept they had been working on as a new basis for the BBC News website (which apparently hasn’t changed much in 17 years, and still depends on a lot of manual tagging by journalists). The central concept of a storyline (e.g. ‘US spy scandal’) can form a knowledge graph, linked to events (‘Snowden leaves airport’), videos, ‘explainer’ stories, background items etc. Topics can be used to link storylines together. This was a fascinating idea, well explained, and something other news organisations should certainly take note of.

Next was Rob Corrao of LAC Group describing how they had helped ABC News revolutionize their existing video library which contains over 2 million assets. They streamlined the digitization process, moved little-used analogue assets out of expensive physical storage, re-organised teams and shift patterns and created a portal application to ease access to the new ‘video library as a service’. There was a focus on deep reviews of existing behaviour and a pragmatic approach to what did and didn’t need to be digitized. This was a talk more about process and management rather than technology but the numbers were impressive: at the end of the project they were handling twice the volume with half the people.

Ian Roberts from the University of Sheffield then described AnnoMarket, a cloud-based market platform for text analytics, which wraps the rather over-complex open source GATE project in an API with easy scalability. As they have focused on precision over recall, AnnoMarket beats other cloud-based NLP services such as OpenCalais and TextRazor in terms of accuracy, and can process impressive volumes of documents (10 million in a few hours was quoted). They have developed custom pipelines for news, biomedical and Twitter content with the former linked into the Press Association’s ontology (PA is a partner in AnnoMarket). For those wanting to carry out entity extraction and similar processes on large volumes of content at low cost AnnoMarket certainly looks attractive.

Next was Pete Sowerbutts of PA on the prototype interface he had helped develop for tagging all of PA’s 3000 daily news stories with entity information. I hadn’t known how influential PA is in the UK news sector – apparently 30% of all UK news is a direct copy of a PA feed, and they estimate 70% is influenced by PA’s content. The UI showed how entities that have been automatically extracted can be easily confirmed by PA’s staff, allowing for confirmation that the right entity is being used (the example being Chris Evans, who could be a UK MP, a television personality or an American actor). One would assume the extractor produces some kind of confidence measure, which raises the question of whether every single entity must be manually confirmed – but then again, PA must retain their reputation for high quality.

The event finished with a brief open discussion featuring some of the speakers on an informal panel, followed by networking over drinks and snacks. Thanks to all at ISKO, especially Helen Lippell, for organising what proved to be a very interesting day.