Posts Tagged ‘News’

Rebrands and changing times for Elasticsearch

I’ve always been careful to distinguish between Elasticsearch (the open source search server based on Lucene) and Elasticsearch (the company formed by the authors of the former), and it seems someone was listening, as the latter has now rebranded as simply Elastic. This was one of the big announcements at their first conference; the other was that, after acquiring the Norwegian company Found, they are now offering a fully hosted Elasticsearch-as-a-service (congratulations to Alex and others at Found!). As Ben Kepes of Forbes writes, this may have something to do with ‘managing tensions within the ecosystem’ (I’ve written previously on how this ecosystem is expanding to include closed-source commercial products, which may make open source enthusiasts nervous), but it’s also an attempt to move away from ‘search’ into a wider area encompassing the buzzwords du jour of Big Data Analytics.

In any case, it’s clear that Elastic (the company, and that’s hopefully the last time I’ll have to write this!) have a clear strategy for the future – to provide many different commercial options for Elasticsearch and its related projects for as many different use cases as possible. Of course, you can still take the open source route, which we’re helping several clients with at present – I hope to be able to present a case study on this very soon.

Meanwhile, Martin White has identified how a recent book on Elasticsearch describes literally hundreds of features and that ‘The skill lies in knowing which to implement given the nature of the content and the type of query that will be used’ – effective search, as ever, remains a difficult thing to get right, no matter what technology option you choose.

UPDATE: It seems that, the website for the open source project, is now redirecting to the commercial company website… there is now a new GitHub page for open source code at


Posted in News

March 11th, 2015


How not to predict the future of search

I’ve just seen an article titled Enterprise Search: 14 Industry Experts Predict the Future of Search which presents a list of somewhat contradictory opinions. I’m afraid I have some serious issues with the experts chosen and the undeniably blinkered views some of them have presented.

Firstly, if you’re going to ask a set of experts to write about Enterprise Search, don’t choose an expert in SEO as part of your list. SEO is not Enterprise Search, in fact a lot of the time it isn’t anything at all (except snake oil) – it’s a way of attempting to game the algorithms of web search engines. Secondly, at least make some attempt to prevent your experts from just listing the capabilities of their own companies in their answers: in fact one ‘expert’ was actually a set of PR-friendly answers from a company rather than a person, including listing articles about their own software. The expert from Microsoft rather predictably failed to notice the impact of open source on the search market, before going on to put a positive spin on the raft of acquisitions of search companies over the last few years (and it’s certainly not all good, as a recent writedown has proved). Apparently the acquisition of specialist search companies by corporate behemoths will drive innovation – that is, unless that specialist knowledge vanishes into the behemoth’s Big Data strategy, never to be seen again. Woe betide the past customers that have to get used to a brand new pricing, availability and support plan as well.

Luckily it wasn’t all bad – there were some sensible viewpoints on the need for better interaction with the user, the rise of semantic analysis and how the rise of open source is driving out inefficiency in the market – but the article is absolutely peppered with buzzwords (Big Data being the most prevalent, of course) and contains some odd clichés: “I think a generation of people believes the computer should respond like HAL 9000”… didn’t HAL 9000 kill most of the crew and attempt to lock the survivor outside the airlock?

I’m pretty sure this isn’t a feature we want to replicate in an Enterprise Search system.


Posted in News

May 15th, 2014


ISKO UK – Taming the News Beast

I spent yesterday afternoon at UCL for ISKO UK’s event on Taming the News Beast – I’m not sure if we found out how to tame it but we certainly heard how to festoon it with metadata and lock it up in a nice secure ontology. There were around 90 people attending from news, content, technology and academic organisations, including quite a few young journalism students visiting London from Missouri.

The first talk was by Matt Shearer of BBC News Labs, who described how they are working on automatically extracting entities from video/audio content (including verbatim transcripts, contributors using face/voice recognition, objects using audio/image recognition, topics, actions and non-verbal events such as clapping). Their prototype ‘Juicer’ extractor currently works with around 680,000 source items and applies 5.7 million tags – which represents around 9 man years of work for a manual tagger. They are using Stanford NLP and DBpedia heavily, as well as an internal BBC project ‘Mango’ – I hope that some of the software they are developing is eventually open sourced, as after all this is a publicly funded broadcaster.

His colleague Jeremy Tarling was next and described a News Storyline concept they had been working on as a new basis for the BBC News website (which apparently hasn’t changed much in 17 years, and still depends on a lot of manual tagging by journalists). The central concept of a storyline (e.g. ‘US spy scandal’) can form a knowledge graph, linked to events (‘Snowden leaves airport’), videos, ‘explainer’ stories, background items etc. Topics can be used to link storylines together. This was a fascinating idea, well explained, and something other news organisations should certainly take note of.
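The Juicer pipeline itself isn’t public, but the core idea of automatic tagging – scan content, emit entity tags that map to identifiers such as DBpedia URIs – can be illustrated with a toy gazetteer lookup. To be clear, this is a sketch of the concept only: the entity IDs below are invented for illustration, and the real system uses statistical NLP rather than a flat dictionary.

```python
# Toy gazetteer-based entity tagger. A real pipeline (e.g. Stanford NLP)
# uses trained statistical models; here a plain dictionary lookup stands in
# for the "recognise surface forms, link them to entity IDs" step.
GAZETTEER = {
    "edward snowden": "dbpedia:Edward_Snowden",  # illustrative IDs, not real BBC data
    "bbc": "dbpedia:BBC",
    "moscow": "dbpedia:Moscow",
}

def tag_entities(text):
    """Return sorted (surface_form, entity_id) pairs found in the text."""
    lowered = text.lower()
    found = [(surface, entity_id)
             for surface, entity_id in GAZETTEER.items()
             if surface in lowered]
    return sorted(found)
```

Run over a feed of stories, this is the kind of process that turns 680,000 source items into millions of tags without a human in the loop.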

Next was Rob Corrao of LAC Group, describing how they had helped ABC News revolutionize their existing video library, which contains over 2 million assets. They streamlined the digitization process, moved little-used analogue assets out of expensive physical storage, re-organised teams and shift patterns and created a portal application to ease access to the new ‘video library as a service’. There was a focus on deep reviews of existing behaviour and a pragmatic approach to what did and didn’t need to be digitized. This was a talk more about process and management than technology, but the numbers were impressive: at the end of the project they were handling twice the volume with half the people.

Ian Roberts from the University of Sheffield then described AnnoMarket, a cloud-based market platform for text analytics, which wraps the rather over-complex open source GATE project in an API with easy scalability. As they have focused on precision over recall, AnnoMarket beats other cloud-based NLP services such as OpenCalais and TextRazor in terms of accuracy, and can process impressive volumes of documents (10 million in a few hours was quoted). They have developed custom pipelines for news, biomedical and Twitter content, with the former linked into the Press Association’s ontology (PA is a partner in AnnoMarket). For those wanting to carry out entity extraction and similar processes on large volumes of content at low cost, AnnoMarket certainly looks attractive.

Next was Pete Sowerbutts of PA, who described the prototype interface he had helped develop for tagging all of PA’s 3000 daily news stories with entity information. I hadn’t known how influential PA is in the UK news sector – apparently 30% of all UK news is a direct copy of a PA feed, and they estimate 70% is influenced by PA’s content. The UI showed how entities that have been automatically extracted can be easily confirmed by PA’s staff, allowing for confirmation that the right entity is being used (the example being Chris Evans, who could be a UK MP, a television personality or an American actor). One would assume the extractor produces some kind of confidence measure, which raises the question of whether every single entity must be manually confirmed – but then again, PA must retain their reputation for high quality.
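If the extractor does produce a confidence measure, the obvious compromise is to auto-accept high-confidence tags and queue only the ambiguous ones (like ‘Chris Evans’) for a human. This is purely a hypothetical sketch of that routing step – the threshold, data shapes and entity IDs are all invented for illustration, not taken from PA’s system:

```python
def route_for_review(candidates, threshold=0.9):
    """Split (surface, entity_id, confidence) tuples into auto-accepted
    tags and a queue needing human confirmation."""
    accepted, review = [], []
    for surface, entity_id, confidence in candidates:
        target = accepted if confidence >= threshold else review
        target.append((surface, entity_id))
    return accepted, review
```

Raising the threshold trades editorial effort for quality – which is exactly the balance a wire service with PA’s reputation has to strike.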

The event finished with a brief open discussion featuring some of the speakers on an informal panel, followed by networking over drinks and snacks. Thanks to all at ISKO, especially Helen Lippell, for organising what proved to be a very interesting day.

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model, there are a few other examples of this approach: Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years, but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem and in many cases, simply unnecessary.
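To show how un-magical this all is, here is a minimal sketch of the BM25 weighting mentioned above – the textbook Okapi formula (with a non-negative IDF of the kind Lucene uses), not any particular engine’s implementation:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Score one tokenised document against a query with Okapi BM25.

    corpus is a list of tokenised documents, used for document
    frequencies and the average document length."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # non-negative IDF
        tf = doc_terms.count(term)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
    return score
```

Rank every document by this score and you have the core of a probabilistic search engine – a few lines of arithmetic over term frequencies, which is rather a long way from ‘understanding meaning’.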

It certainly isn’t magic.

Searching for (and finding) open source in the UK Government

There have been some very encouraging noises recently about increased use of open source software by the UK Government: for example we’ve seen the creation of an Open Source Procurement Toolkit by the Cabinet Office, which lists Xapian and Apache Lucene/Solr as alternatives to the usual closed source options. The CESG, the “UK Government’s National Technical Authority for Information Assurance”, has clarified its position on open source software, which has led to the Cabinet Office dispelling some of the old myths about security and open source. We know that the Cabinet Office’s ‘skunkworks’, the Government Digital Service, are using Solr for several of their projects. Francis Maude MP was recently in the USA with some of the GDS team and visited amongst others our US partners Lucid Imagination.

The British Computer Society have helped organise a series of Awareness Events for civil servants and I’m glad to be speaking at the first of these next Tuesday 21st February on open source search – hopefully this will further increase the momentum and make it even more clear that a modern Government needs to consider this modern, flexible and economically scalable approach to software.


Posted in News, events

February 17th, 2012


Mixed reactions as HP buys Autonomy

The blogotweetosphere has been positively buzzing since last night’s announcement that Hewlett Packard will be buying Autonomy for £7.1bn, while divesting itself of its PC business. Many commentators have put a positive spin on this, pointing to Autonomy’s meteoric rise from a small office in Cambridge to the behemoth it is today. It’s undoubtedly good news for Autonomy’s shareholders. Dave Kellogg correctly identifies Autonomy as a “finance company dressed in (meaning-based) technology company clothing” with a “happy ending”.

However the reaction isn’t all positive – the FT implies this deal is at the “lunatic end of the valuation spectrum”. Law Technology News says “Autonomy’s e-discovery revenue stream is high-end but unsustainable” and quotes users of the system with problems: “We had a lot of issues with the applications crashing, the documents tending not to get checked in” … “[Autonomy sales staff] were pricey, arrogant, and they couldn’t care less about us. … It cannot get any worse.”

HP will have to work hard to integrate Autonomy into both its corporate culture and software frameworks – a problem currently faced by Microsoft since its acquisition of FAST a short while ago. Stephen Arnold thinks this process will be “risky”. What it means for the rest of the search sector is harder to guess, although Martin White of Intranet Focus says this deal indicates HP can see a “future in search applications” and, interestingly, “A number of privately-held search vendors are probably working out what their valuation would be”.

My view is that this is just the latest of huge shifts in the enterprise search market, partly spurred on by the rise of open source options and the gradual realisation that the huge license fees charged by some vendors may be unsustainable. I don’t think Autonomy will be the last company looking for a safe haven in the years to come.


Posted in Business, News

August 19th, 2011


UK Government IT – a closed shop to SMEs and OSS?

There’s a lot of buzz currently around the UK government and its approach to IT projects (which has been historically rather poor in terms of delivery, schedules and cost). We’ve written before about an Action Plan that recommends open source and open standards, but it seems that actually implementing these is more of a problem, especially when you consider (flexible and more agile) smaller suppliers such as ourselves who may not even get a chance to compete for the business.

There’s an inquiry running currently that promises to look at this, and they have invited various people to put their views across. Unfortunately with one laudable exception these people were from (or mainly represent) very large IT companies who already supply the government and whose interest lies in maintaining the status quo.

As Mark Taylor of Sirius has already pointed out, this situation isn’t going to change until government procurement itself becomes an open process, so that we can all see how much could be wasted on outdated project management methods and overpriced closed source software.


Posted in News

March 18th, 2011


Bicycles, beer and bands – the first Cambridge Enterprise Search Meetup

Last night we held the first of what we hope will be a series of Meetups in our home town of Cambridge, U.K. Attending were researchers, developers and entrepreneurs in the field of search – as is the norm in Cambridge, many had cycled to the venue, and there was a friendly and informal feel to the group.

We started with my presentation on “Searching news media with open source software”, where I talked about our work for the NLA, Financial Times and others. This was followed by John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”. John showed a Grapeshot project for Virgin where different media assets can be automatically grouped together even if they have different metadata – for example an episode of the TV show “Heroes” is basically the same object whether it is broadcast, video-on-demand or a repeat, but differs from the Bowie album of the same name.
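Grapeshot’s actual approach wasn’t detailed in the talk, but the underlying idea – collapsing variants under one canonical object – can be sketched as grouping by a key derived from asset metadata. This is a toy illustration with invented fields: note that keying on title alone would wrongly merge the TV show and the album, so the key here combines title and media kind:

```python
from collections import defaultdict

def group_variants(assets, canonical_key):
    """Group asset variants under one canonical object, where
    canonical_key maps an asset's metadata to a shared identity."""
    groups = defaultdict(list)
    for asset in assets:
        groups[canonical_key(asset)].append(asset)
    return dict(groups)
```

With a key such as `lambda a: (a["title"].lower(), a["kind"])`, the broadcast, video-on-demand and repeat versions of the TV episode land in one group, while the Bowie album stays separate.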

We then broke up for discussion (and beer) – great to catch up with some ex-colleagues and meet others for the first time. Downstairs there was live music and one of our colleagues even joined the band for a spell on drums! From the feedback we received there’s definitely interest in repeating the event; if you’d like to attend next time please join the Meetup group.


Posted in events

February 17th, 2011


Cambridge Enterprise Search Meetup tomorrow

A quick reminder that our first Cambridge Enterprise Search Meetup is tomorrow, February 16th from 6.30pm. More details in my previous post. We now have two talks, one from myself on “Open Source Search for News” and one from John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”.

If you’re able to come please let us know using the Meetup website so we can organise enough refreshments!


Posted in events

February 15th, 2011


Ovum says – why bother with closed source search?

Analysts Ovum have released a report on enterprise search – it’s not clear where to obtain it yet, although Report Linker may have it available. According to one report it may also be called “Enterprise Search and Retrieval: Exploiting all of the Organisation’s Information Assets”.

Interestingly, most of the press coverage around the release is focussing on the author Mike Davis’s statements about open source solutions – in particular “…in fact, companies should only go to the big proprietary players if open source can’t deliver what they need.” He also states that “there are mere nuances between those ranked” – and this includes the open source option Solr 1.4.

This is the clearest statement yet from an analyst that enterprise search engines are all pretty much the same thing, if you strip away the marketing – but more importantly, that open source should be the first option to consider.


Posted in News

January 21st, 2011
