autonomy – Flax

Out and about in search & monitoring – Autumn 2015

Charlie Hull — Wed, 16 Dec 2015 10:24:42 +0000

It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard-to-scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.

The week after I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab’s research into complex search strategies and Digirati and Synaptica’s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).
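To give a flavour of what test-driven relevance might look like in practice, here is a minimal sketch in Python. It is not taken from the talk: the function names, the toy search function, the document IDs and the 0.5 threshold are all assumptions made for illustration. The idea is simply that relevance judgements are written down as data and checked like unit tests whenever the engine's configuration changes.

```python
from typing import Callable, Dict, List, Set


def precision_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that have been judged relevant."""
    top = results[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def run_relevance_tests(
    search: Callable[[str], List[str]],
    judgements: Dict[str, Set[str]],
    k: int = 3,
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Return the queries whose precision@k falls below the threshold."""
    failures = {}
    for query, relevant in judgements.items():
        score = precision_at_k(search(query), relevant, k)
        if score < threshold:
            failures[query] = score
    return failures


def toy_search(query: str) -> List[str]:
    """Hypothetical stand-in for a call to a real search engine."""
    canned = {
        "solr tuning": ["d1", "d7", "d3"],
        "media monitoring": ["d9", "d2", "d4"],
    }
    return canned.get(query, [])


# Hand-made judgements (invented for this example): query -> relevant doc IDs.
judgements = {
    "solr tuning": {"d1", "d3"},
    "media monitoring": {"d5"},
}

failures = run_relevance_tests(toy_search, judgements)
print(failures)
```

In a real project the toy function would be replaced by a call to the search engine under test, and the judgements would come from subject experts rather than from the developer.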

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

The post Out and about in search & monitoring – Autumn 2015 appeared first on Flax.

Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning

Charlie Hull — Wed, 04 Nov 2015 11:48:33 +0000

I’ll be speaking at several events over the next few weeks, in the UK and abroad. On the 19th of November I’ll be at the FIBEP World Media Intelligence Congress in Vienna, to talk about how we helped our client Infomedia migrate from a closed-source search engine (Autonomy IDOL and Verity) to a new platform based on Apache Lucene/Solr and our own Luwak stored search library. Infomedia are Denmark’s leading provider of media monitoring and analysis and wanted to future-proof their search platform: we’ll talk about how open source makes this possible, how we implemented stored search and handled highly complex queries, and how the new platform is scalable and flexible.

On the 25th I’ll be presenting at the London Elasticsearch Usergroup with our client Westcoast, who we have been helping with an Elasticsearch implementation. Westcoast are a B2B supplier of electronics and white goods with yearly revenues of over £1 billion, and we’ve helped them implement a powerful new search engine for their website. E-commerce is one sector where good search is an essential part of driving revenue.

Next, on the 26th I’ll be talking at one of my favourite events of the year, the British Computer Society Information Retrieval Specialist Group’s Search Solutions, on how we might improve how search engine relevance is tested. I’ll suggest a more formal process of test-based relevance tuning and show some useful tools. Our client NLA media access are also talking about the new Clipshare platform we built on Apache Lucene/Solr.

Do let me know if you’re attending and would like to chat – I’ll also be publishing slides and more information about the projects above soon.

London Lucene/Solr Usergroup – website search and indexing the cloud

Charlie Hull — Fri, 11 Sep 2015 08:58:52 +0000

This week’s London Lucene/Solr Meetup was hosted by asset management company BlackRock who also provided our first speakers. BlackRock manages an astonishing $4.7 trillion in assets (that’s more than the GDP of Germany) and operates 90 different websites with around 250,000 content items, so a good and accurate website search engine is essential. Although BlackRock use HP Autonomy’s content management system and IDOL search engine, the latter is hard to tune (‘not deterministic, and why it ranks the way it does can be mysterious’) and Ife Nkechukwu and Erica Sundberg have been investigating Apache Solr as an alternative: being open source and with powerful debugging features, Solr allows complete understanding of why a particular result is scored and ranked.

Starting with this great video (it’s from Google not BlackRock, but amusing and worth a look), Ife and Erica gave an engaging and clear presentation of their journey with Solr: how they explored the various options for crawling (Nutch and Heritrix were mentioned), how Analyzers are used to condition content for indexing and how Solr’s scoring and ranking are actually calculated. This was one of the best ‘how to get started with Solr’ presentations I have seen and I was also very pleased to hear Ife say ‘you can’t just build search and forget it – you have to tune search like an instrument’ – entirely consistent with our own experience.
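For readers new to the idea of Analyzers, the toy Python sketch below shows roughly what ‘conditioning content for indexing’ means: text is split into tokens and passed through a chain of filters, and the same chain is applied to queries so that both sides end up in the same form. This is only an illustration and not Solr or Lucene code; the stopword list and the crude plural-stripping rule are assumptions chosen to keep the example short.

```python
import re
from typing import Callable, List

# Tiny illustrative stopword list (an assumption, not Solr's default list).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text: str) -> List[str]:
    """Split text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]


def strip_plural(tokens: List[str]) -> List[str]:
    """Very crude stand-in for a stemmer: drop a single trailing 's'."""
    out = []
    for t in tokens:
        if len(t) > 3 and t.endswith("s") and not t.endswith("ss"):
            t = t[:-1]
        out.append(t)
    return out


FILTERS: List[Callable[[List[str]], List[str]]] = [
    lowercase,
    remove_stopwords,
    strip_plural,
]


def analyze(text: str) -> List[str]:
    """Run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in FILTERS:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Funds and the Markets"))  # document side
print(analyze("fund market"))                # query side
```

Because the document text and the query pass through the same chain, a search for ‘fund market’ can match a page titled ‘The Funds and the Markets’. Changing the chain changes what matches, which is one reason search needs tuning rather than a one-off setup.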

After a quick pizza break, Jim Liddle of Storage Made Easy was next up. Jim’s company provides appliances that connect to a myriad of cloud storage systems and provide a number of services (collaboration, sharing, governance, search) accessible via any computing or mobile device. Jim told us how they’d integrated Solr into their system to provide deep content search and filtering. Interestingly, Storage Made Easy chose Solr over Elasticsearch because they are ‘not quite sure where Elastic will end up in terms of commercials’ – even though Jim worked with Shay Banon (creator of Elasticsearch) at Gigaspaces. You can see Jim’s slides here where he explains how the hardest task was indexing permissions data. I was particularly interested in the ‘visual query builder’ they had developed for clients with very complex search requirements – this chimed with our own experience of working with complex media monitoring queries.

We finished with a Solr Q&A (Upayavira was kind enough to provide many of the answers) – BlackRock had kindly provided a prize for the best question (a mini quadcopter) – our winner was very happy! Thanks again to our hosts and presenters and I look forward to seeing you all again soon.

A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access

Charlie Hull — Tue, 17 Feb 2015 11:25:26 +0000

Stephen Arnold, whose blog I enjoy due to its unabashed cynicism about overenthusiastic marketing of search technology, was kind enough to send me a copy of his recent report on CyberOSINT & Next Generation Information Access (NGIA), the latter being a term he has recently coined. OSINT itself refers to intelligence gathered from open, publicly available sources, not anything to do with software licenses – so yes, this is all about the NSA, CIA and others, who as you might expect are keen on anything that can filter out the interesting from the noise. Let’s leave the definition (and the moral questionability) of ‘publicly available’ aside for now – even if you disagree with its motives, this is a use case which can inform anyone with search requirements of the state of the art and what the future holds.

The report starts off with a foreword by Robert David Steele, who has had a varied and interesting career and lately has become a cheerleader for the other kind of open source – software – as a foundation for intelligence gathering. His view is that the tools used by the intelligence agencies ‘are also not good enough’ and ‘We have a very long way to go’. Although he writes that ‘the systems described in this volume have something to offer’ he later concludes that ‘This monograph is a starting point for those who might wish to demand a “full spectrum” solution, one that is 100% open source, and thus affordable, interoperable, and scalable.’ So for those of us in the open source sector, we could consider Arnold’s report as a good indicator of what to shoot for, a snapshot of the state of the art in search.

Arnold then starts the report with some explanation of the NGIA concept. This is largely a list of the common failings of traditional search platforms (basic keyword search, oft-confusing syntax, separate silos of information, lack of multimedia features and personalization) and how they might be addressed (natural language search, automatic querying, federated search, analytics). I am unconvinced this is as big a step as Arnold suggests though: it seems rather to imply that all past search systems were badly set up and configured and somehow a NGIA system will magically pull everything together for you and tell you the answer to questions you hadn’t even asked yet.

Disappointingly the exemplar chosen in the next chapter is Autonomy IDOL: regular readers will not be surprised by my feelings about this technology. Arnold suggests the creation of the Autonomy software was influenced by cracking World War II codes, rock music and artificial intelligence, which is in my mind adding egg to an already very eggy pudding, and not in step with what I know about the background of Cambridge Neurodynamics (Autonomy’s progenitor, created very soon after – and across the corridor from – Muscat, another Cambridge Bayesian search technology firm where Flax’s founders cut their teeth on search). In particular, Autonomy’s Kenjin tool – which automatically suggested related documents – is identified as a NGIA feature, although at the time I remember it being reminiscent of features we had built a year earlier at Muscat – we even applied for a patent. Arnold does note that ‘[Autonomy founder, Mike] Lynch and his colleagues clamped down on information about the inner workings of its smart software.’ and ‘The Autonomy approach locks down the IDOL components.’ – this was a magic black box of course, with a magically increasing price tag as well. The price tag rose to ridiculous dimensions (even after an equally ridiculous writedown) when Hewlett Packard bought the company.

The report continues with analysis of various other potential NGIA contenders, including Google-funded timeline analysis specialists Recorded Future and BAE Detica – interestingly one of the search specialists from this British company has now gone on to work at Elasticsearch.

The report concludes with a look at the future, correctly identifying advanced analytics as one key future trend. However this conclusion also echoes the foreword, with ‘The cost of proprietary licensing, maintenance, and training is now killing the marketplace. Open source alternatives will emerge, and among these may be a 900 pound gorilla that is free, interoperable and scalable.’. Although I have my issues with some of the examples chosen, the report will be very useful I’m sure to those in the intelligence sector, who like many are still looking for search that works.

A new Meetup for Lucene & Solr

Charlie Hull — Mon, 01 Dec 2014 13:41:02 +0000

Last Friday we held the first Meetup for a new Apache Lucene/Solr User Group we’ve recently created (there’s a very popular one for Elasticsearch so it seemed only fair Solr had its own). My co-organiser Ramkumar Aiyengar of Bloomberg provided the venue – Bloomberg’s huge and very well-appointed presentation space in their headquarters building off Finsbury Square, which impressed attendees. As this was the first event we weren’t expecting huge numbers but among the 25 or so attending we were glad to see some from Flax clients including News UK, Alfresco and Reed.co.uk.

Shalin Mangar, Lucene/Solr committer and SolrCloud expert, started us off with a Deep Dive into some of the recent work performed on testing resilience against network failures. Inspired by this post about how Elasticsearch may be subject to data loss under certain conditions (and to be fair I know the Elasticsearch team are working on this), Shalin and his colleagues simulated a number of scary-sounding network fault conditions and tested how well SolrCloud coped – the conclusion being that it does rather well, with the Consistency part of the CAP theorem covered. You can download the Jepsen-based code used for these tests from Shalin’s employer Lucidworks’ own repository. It’s great to see effort being put into this kind of test as reliable scalability is a key requirement these days.

I was up next to talk briefly about a recent study we’ve been doing into a performance comparison between Solr and Elasticsearch. We’ll be blogging about this in more detail soon, but as you can see from my colleague Tom Mortimer’s slides there aren’t many differences, although Solr does seem to be able to support around three times the number of queries per second. We’re very grateful to BigStep (who offer some blazingly fast hosting for Elasticsearch and other platforms) for assisting with the study over the last few weeks – and we’re going to continue with the work, and publish our code very soon so others can contribute and/or verify our findings.

Next I repeated my talk from Enterprise Search and Discovery on our work with media monitoring companies on scalable ‘inverted’ search – this is when one has a large number of stored queries to apply to a stream of incoming documents. Included in the presentation was a case study based on our work for Infomedia, a large Scandinavian media analysis company, where we have replaced Autonomy IDOL and Verity with a more scalable open source solution. As you might expect the new system is based on Apache Lucene/Solr and our Luwak library.
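For anyone unfamiliar with ‘inverted’ search, the Python sketch below shows the general shape of the problem. It is emphatically not Luwak's API or implementation: the class name, the restriction to simple all-terms-required queries and the example queries are all assumptions made for illustration. Queries are stored up front and indexed by the terms they need, so that each incoming document only has to be checked against the queries that could possibly match it.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class StoredQueryMatcher:
    """Toy stored-query matcher: each query is a set of required terms."""

    def __init__(self) -> None:
        self._queries: Dict[str, Set[str]] = {}
        # term -> IDs of stored queries that mention that term
        self._by_term: Dict[str, Set[str]] = defaultdict(set)

    def register(self, query_id: str, terms: Iterable[str]) -> None:
        required = {t.lower() for t in terms}
        self._queries[query_id] = required
        for term in required:
            self._by_term[term].add(query_id)

    def match(self, document: str) -> List[str]:
        doc_terms = set(re.findall(r"[a-z0-9]+", document.lower()))
        # Step 1: cheap pre-filter, keeping only queries sharing a term.
        candidates: Set[str] = set()
        for term in doc_terms:
            candidates |= self._by_term.get(term, set())
        # Step 2: verify each candidate fully against the document.
        return sorted(q for q in candidates if self._queries[q] <= doc_terms)


matcher = StoredQueryMatcher()
matcher.register("energy", ["wind", "turbine"])
matcher.register("banking", ["bank", "merger"])
matcher.register("search", ["solr"])

story = "New wind turbine factory opens near the bank"
print(matcher.match(story))
```

With thousands of complex stored queries and a constant stream of news stories, the pre-filter step matters a great deal, since running every query against every document quickly becomes too slow.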

Thanks to Shalin for speaking and all who came – we hope to run another event soon, do let us know if you have a talk you would like to give, can offer sponsorship and/or a venue.

More than an API – the real third wave of search technology

Charlie Hull — Tue, 18 Nov 2014 12:28:22 +0000

I recently read a blog post by Karl Hampson of Realise Okana (who offer HP Autonomy and SRCH2 as closed source search options) on his view of the ‘third wave’ of search. The second wave he identifies (correctly) as open source, admitting somewhat grudgingly that “We’d heard about Lucene for years but no customers seemed to take it seriously until all of a sudden they did”. However, he also suggests that there is a third wave on its way – and this is led by HP with its IDOL OnDemand offering.

I’m afraid to say I think that IDOL OnDemand is in fact neither innovative nor market leading – it’s simply an API to a cloud hosted search engine and some associated services. Amazon Cloudsearch (originally backed by Amazon’s own A9 search engine, but more recently based on Apache Solr) offers a very similar thing, as do many other companies including Found.no and Qbox with an Elasticsearch backend. For those with relatively simple search requirements and no issues with hosting their data with a third party, these services can be great value. It is however interesting to see the transition of Autonomy’s offering from a hugely expensive license fee (plus support) model to an on-demand cloud service: the HP acquisition and the subsequent legal troubles have certainly shaken things up! At a recent conference I heard a HP representative even suggest that IDOL OnDemand is ‘free software’ which sounds like a slightly desperate attempt to jump on the open source bandwagon and attract some hacker interest without actually giving anything away.

So if a third wave of search technology does exist, what might it actually be? One might suggest that companies such as Attivio or our partners Lucidworks, with their integrated solutions built on proven and scalable open source cores and folding in Hadoop and other Big Data stacks, are surfing pretty high at present. Others such as Elasticsearch (the company) are offering advanced analytical capabilities and easy scalability. We hear about indexes of billions of items, thousands of separate indexes: the scale of some of these systems is incredible and only economically possible where license fees aren’t a factor. Across our own clients we’re seeing searches across huge collections of complex biological data and monitoring systems handling a million new stories a day. Perhaps the third wave of search hasn’t yet arrived – we’re just seeing the second wave continue to flood in.

One interesting potential third wave is the use of search technology to handle even higher volumes of data (which we’re going to receive from the Internet of Things apparently) – classifying, categorising and tagging streams of machine-generated data. Companies such as Twitter and LinkedIn are already moving towards these new models – Unified Log Processing is a commonly used term. Take a look at a recent experiment in connecting our own Luwak stored query library to Apache Samza, developed at LinkedIn for stream processing applications.

Enterprise Search & Discovery 2014, Washington DC

Charlie Hull — Wed, 12 Nov 2014 10:49:57 +0000

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback (there are slides available to conference attendees, I’ll publish them more widely soon, but this talk was about media monitoring, our Luwak library and how we have successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm).

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is far more focused on a business rather than technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ‘search sucks, everyone hates search’ theme (before explaining of course that their own solution would suck less) which I’m personally becoming a little tired of – if we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners-Lee’s seminal paper.

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.

Enterprise Search Europe 2014 day 2 – futures, text mining and images

Charlie Hull — Fri, 02 May 2014 13:45:29 +0000

Staying over in London due to the aforementioned tube strike proved to be a good idea and a large fried breakfast an even better one, so I arrived at the second day of the conference right on time and ready for the second day’s keynote by Jeff Fried of BA Insight and Professor Elaine Toms from Sheffield University, who hadn’t met before the event but spoke in turn on the Future of Search. Jeff’s expert and challenging view included some depressing statistics (only 4-5% of search projects succeed completely) and a description of an all-too-familiar ‘Search Immaturity Cycle’ – buy search technology, build application, discover it’s failing, attempt to work out why and then give up and try a new search technology. The positive side of his argument was that real progress has been made – search has a much lower TCO (in part due to the rise of open source), is more widely used and is far easier to administrate and run. He also mentioned some groundbreaking projects attempting to ‘understand the world’ that should inspire better enterprise search – IBM’s Watson and Wolfram Alpha.

After a brief but friendly argument with Jeff about sport(s) Elaine took over with the view from academia – describing the various academic disciplines linked to search and how they sometimes fail to link up, and how we have attempted with limited success to transplant a highly structured way of dealing with data onto the essentially unstructured real world, without taking proper notice of the wide variety of contexts (i.e. the myriad influences acting on people in the working environment that affect their information needs). She told us that we should work towards ‘providing the right information at the point of decision making’ and must identify the work task we are trying to assist, developing small and single-purpose applications based on search. Jeff, returning to the stage, told us that search itself will disappear as the pure functionality becomes ubiquitous and invisible. I’m not sure I agree about intelligent assistants though, I thought we’d killed that idea a long time ago (and Autonomy never had much luck with their Kenjin application – I was working on something similar at the time).

Next I popped in to hear Michael Upshall talk about the various text mining methods available and how they were investigated for CABI, including an interesting project Plantwise allowing farmers to find out which pest might affect their crops. I missed the next talk as I had some work to catch up on but returned to hear Dr. Haiming Liu list various multimedia search resources, some better than others: as she said there’s a large ‘semantic gap’ with most of these services and they work best in constrained domains. The final presentation of the day for me was from Martin Dotter and Olaf Peters about a large-scale project to develop content processing for Airbus’ enterprise search engine – again, the scale was very impressive, with over 80,000 users and 4000 business applications in Airbus’ IT landscape. They described how they had developed a detailed process for gathering data from all the various content repositories and owners, resulting in a 44 million document index.

I had to leave before the last panel unfortunately so missed Jeff and Elaine’s re-take on the future of search. This year’s event was in my view the best since Enterprise Search Europe began: some great talks, informative and friendly networking and flawless organisation. Thanks to everyone involved and see you next time! Remember most of the slides are available here.

ElasticSearch London Meetup – a busy and interesting evening!

Charlie Hull — Wed, 26 Feb 2014 13:44:43 +0000

I was lucky enough to attend the London ElasticSearch User Group’s Meetup last night – around 130 people came to the Goldman Sachs offices in Fleet Street with many more on the waiting list. It signifies quite how much interest there is in ElasticSearch these days and the event didn’t disappoint, with some fascinating talks.

Hugo Pickford-Wardle from Rely Consultancy kicked off with a discussion about how ElasticSearch allows for rapid ‘hard prototyping’ – a way to very quickly test the feasibility of a business idea, and/or to demonstrate previously impossible functionality using open source software. His talk focussed on how a search engine can help to surface content from previously unconnected and inaccessible ‘data islands’ and can help promote re-use and repurposing of the data, and can lead clients to understand the value of committing to funding further development. Examples included a new search over planning applications for Westminster City Council. Interestingly, Hugo mentioned that during one project ElasticSearch was found to be 10 times faster than the closed source (and very expensive) Autonomy IDOL search engine.

Next was Indy Tharmakumar from our hosts Goldman Sachs, showing how his team have built powerful support systems using ElasticSearch to index log data. Using 32 single-core CPU instances, the system they have built can store 1.2 billion log lines with a throughput of up to 40,000 messages a second (the systems monitored produce 5TB of log data every day). Log data is queued up in Redis, distributed to many Logstash processes and indexed by Elasticsearch, with a Kibana front end. They learned that Logstash can be particularly CPU intensive but Elasticsearch itself scales extremely well. Future plans include considering Apache Kafka as a data backbone.
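The queue, workers and index arrangement described above can be imitated in a few lines of standard-library Python. This is only a sketch of the pattern, not the Goldman Sachs system: a `queue.Queue` stands in for Redis, worker threads stand in for the Logstash processes, a dictionary stands in for the Elasticsearch index, and the log line format is an assumption invented for the example.

```python
import queue
import threading
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_line(line: str) -> Dict[str, str]:
    """Parse an assumed 'timestamp level message' log line format."""
    timestamp, level, message = line.split(" ", 2)
    return {"timestamp": timestamp, "level": level, "message": message}


def worker(buffer: queue.Queue, index: Dict[str, List[dict]],
           lock: threading.Lock) -> None:
    """Pull raw lines off the buffer, parse them and add them to the index."""
    while True:
        line = buffer.get()
        if line is None:  # sentinel: no more work for this worker
            break
        event = parse_line(line)
        with lock:
            index[event["level"]].append(event)


def run_pipeline(lines: Iterable[str], workers: int = 4) -> Dict[str, List[dict]]:
    buffer: queue.Queue = queue.Queue()      # stands in for the Redis queue
    index: Dict[str, List[dict]] = defaultdict(list)  # stands in for the index
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(buffer, index, lock))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for line in lines:
        buffer.put(line)
    for _ in threads:
        buffer.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return index


sample = [
    "2014-02-25T10:00:00 ERROR disk full",
    "2014-02-25T10:00:01 INFO started",
    "2014-02-25T10:00:02 ERROR timeout",
]
index = run_pipeline(sample)
print({level: len(events) for level, events in index.items()})
```

The point of the buffer is that bursts of log data can pile up without being lost while the parsing stage, which is the CPU-hungry part according to the talk, is scaled out by adding workers.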

The third presentation was by Clinton Gormley of ElasticSearch, talking about the new cross field matching features that allow term frequencies to be summed across several fields, preventing certain cases where traditional matching techniques based on Lucene’s TF/IDF ranking model can produce some unexpected behaviour. Most interesting for me was seeing Marvel, a new product from ElasticSearch (the company), containing the Sense developer console allowing for on-the-fly experimentation. I believe this started as a Chrome plugin.
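To see why per-field statistics can give odd results, consider a toy people search in Python. This is a sketch of the general idea only and not how Elasticsearch implements cross field matching: the four records, the smoothed IDF formula and the choice to sum document frequencies across fields are all assumptions for illustration.

```python
import math
from typing import Dict, List

# Invented records: 'smith' is common as a last name, rare as a first name.
DOCS: List[Dict[str, str]] = [
    {"first_name": "will", "last_name": "smith"},
    {"first_name": "smith", "last_name": "jones"},
    {"first_name": "anna", "last_name": "smith"},
    {"first_name": "john", "last_name": "smith"},
]
FIELDS = ("first_name", "last_name")


def doc_freq(term: str, field: str) -> int:
    """Number of documents whose given field equals the term."""
    return sum(1 for d in DOCS if d[field] == term)


def idf(df: int, n: int) -> float:
    """A simple smoothed inverse document frequency (illustrative formula)."""
    return math.log(1 + n / (df + 1))


def per_field_score(doc: Dict[str, str], term: str) -> float:
    """Each field is scored with its own document frequency; best field wins."""
    return max(
        idf(doc_freq(term, f), len(DOCS)) if doc[f] == term else 0.0
        for f in FIELDS
    )


def cross_field_score(doc: Dict[str, str], term: str) -> float:
    """Fields are treated as one: document frequencies are summed first."""
    if not any(doc[f] == term for f in FIELDS):
        return 0.0
    blended_df = sum(doc_freq(term, f) for f in FIELDS)
    return idf(blended_df, len(DOCS))


for doc in DOCS:
    print(doc, round(per_field_score(doc, "smith"), 3),
          round(cross_field_score(doc, "smith"), 3))
```

With per-field scoring, the one person whose first name is Smith gets the highest score for ‘smith’, simply because the term is rare in that field. Blending the statistics across fields removes that accidental boost, so every record containing the term scores the same for it.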

The last talk, by Mark Harwood, again from ElasticSearch, was the most interesting for me. Mark demonstrated how to use a new feature (planned for the 1.1 release, or possibly later), an Aggregator for significant terms. This allows one to spot anomalies in a data set – ‘uncommon common’ occurrences as Mark described it. His prototype showed a way to visualise UK crime data using Google Earth, identifying areas of the country where certain crimes are most reported – examples including bike theft here in Cambridge (which we’re sadly aware of!). Mark’s Twitter account has some further information and pictures. This kind of technique allows for very powerful analytics capabilities to be built using Elasticsearch to spot anomalies such as compromised credit cards and to use visualisation to further identify the guilty party, for example a hacked online merchant. As Mark said, it’s important to remember that the underlying Lucene search library counts everything – and we can use those counts in some very interesting ways.
UPDATE: Mark has posted some code from his demo here.
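The ‘uncommon common’ idea can be sketched in Python as a comparison between how often a term appears in a subset (the foreground) and how often it appears in the whole data set (the background). This is not Mark's code, and the scoring formula below is an assumption chosen for simplicity rather than a statement of what Elasticsearch computes; the crime counts are invented.

```python
from collections import Counter
from typing import List, Tuple


def significant_terms(
    foreground: List[str], background: List[str], min_count: int = 2
) -> List[Tuple[str, float]]:
    """Rank terms that are over-represented in the foreground set."""
    fg_counts = Counter(foreground)
    bg_counts = Counter(background)
    scored = []
    for term, fg_count in fg_counts.items():
        if fg_count < min_count:
            continue  # too rare in the subset to say anything useful
        fg_rate = fg_count / len(foreground)
        bg_rate = bg_counts[term] / len(background)
        if bg_rate == 0 or fg_rate <= bg_rate:
            continue  # not over-represented
        # Absolute change times relative change (illustrative heuristic).
        score = (fg_rate - bg_rate) * (fg_rate / bg_rate)
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented figures: crime types nationwide versus in one city.
nationwide = (
    ["burglary"] * 40 + ["vehicle crime"] * 30
    + ["shoplifting"] * 20 + ["bike theft"] * 10
)
one_city = ["bike theft"] * 6 + ["burglary"] * 3 + ["shoplifting"] * 1

result = significant_terms(one_city, nationwide)
print(result)
```

Burglary is the most common crime in both sets, so it tells us nothing about the city. Bike theft is uncommon nationally but common locally, which is exactly the kind of anomaly this approach surfaces, and everything needed to compute it is a count the search library already holds.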

The evening closed with networking, pizza and beer with a great view over the City – thanks to Yann Cluchey for organising the event. We have our own Cambridge Search Meetup next week and we’re also featuring ElasticSearch, as does the London Search Meetup a few weeks later – hope to see you there!

Time for the crystal ball again…

Charlie Hull — Tue, 07 Jan 2014 17:05:41 +0000

It’s always fun to make predictions about the future, especially as one can be pretty sure to be proved wrong in interesting ways. At the start of 2014 we at Flax are looking forward to another year of building open source search and we already have some great client projects in progress that we’ll shortly be able to talk about, but what else might be happening this year? Here are some points to note:

The Elasticsearch project continues to add features at a prodigious rate during the arms race between it and Apache Solr – this battle can only be good news for end users in our view. We can expect a 1.0 release of Elasticsearch this year and several further major 4.x releases of Solr.
The Solr world has become slightly more complex as original author Yonik Seeley has left Lucidworks to start his own company, Heliosearch – with its own packaged distribution of Solr. How will Heliosearch contribute to the Solr ecosystem?
HP Autonomy is a sponsor of the Enterprise Search Europe conference this year, although there’s still some fallout from HP’s acquisition of Autonomy, and little news from the various official investigations into this process. Perhaps this year HP’s overall strategy will become a little clearer.
The Big Data bandwagon rolls on and more or less every search company now stresses its capabilities in this area for marketing purposes: but how big is Big? It’s not enough just to re-quote IDC’s latest study on how many exabytes everyone is producing these days; the value is in the detail, not the sheer volume: good (and deep) analytics is the key.
We think there might be some interesting things happening around open source search and bioinformatics soon – watch this space!