Innovations in Knowledge Organisation, Singapore: a review

I’m just back from Singapore: my first visit to this amazing, dynamic and ever-changing city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. It was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search – a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch and Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging, although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing PowerPoint presentations, which don’t necessarily combine well with jetlag.

After lunch, which showcased Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again), we moved on to the first set of case studies. Each presenter had six minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, over three 20-minute sessions. I had some great discussions, including hearing about how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on ‘The Future of Search’ – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyse the huge volumes of data we can expect from the Internet of Things. I had some great feedback from the audience (although I’m pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world’s first ‘smart nation’, and handling data will be absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore’s OneSearch service, which allows federated search across tens of millions of items from many different repositories (books, newspaper articles, audio transcripts and more). The technologies used include Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system; Solr in particular was credited with saving ‘millions of Singapore dollars’ in licence fees compared to the closed source search system it replaced. Also of interest were Straits Knowledge’s system for capturing the knowledge assets of an organisation, built on a graph database, and Haliza Jailani’s talk on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

London Lucene/Solr Usergroup – Search Relevancy & Hacking Lucene with Doug Turnbull

Last week Doug Turnbull of US-based Open Source Connections visited the UK and spoke at our Meetup. His first talk was on Search Relevancy, an area we often deal with at Flax: how to tune a search engine to give results that our clients deem relevant, without affecting the results for other queries. Using a client project as an example, Doug talked about how he created a tool to record relevance judgements for a set of queries (a ‘case’). The underlying Solr search engine can then be adjusted and the tool re-runs the queries to show any change in the position of the scored results. Slides and video of the talk are available – thanks to our hosts SkillsMatter for these.

The tool, Quepid, is a great way to allow non-developers to score search results – in most cases we have seen, if this kind of testing is done at all, it is recorded using spreadsheets. The tests then need to be re-run manually and the scores updated, which can make the tuning process take far too long. This whole area is in need of some rigour and best practice, and to that end Doug is writing a book on Relevant Search, which we’re very much looking forward to.
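To make the contrast with spreadsheets concrete, here is a minimal sketch of the kind of automated relevance check a tool like Quepid makes possible – not Quepid itself, just an illustration against a hypothetical local Solr core, with made-up field names and judgements:

```python
import requests

SOLR_SELECT = "http://localhost:8983/solr/products/select"  # hypothetical core

# Hand-scored judgements: query -> documents we expect to see near the top
JUDGEMENTS = {
    "red shoes": ["SKU123", "SKU456"],
    "waterproof jacket": ["SKU789"],
}

def top_ids(query, rows=10):
    """Return the ids of the top results for a query."""
    params = {"q": query, "rows": rows, "fl": "id", "wt": "json"}
    resp = requests.get(SOLR_SELECT, params=params)
    resp.raise_for_status()
    return [doc["id"] for doc in resp.json()["response"]["docs"]]

def score(query, expected):
    """Crude metric: the fraction of expected documents found in the top 10."""
    found = set(top_ids(query)) & set(expected)
    return len(found) / len(expected)

if __name__ == "__main__":
    for query, expected in JUDGEMENTS.items():
        print(f"{query!r}: {score(query, expected):.2f}")
```

Even something this crude means the judgements only have to be gathered once: re-scoring after each tuning change becomes a single command rather than an afternoon with a spreadsheet.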

Doug’s second talk was on Hacking Lucene for custom search results, during which he dissected how Lucene queries actually work and how custom scoring algorithms can be used to change search ranking. Although highly technical in parts – and, as Doug said, writing custom Lucene code is one of the hardest ways to influence ranking and thus relevance – it was a great window on Lucene’s low-level behaviour. Again, slides and video are available.
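As a counterpoint to the low-level approach, ranking can often be influenced far more simply at the Solr level with a function query. Here is a hedged sketch (not something from Doug’s talk), assuming an edismax handler and a hypothetical published_date field:

```python
import requests

SOLR = "http://localhost:8983/solr/demo/select"  # hypothetical core and fields

# Multiply the text-relevance score by a recency function, so that newer
# documents rise in the ranking without touching Lucene internals at all.
params = {
    "q": "lucene scoring",
    "defType": "edismax",
    "qf": "title^3 body",
    "boost": "recip(ms(NOW,published_date),3.16e-11,1,1)",  # standard recency curve
    "wt": "json",
}
print(requests.get(SOLR, params=params).json()["response"]["numFound"])
```

Custom Lucene scoring is worth the effort when function queries can’t express what you need, but it is a much bigger hammer.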

Thanks to all who came and especially Doug for coming so far to present his talks!


Posted in Technical, events

June 11th, 2015


Going international – open source search in London, Berlin & Singapore

We’re travelling a bit over the next few weeks to visit and speak at various events. This weekend Alan Woodward is at Berlin Buzzwords, a hacker-focused conference with a programme full of search talks. He’s not speaking this year, but if you want to talk about Lucene, Solr or our own Luwak stored search library and the crazy things you can do with it, do buy him a beer!

Next week we’re hosting another London Lucene/Solr User Group Meetup with Doug Turnbull of Open Source Connections. Doug is the author of a forthcoming book on Relevant Search and the creator of Quepid, a tool for gathering relevance judgements for Solr-based search systems and then seeing how these scores change as you tune the Solr installation. Tuning relevance is a very common (and often difficult) task during search projects and can make a significant difference to the user experience (and in particular, for e-commerce can hugely affect your bottom line) – so we’re very much looking forward to Doug’s talk.

The week after, I’m in Singapore for the Innovations in Knowledge Organisation conference – a new event focusing on knowledge management and search. I’ve been asked to talk about open source search, and to keynote the second day of the event with a session on ‘The Future of Search’. Do let me know if you’re attending and would like to meet up.


Posted in events

May 29th, 2015


When bad search hurts: finding that elusive ROI

One thing I’ve noticed from many years attending search conferences is that the return on investment (ROI) in search technology is hard to calculate: this is particularly difficult when considering intranet and/or enterprise search, as users can usually find another way to answer their question. In most cases, some slightly tired numbers are trotted out from years-old studies, based on how much employee time a better search engine might save. I’ve never really believed this to be a sensible metric; however if you’re trying to sell a search solution (especially an overpriced, closed source, magic black box that ‘understands your content’) it may be all you can rely on.

We’ve been working recently with a couple of clients who sell significant volumes of products via their websites: in both cases the current search solution is underperforming. In this situation it’s far easier to justify an investment in better search: if customers can’t find products they simply won’t be buying them. Simple changes to the search algorithms can have huge impacts, costing or saving the company millions in revenue. However, installing a new search engine is just the beginning – it’s vital to consider a solid test strategy. To start with, the new engine should be at least as good as the old one in terms of relevance. The search logs will show what terms your users have typed into the search box, and these can be used to construct a set of queries that can be tested against both engines, with the results being scored by content experts, beta testers and even small groups of real customers. The results of this scoring can be used to inform how the new engine can be tuned – and of course, this should be an ongoing process, as search should never be built as a ‘fire and forget’ project.
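As a rough illustration of that workflow, here is a hedged sketch that replays logged queries against both engines and flags the ones most worth putting in front of human scorers – the endpoints, response shapes and log format are all assumptions:

```python
import csv
import requests

OLD = "http://old-search.example.com/api/search"    # hypothetical legacy endpoint
NEW = "http://localhost:8983/solr/products/select"  # hypothetical new Solr core

def old_top(query, size=5):
    r = requests.get(OLD, params={"q": query, "size": size})
    return [hit["id"] for hit in r.json()["hits"]]

def new_top(query, rows=5):
    r = requests.get(NEW, params={"q": query, "rows": rows, "fl": "id", "wt": "json"})
    return [doc["id"] for doc in r.json()["response"]["docs"]]

with open("query_log.csv") as f:       # one logged query per row
    for (query,) in csv.reader(f):
        old_ids, new_ids = old_top(query), new_top(query)
        if not new_ids:
            print(f"NO RESULTS on the new engine: {query!r}")
        elif old_ids and old_ids[0] != new_ids[0]:
            print(f"Top result changed for {query!r}: {old_ids[0]} -> {new_ids[0]}")
```

Zero-result queries and changed top hits are usually the quickest regressions to spot, and they give content experts a manageable shortlist to score first.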

There’s sadly little available on the subject of real-world relevance testing, although there’s a forthcoming book by Doug Turnbull and John Berryman. Based on our work with the clients above, we hope later this year to be able to talk further about relevance testing and tuning – and how to do it right for e-commerce, avoiding significant financial risk.

UPDATE: Doug has been kind enough to give our readers a discount code for his book – “39turnbull” – not sure how long this will last but it gives you 39% off the price so worth a try!


Posted in Uncategorized

April 29th, 2015


Lucene/Solr London Meetup – BioSolr and Query Deep Dive

This week we held another Lucene/Solr London User Group event, kindly hosted by Barclays at their funky Escalator space in Whitechapel. First to talk were two colleagues of mine, Matt Pearce and Tom Winch, on the BioSolr project: funded by the BBSRC, this is an opportunity for us to work with bioinformaticians at the European Bioinformatics Institute on improving search facilities for systems including the Protein Data Bank in Europe (PDBe). Tom spoke about how we’ve added features to Solr for autocompleting searches using facets, and about a new way of integrating external similarity systems with Solr searches – in this case an EBI system that works with protein data – which we’ve named XJoin. Matt then spoke about various ways to index ontology data and how we’re hoping to work towards a standard method for working with ontologies using Solr. The code we’ve developed so far is available in our GitHub repository and the slides are available here.
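As an illustration of the facet-based autocomplete idea (not the BioSolr code itself – see the GitHub repository for that), here is a minimal sketch using Solr’s facet.prefix parameter against a hypothetical core and field:

```python
import requests

SOLR = "http://localhost:8983/solr/pdbe/select"  # hypothetical core and field names

def suggest(prefix, field="keywords", limit=10):
    """Ask Solr for indexed terms in a field that start with what the user has typed."""
    params = {
        "q": "*:*",
        "rows": 0,                       # we only want facet counts, not documents
        "facet": "true",
        "facet.field": field,
        "facet.prefix": prefix.lower(),  # assumes the field is indexed lowercased
        "facet.limit": limit,
        "wt": "json",
    }
    resp = requests.get(SOLR, params=params).json()
    counts = resp["facet_counts"]["facet_fields"][field]
    # Solr returns a flat [term, count, term, count, ...] list by default
    return list(zip(counts[::2], counts[1::2]))

print(suggest("kin"))   # e.g. [('kinase', 1042), ('kinetics', 310), ...]
```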

Next was Upayavira of Odoko Ltd., expert Solr trainer and Apache Software Foundation member, with an engaging talk about Solr queries. Amongst other things he showed us some clever ways to parameterise queries so that a Solr endpoint can be customised for a particular purpose, and how to combine different query parsers. His slides are available here.
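One standard way to achieve this kind of parameterisation in Solr is dereferencing inside local params – a sketch of the general idea (not necessarily Upayavira’s exact examples), with a hypothetical core and fields:

```python
import requests

SOLR = "http://localhost:8983/solr/demo/select"  # hypothetical core

# The query parser and field boosts live in one parameter, the user's text in
# another; Solr's parameter dereferencing ($site_qf, $user_q) substitutes the
# values at query time, so the endpoint's behaviour can be tuned without
# touching application code.
params = {
    "q": "{!edismax qf=$site_qf v=$user_q}",
    "site_qf": "title^5 body",     # boosts chosen for this particular endpoint
    "user_q": "solr meetup",       # the raw user query
    "wt": "json",
}
print(requests.get(SOLR, params=params).json()["response"]["numFound"])
```

In practice the site_qf value (and even the choice of query parser) would usually be set as defaults in the request handler configuration in solrconfig.xml, leaving the application to supply only user_q.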

Thanks to all our speakers, to Barclays for providing the venue and some very tasty food, and to all who attended. We’re hoping the next event will be in the first week of June and will feature talks on measuring and improving relevancy with Solr.

Elastic London User Group Meetup – scaling with Kafka and Cassandra

The Elastic London User Group Meetup this week was slightly unusual in that the talks focussed not so much on Elasticsearch itself but rather on how to scale the systems around it using other technologies. First up was Paul Stack with an amusing description of how he had worked on scaling the logging infrastructure for a major restaurant booking website, to cope with hundreds of millions of messages a day across up to six datacentres. Moving from an original architecture based on SQL and ASP.NET, they started by using Redis as a queue and Logstash to feed the logs to Elasticsearch. Further instances of Logstash were added to glue other parts of the system together, but Redis proved unable to handle this volume of data reliably and a new architecture was developed based on Apache Kafka, a highly scalable message-passing platform originally built at LinkedIn. Kafka proved very good at retaining data even under fault conditions. He continued with a description of how the Kafka architecture was further modified (not entirely successfully) and how monitoring systems based on Nagios and Graphite were developed for both the Kafka and Elasticsearch nodes (with the infamous split-brain problem being one condition to watch for). Although the project had its problems, the system did manage to cope with 840 million messages one Valentine’s Day, which is impressive. Paul concluded that although scaling to this level is undeniably hard, Kafka was a good technology choice. Some of his software is available as open source.

Next, Jamie Turner of PostcodeAnywhere described in general terms how they had used Apache Cassandra and Apache Spark to build a scalable architecture for logging interactions with their service, so they could learn about and improve customer experiences. They explored many different options for their database, including MySQL and MongoDB (regarding Mongo, Jamie raised a laugh with ‘bless them, they do try’), before settling on Cassandra, which does seem to be a popular choice for a rock-solid distributed database. As PostcodeAnywhere are a Windows house, the availability and performance of .NET-compatible clients was key, and luckily they have had a good experience with the NEST client for Elasticsearch. Although light on technical detail, Jamie did mention how they use Markov chains to model customer experiences.

After a short break for snacks and beer we returned for a Q&A with Elastic team members: one interesting announcement was that there will be an Elastic{ON} event in Europe some time this year (if anyone from the Elastic team is reading this, please try to avoid a clash with Enterprise Search Europe on October 20th/21st!). Thanks as ever to Yann Cluchey for organising the event and to open source recruiters eSynergySolutions for sponsoring the venue and refreshments.

Free file filters, search & taxonomy tools from our old Google Code repository

Google’s Google Code service is closing down, in case you hadn’t heard, and I’ve just started the process of moving everything over to our GitHub account. This prompted me to take a look at what’s there, and there’s a surprising amount of open source code I’d forgotten about. So, here’s a quick rundown of the useful tools, examples and crazy ideas we’ve built over the years – perhaps you’ll find some of it useful – but please do bear in mind that we’re not officially supporting most of it!

  • Flax Basic is a simple enterprise search application built using Python and the Xapian search library. You can install this on your own Unix or Windows system to index Microsoft Office, PDF, RTF and HTML files and it provides a simple web application to search the contents of the files. Although the UI is very basic, it proved surprisingly popular among small companies who don’t have the budget for a ‘grown up’ search system.
  • Clade is a proof-of-concept classification system with a built-in taxonomy editor. Each node in the taxonomy is defined by a collection of words: as a document is ingested, if it contains these words then it is attached to the node. We’ve written about Clade previously. Again this is a basic tool but has proved popular and we hope one day to extend and improve it.
  • Flax Filters are a set of Python programs for extracting plain text from a number of common file formats – which is useful for indexing these files for search. The filters use a number of external programs (such as OpenOffice in ‘headless’ mode) to extract the data.
  • The Lucene Redis Codec is a (slightly crazy) experiment in how the Lucene search engine could store indexed data not on disk, but in a database – our intention was to see if frequently-updated data could be changed without Lucene noticing. Here’s what we wrote at the time.
  • There’s also a tool for removing fields from a Lucene index, a prototype web service interface with a JSON API for the Xapian search engine and an early version of a searchable database for historians, but to be honest these are all pre-alpha and didn’t get much further.

If you like any of these tools feel free to explore them further – but remember your hard hat and archaeology tools!


Posted in Technical

March 19th, 2015


Rebrands and changing times for Elasticsearch

I’ve always been careful to distinguish between Elasticsearch (the open source search server based on Lucene) and Elasticsearch (the company formed by the authors of the former), and it seems someone was listening, as the latter has now rebranded as simply Elastic. This was one of the big announcements during their first conference, the other being that after acquiring Norwegian company Found they are now offering a fully hosted Elasticsearch-as-a-service (congratulations to Alex and others at Found!). As Ben Kepes of Forbes writes, this may be something to do with ‘managing tensions within the ecosystem’ (I’ve written previously on how this ecosystem is expanding to include closed-source commercial products, which may make open source enthusiasts nervous), but it’s also an attempt to move away from ‘search’ into a wider area encompassing the buzzwords du jour of Big Data Analytics.

In any case, it’s clear that Elastic (the company, and that’s hopefully the last time I’ll have to write this!) have a clear strategy for the future – to provide many different commercial options for Elasticsearch and its related projects for as many different use cases as possible. Of course, you can still take the open source route, which we’re helping several clients with at present – I hope to be able to present a case study on this very soon.

Meanwhile, Martin White has identified how a recent book on Elasticsearch describes literally hundreds of features and that ‘The skill lies in knowing which to implement given the nature of the content and the type of query that will be used’ – effective search, as ever, remains a difficult thing to get right, no matter what technology option you choose.

UPDATE: It seems that www.elasticsearch.org, the website for the open source project, is now redirecting to the commercial company’s website… there is now a new GitHub page for open source code at https://github.com/elastic


Posted in News

March 11th, 2015


IntraTeam 2015 – a brief visit

Last week I dropped in on the IntraTeam 2015 conference in Copenhagen, an event focused on intranets with some content on enterprise search. After a rather pleasant evening of Thai food and networking I attended the last day of the event. The keynote speaker was Dave Snowden, who has an amusing and rather curmudgeonly style of presentation, making sure to note the previous presenters he’d disagreed with for their over-reliance on simplistic concepts of knowledge and how the brain works. His talk was however very interesting and introduced the Cynefin framework (a Welsh word which apparently refers to homing sheep!). He also discussed how the rush to digitisation has had a cost in terms of human cognition, how the concept of an intranet will soon disappear (a brave assertion at an intranet conference) and how future systems should perhaps use storytelling metaphors – with some great examples of how collecting these micro-narratives from employees and others can produce extremely rapid feedback on the health of a business.

Andreas Hallgren of Chalmers University showed the evolution of their site-wide search facility, now based on Apache Solr. Unsurprisingly one of the main problems was determining who ‘owns’ search in their organisation: at least now they have a staff member who dedicates 25% of their time to improving search. He had some interesting points about the seasonality of academic searches and how analytics can be used to ‘measure more, guess less’. I was up next talking about Search Turned Upside Down, using a similar set of slides to this one: thanks to all who came and asked some great questions.

Next was Helen Lippell, whom I have heard speak before on how to get Enterprise Search right – Helen had some great anecdotes and guidance for an attentive audience. Ed Dale followed with five tips for great search: index the right content, optimise this content, measure search, make a great UI and listen to your users – I can only agree! He also characterised the different kinds of content, including the worrying ‘content we think we have but we don’t’. The last presentation I attended was by Anders Quitzau of IBM on their fascinating Watson technology: sadly this was a rather marketing-heavy set of slides, with plenty of newly minted buzzwords such as Cognitive Computing and very little useful detail.

Thanks to Kurt Kragh Sorenson and Kristian Norling for inviting me to speak at and attend the conference – next time I hope to see a little more of the event!

A review of Stephen Arnold’s CyberOSINT & Next Generation Information Access

Stephen Arnold, whose blog I enjoy due to its unabashed cynicism about overenthusiastic marketing of search technology, was kind enough to send me a copy of his recent report on CyberOSINT & Next Generation Information Access (NGIA), the latter being a term he has recently coined. OSINT itself refers to intelligence gathered from open, publicly available sources – nothing to do with software licences – so yes, this is all about the NSA, CIA and others, who as you might expect are keen on anything that can filter out the interesting from the noise. Let’s leave the definition (and the moral questionability) of ‘publicly available’ aside for now – even if you disagree with its motives, this is a use case which can inform anyone with search requirements of the state of the art and what the future holds.

The report starts with a foreword by Robert David Steele, who has had a varied and interesting career and lately has become a cheerleader for the other kind of open source – software – as a foundation for intelligence gathering. His view is that the tools used by the intelligence agencies ‘are also not good enough’ and ‘We have a very long way to go’. Although he writes that ‘the systems described in this volume have something to offer’, he later concludes that ‘This monograph is a starting point for those who might wish to demand a “full spectrum” solution, one that is 100% open source, and thus affordable, interoperable, and scalable.’ For those of us in the open source sector, then, Arnold’s report is a good indicator of what to shoot for: a snapshot of the state of the art in search.

Arnold opens the main report with an explanation of the NGIA concept. This is largely a list of the common failings of traditional search platforms (basic keyword search, oft-confusing syntax, separate silos of information, lack of multimedia features and personalisation) and how they might be addressed (natural language search, automatic querying, federated search, analytics). I am unconvinced this is as big a step as Arnold suggests, though: it seems rather to imply that all past search systems were badly set up and configured, and that somehow an NGIA system will magically pull everything together for you and tell you the answer to questions you hadn’t even asked yet.

Disappointingly, the exemplar chosen in the next chapter is Autonomy IDOL: regular readers will not be surprised by my feelings about this technology. Arnold suggests the creation of the Autonomy software was influenced by cracking World War II codes, rock music and artificial intelligence, which to my mind is adding egg to an already very eggy pudding, and not in step with what I know about the background of Cambridge Neurodynamics (Autonomy’s progenitor, created very soon after – and across the corridor from – Muscat, another Cambridge Bayesian search technology firm where Flax’s founders cut their teeth on search). In particular, Autonomy’s Kenjin tool – which automatically suggested related documents – is identified as an NGIA feature, although at the time I remember it being reminiscent of features we had built a year earlier at Muscat; we even applied for a patent. Arnold does note that ‘[Autonomy founder, Mike] Lynch and his colleagues clamped down on information about the inner workings of its smart software’ and that ‘The Autonomy approach locks down the IDOL components’ – this was a magic black box, of course, with a magically increasing price tag as well. The price tag rose to ridiculous dimensions (even after an equally ridiculous writedown) when Hewlett Packard bought the company.

The report continues with analysis of various other potential NGIA contenders, including Google-funded timeline analysis specialists Recorded Future and BAE Detica – interestingly, one of the search specialists from this British company has now gone on to work at Elasticsearch.

The report concludes with a look at the future, correctly identifying advanced analytics as one key trend. However, this conclusion also echoes the foreword: ‘The cost of proprietary licensing, maintenance, and training is now killing the marketplace. Open source alternatives will emerge, and among these may be a 900 pound gorilla that is free, interoperable and scalable.’ Although I have my issues with some of the examples chosen, I’m sure the report will be very useful to those in the intelligence sector, who like many are still looking for search that works.