Posts Tagged ‘taxonomy’

Innovations in Knowledge Organisation, Singapore: a review

I’m just back from Singapore: my first visit to this amazing, dynamic and ever-changing city-state, at the kind invitation of Patrick Lambe, to speak at the first Innovations in Knowledge Organisation conference. I think this was probably one of the best organised and most interesting events I’ve attended in the last few years.

The event started with an enthusiastic keynote from Patrick, introducing the topics we’d discuss over the next two days: knowledge management, taxonomies, linked data and search – a wide range of interlinked and interdependent themes. Next was a series of quick-fire PechaKucha sessions – 20 slides, 20 seconds each – a great way to introduce the audience to the topics under discussion, although slightly terrifying to deliver! I spoke on open source search, covering Elasticsearch & Solr and how to start a project using them, and somehow managed to draw breath occasionally. I think my fellow presenters also found it somewhat challenging, although nobody lost the pace completely! Next was a quick, interactive panel discussion (roving mics rather than a row of seats) that set the scene for how the event would work – reactive, informal and exciting, rather than the traditional series of audience-facing PowerPoint presentations, which don’t necessarily combine well with jetlag.

After a lunch showcasing Singapore’s multicultural heritage (I don’t think I’ve ever had pasta with Chinese peppered beef before, but I hope to again), we moved on to the first set of case studies. Each presenter had 6 minutes to sell their case study (my own was about how we helped Reed Specialist Recruitment build an open source search platform) and then attendees could choose which tables to join to discuss the cases further, for three 20-minute sessions. I had some great discussions, including hearing how a local government employment agency has used Solr. We then moved on to a ‘knowledge cafe’, with tables again divided up by topics chosen by the audience – so this really was a conference about what attendees wanted to discuss, not just what the presenters thought was important.

I was scheduled to deliver the keynote the next day, having been asked to speak on ‘The Future of Search’ – I chose to introduce some topics around Big Data and Streaming Analytics, and how search software might be used to analyse the huge volumes of data we might expect from the Internet of Things. I had some great feedback from the audience (although I’m pretty sure I inspired and confused them in equal measure) – perhaps Singapore was the right place to deliver this talk, as the government are planning to make it the world’s first ‘smart nation’ – handling data will be absolutely key to making this possible.

More case study pitches followed, and since I wasn’t delivering one myself this time I had a chance to listen to some of the studies. I particularly enjoyed hearing from Kia Siang Hock about the National Library Board Singapore’s OneSearch service, which allowed a federated search across tens of millions of items from many different repositories (e.g. books, newspaper articles, audio transcripts). The technologies used included Veridian, Solr, Vocapia for speech transcription and Mahout for building a recommendation system. In particular, Solr was credited for saving ‘millions of Singapore dollars’ in license fees compared to the previous closed source search system it replaced. Also of interest was Straits Knowledge’s system for capturing the knowledge assets of an organisation with a system built on a graph database, and Haliza Jailani on using named entity recognition and Linked Data (again for the National Library Board Singapore).

We then moved into the final sessions of the day, ‘knowledge clinics’ – like the ‘knowledge cafes’ these were table-based, informal and free-form discussions around topics chosen by attendees. Matt Moore then gave the last session of the day with an amusing take on Building Competencies, dividing KM professionals into individuals, tribes and organisations. Patrick and Maish Nichani then closed the event with a brief summary.

Singapore is a long way to go for an event, but I’m very glad I did. The truly international mix of attendees, the range of subjects and the dynamic and focused way the conference was organised made for a very interesting and engaging two days: I also made some great contacts and had a chance to see some of this beautiful city. Congratulations to Patrick, Maish and Dave Clarke on a very successful inaugural event and I’m looking forward to hearing about the next one! Slides and videos are already appearing on the IKO blog.

Free file filters, search & taxonomy tools from our old Google Code repository

Google’s Google Code service is closing down, in case you hadn’t heard, and I’ve just started the process of moving everything over to our Github account. This prompted me to take a look at what’s there, and there’s a surprising amount of open source code I’d forgotten about. So, here’s a quick rundown of the useful tools, examples and crazy ideas we’ve built over the years – perhaps you’ll find some of it useful – but please do bear in mind that we’re not officially supporting most of it!

  • Flax Basic is a simple enterprise search application built using Python and the Xapian search library. You can install this on your own Unix or Windows system to index Microsoft Office, PDF, RTF and HTML files and it provides a simple web application to search the contents of the files. Although the UI is very basic, it proved surprisingly popular among small companies who don’t have the budget for a ‘grown up’ search system.
  • Clade is a proof-of-concept classification system with a built-in taxonomy editor. Each node in the taxonomy is defined by a collection of words: as a document is ingested, if it contains these words then it is attached to the node. We’ve written about Clade previously. Again this is a basic tool but has proved popular and we hope one day to extend and improve it.
  • Flax Filters are a set of Python programs for extracting plain text from a number of common file formats – which is useful for indexing these files for search. The filters use a number of external programs (such as Open Office in ‘headless’ mode) to extract the data.
  • The Lucene Redis Codec is a (slightly crazy) experiment in how the Lucene search engine could store indexed data not on disk, but in a database – our intention was to see if frequently-updated data could be changed without Lucene noticing. Here’s what we wrote at the time.
  • There’s also a tool for removing fields from a Lucene index, a prototype web service interface with a JSON API for the Xapian search engine and an early version of a searchable database for historians, but to be honest these are all pre-alpha and didn’t get much further.
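To give a flavour of what a file filter does, here’s a minimal sketch of an HTML-to-text filter using only the Python standard library (this is an invented illustration, not the actual Flax Filters code, which shells out to external programs for formats such as Office and PDF):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects character data from HTML, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style> elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def html_to_text(html):
    """Return the visible text of an HTML document, whitespace-normalised."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.chunks).split())
```

The extracted plain text can then be handed straight to whatever indexing process you use.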

If you like any of these tools feel free to explore them further – but remember your hard hat and archaeology tools!


Posted in Technical

March 19th, 2015


Enterprise Search & Discovery 2014, Washington DC

Last week I attended Enterprise Search & Discovery 2014, part of the KMWorld conference in Washington DC. I’d been asked to speak on Turning Search Upside Down and luckily had the first slot after the opening keynote: thanks to all who came and for the great feedback. Slides are available to conference attendees and I’ll publish them more widely soon; the talk covered media monitoring, our Luwak library and how we successfully replaced Autonomy IDOL and Verity with a powerful open source solution for a Scandinavian monitoring firm.

Since ESSDC is co-located with KMWorld, Sharepoint Symposium and Taxonomy Bootcamp, it feels like a much larger event than the similar Enterprise Search Europe, although total numbers are probably comparable. It was clear to me that the event is aimed far more at a business than a technical audience, with most of the talks being high-level (and some being simply marketing pitches, which was a little disappointing). Mentions of open source search were common (from Dion Hinchcliffe’s use of it as an example of a collaborative community, to Kamran Kahn’s example of Apache Solr being used for very large scale search at the US National Archives). Unfortunately a lot of the presenters started with the ‘search sucks, everyone hates search’ theme (before explaining, of course, that their own solution would suck less), which I’m personally becoming a little tired of. If we as an industry continue pursuing this negative sentiment we’re unlikely to raise the profile of enterprise search: perhaps we should concentrate on more positive stories, as they certainly do exist.

I spent a lot of time networking with other attendees and catching up with some old contacts (a shout out to Miles Kehoe, Eric Pugh, Jeff Fried and Alfresco founder John Newton, great to see you all again). My favourite presentation was Dave Snowden’s fantastic and very funny debunking of knowledge management myths (complete with stories about London taxi drivers and a dig at American football) and I also enjoyed Raytion’s realistic case studies (‘no-one is searching for the sake of searching – except us [search integrators] of course’). Presentations I enjoyed somewhat less included Brainspace (who stressed Transparency as a key value, then when I asked if their software was thus open source, explained that they would love it to be so but then they wouldn’t be able to get any investment – has anyone told Elasticsearch?) and Hewlett Packard, who tried to tell us that their new API to the venerable IDOL search engine was ‘free software’ – not by any definition I’m aware of, sorry. Other presentation themes included graph/semantic search – maybe this is finally something we can consider seriously, many years after Tim Berners-Lee’s seminal paper [PDF].

Thanks to Information Today, Marydee Ojala and all others concerned for organising the event and making me feel so welcome.


Posted in events

November 12th, 2014


Clade – a freely available, open source taxonomy and autoclassification tool

One way to manage digital information is to classify it into a series of categories or a hierarchical taxonomy, and traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour-intensive, as they will change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process, and the taxonomy information used as metadata so that search results can be easily filtered by category, or automatically delivered to those interested in a particular area of the hierarchy.

We’ve been working on an internal project to create a simple taxonomy manager, which we’re releasing today in a pre-alpha state as open source software. Clade lets you import, create and edit taxonomies in a browser-based interface, and can then automatically classify a set of documents into the hierarchy you have defined, based on their content. Each taxonomy node is defined by a set of keywords, and the system can also suggest further keywords from documents attached to each node.

This screenshot shows the main Clade UI, with the controls:

A – dropdown to select a taxonomy
B – buttons to create, rename or delete a taxonomy
C – the main taxonomy tree display
D – button to add a category
E – button to rename a category
F – button to delete a category
G – information about the selected category
H – button to add a category keyword
I – button to edit a keyword
J – button to toggle the sense of a keyword
K – button to delete a keyword
L – suggested keywords
M – button to add a suggested keyword
N – list of matching document IDs
O – list of matching document titles
P – before and after document ranks

Clade is based on Apache Solr and the Stanford Natural Language Processing tools, and is written in Python and Java. You can run it on either Unix/Linux or Windows platforms – do try it and let us know what you think; we’re very interested in any feedback, especially from those who work with and manage taxonomies. The README file details how to download and install it.
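The core classification idea – attach a document to a node when its text contains enough of that node’s keywords – can be sketched in a few lines of Python. This is a simplified illustration only, not Clade’s actual code (which also uses Solr and the Stanford NLP tools); the example taxonomy and function name are invented:

```python
# Each taxonomy node is defined by a set of keywords.
taxonomy = {
    "politics": {"election", "parliament", "minister"},
    "sport": {"match", "goal", "tournament"},
}

def classify(text, taxonomy, threshold=1):
    """Return the nodes whose keyword sets overlap the document's tokens
    by at least `threshold` words."""
    tokens = set(text.lower().split())
    return [node for node, keywords in taxonomy.items()
            if len(keywords & tokens) >= threshold]
```

A real system would of course add stemming, stop-word removal and weighting, but the principle is the same.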

Outside the search box – when you need more than just a search engine

Core search features are increasingly a commodity – you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP – this all used to be a lot harder than it is now (unfortunately some vendors would have you believe this is still the case, which is reflected in their hefty price tags).
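To illustrate just how commoditised the basics have become, a toy inverted index with AND-style retrieval fits in a dozen lines of Python (a sketch only – real engines add proper tokenisation, ranking, compression and much more):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND semantics: intersect the posting sets for each query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()
```

Hooking something like this up behind an HTTP endpoint is exactly the kind of thing that used to require an expensive product.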

However we’re increasingly asked to develop features outside the traditional search stack, to make standard search more accurate and relevant, or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique for extracting entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part of Speech (POS) tagging tells you which words are nouns, verbs etc. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece – positive, negative or neutral, for example – very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you the context in which a word is being used (did you mean pen for writing or pen for livestock?).
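As a flavour of how the simplest lexicon-based sentiment analysers work, here’s a toy scorer in pure Python. Real systems use trained models and far larger lexicons; the word lists and function name here are invented for the example:

```python
import re

# Tiny illustrative lexicons - production systems use thousands of entries.
POSITIVE = {"great", "excellent", "love", "good"}
NEGATIVE = {"poor", "terrible", "hate", "bad"}

def sentiment(text):
    """Classify text as positive, negative or neutral by counting
    matches against the two word lists."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

This crude counting approach fails on negation (“not good”) and sarcasm, which is precisely where the commercial and statistical tools earn their keep.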

There are commercial offerings from companies such as Nstein and Lexalytics that offer some of these features. An increasing number of companies provide their services as pay-per-use APIs – for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP, which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics, and the Apache Mahout machine learning library allows these techniques to be scaled to very large data sets.

These techniques can be used to build systems that don’t just provide powerful and enhanced search, but automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box that is!

Search Solutions 2011 review

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search where different units are concerned: for example, a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown, including a postcode-based browser. The system is based on Apache Solr, and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need, as they’re expecting to index many more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.
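The unit problem John described is usually tackled by normalising quantities to a canonical unit at index time, so that 200°C, 392°F and 473.15K all index (and therefore match) as the same value. A rough sketch of the idea in Python – this is not max.recall’s implementation; the regex and function names are mine:

```python
import re

def to_kelvin(value, unit):
    """Convert a temperature to Kelvin from Celsius or Fahrenheit."""
    if unit in ("c", "°c", "celsius"):
        return value + 273.15
    if unit in ("f", "°f", "fahrenheit"):
        return (value - 32) * 5.0 / 9.0 + 273.15
    return value  # assume the value is already in Kelvin

# Longer unit names first so e.g. 'kelvin' isn't matched as bare 'k'.
UNIT_RE = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*(kelvin|celsius|fahrenheit|°c|°f|k)", re.I)

def normalise_temperatures(text):
    """Extract all temperatures from the text, converted to Kelvin."""
    return [round(to_kelvin(float(v), u.lower()), 2)
            for v, u in UNIT_RE.findall(text)]
```

Indexing the normalised values into a numeric field then lets a range query written in any unit find all matching documents.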

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently.) Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV programme recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system he worked on before joining Findwise. The day closed with a brief Fishbowl discussion which asked, amongst other things, what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.

London Enterprise Search Meetup – Databases vs. Search and Taxonomies

Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.

Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.

Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. Some interesting points here including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.

Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.

Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.


Posted in events

April 14th, 2011


Legal search is broken – can it be fixed with open source taxonomies?

I spent yesterday afternoon at the International Society for Knowledge Organisation’s Legal KnowHow event, a series of talks on legal knowledge and how it is managed. The audience was a mixture of lawyers, legal information managers, vendors and academics, and the talks came from those who are planning legal knowledge systems or implementing them. I also particularly enjoyed hearing from Adam Wyner from Liverpool University who is modelling legal arguments in software, using open source text analysis. You can see some of the key points I picked up on our Twitter feed.

What became clear to me during the afternoon is that search technology is not currently serving the needs of lawyers or law firms. The users want a simple Google-like interface (or think they do), the software is having trouble presenting results in context and the source data is large, complex and unwieldy. The software used for search is from some of the biggest commercial search vendors (legal firms seem to ‘follow the pack’ in terms of what vendor they select – unfortunately few of the large law firms seem to have even considered the credible open source alternatives such as Lucene/Solr or Xapian).

In many cases taxonomies were presented as the solution – make sure every document fits tidily into a hierarchy and all the search problems go away, as lawyers can simply navigate to what they need. All very simple in theory – however each big law firm and each big legal information publisher has their own idea of what this taxonomy should be.

After the final presentation I argued that this seemed to be a classic case where an open source model could help. If a firm or publisher were prepared to create an open source legal taxonomy (and to be fair, we’re only talking about 5000 entries or so – this wouldn’t be a very big structure) and let it be developed and improved collaboratively, they would themselves benefit from others’ experience, the transfer of legal data between repositories would be easier, and even the search vendors might learn a little about how lawyers actually want to search. The original creators would be seen as thought-leaders and could even license the taxonomy so it could not be rebadged and passed off as original by another firm or publisher.

However my plea fell on stony ground: law firms seem to think that their own taxonomies have inherent value (and thus should never be let outside the company) and they regard the open source model with suspicion. Perhaps legal search will remain broken for the time being.


Posted in events

November 11th, 2010