One way to manage digital information is to classify it into a series of categories or a hierarchical taxonomy. Traditionally this was done manually by analysts, who would examine each new document and decide where it should fit. Building and maintaining taxonomies can also be labour intensive, as they change over time (for a simple example, just consider how political parties change and divide, with factions appearing and disappearing). Search engine technology can be used to automate this classification process, and the taxonomy information can then be used as metadata, so that search results can be easily filtered by category or automatically delivered to those interested in a particular area of the hierarchy.
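As a rough illustration of filtering by category, here is a minimal Python sketch. It assumes each search result carries its position in the taxonomy as a slash-separated path in a `category` field. The field name, the paths and the sample documents are all invented for this example and are not taken from any particular search engine.

```python
# Hypothetical search results, each tagged with a taxonomy path as metadata.
results = [
    {"id": "doc1", "title": "Budget debate", "category": "politics/uk/labour"},
    {"id": "doc2", "title": "Party conference", "category": "politics/uk/conservative"},
    {"id": "doc3", "title": "Cup final", "category": "sport/football"},
]


def filter_by_category(results, prefix):
    """Keep results filed under the given node or any of its descendants."""
    return [
        r for r in results
        if r["category"] == prefix or r["category"].startswith(prefix + "/")
    ]


# Everything under the (assumed) "politics/uk" node, at any depth.
uk_politics = filter_by_category(results, "politics/uk")
print([r["id"] for r in uk_politics])
```

Matching on the path plus a trailing slash, rather than on the bare prefix, avoids accidentally matching a sibling node whose name merely starts with the same characters.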
We’ve been working on an internal project to create a simple taxonomy manager, which we’re releasing today in a pre-alpha state as open source software. Clade lets you import, create and edit taxonomies in a browser-based interface and can then automatically classify a set of documents into the hierarchy you have defined, based on their content. Each taxonomy node is defined by a set of keywords, and the system can also suggest further keywords from documents attached to each node.
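To make the idea of keyword-defined nodes concrete, here is a minimal sketch in plain Python. It is not how Clade itself works (Clade relies on Solr and the Stanford tools), and the taxonomy, the stopword list and the function names are all assumptions made up for illustration. A document is attached to every node that shares at least one keyword with it, and new keywords are suggested by simple term frequency.

```python
import re
from collections import Counter

# Hypothetical taxonomy: each node name maps to its defining keywords.
TAXONOMY = {
    "politics": {"election", "parliament", "minister"},
    "sport": {"football", "match", "goal"},
}

# A tiny illustrative stopword list, not a complete one.
STOPWORDS = frozenset({"the", "a", "of", "in", "and", "to"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, taxonomy=TAXONOMY):
    """Return the nodes that share at least one keyword with the text."""
    words = set(tokenize(text))
    return sorted(node for node, keywords in taxonomy.items() if words & keywords)


def suggest_keywords(docs, existing, limit=3):
    """Suggest the most frequent words in docs that are not already keywords."""
    counts = Counter(
        word
        for doc in docs
        for word in tokenize(doc)
        if word not in existing and word not in STOPWORDS
    )
    return [word for word, _ in counts.most_common(limit)]


print(classify("The minister scored a goal"))
```

A real system would score matches rather than treat any single shared keyword as a hit, but the data structure is the same: a node is just a name plus a set of terms.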
This screenshot shows the main Clade UI, with the controls:
A – dropdown to select a taxonomy
B – buttons to create, rename or delete a taxonomy
C – the main taxonomy tree display
D – button to add a category
E – button to rename a category
F – button to delete a category
G – information about the selected category
H – button to add a category keyword
I – button to edit a keyword
J – button to toggle the sense of a keyword
K – button to delete a keyword
L – suggested keywords
M – button to add a suggested keyword
N – list of matching document IDs
O – list of matching document titles
P – before and after document ranks
Clade is based on Apache Solr and the Stanford Natural Language Processing tools, and is written in Python and Java. You can run it on either Unix/Linux or Windows platforms. Do try it and let us know what you think; we’re very interested in any feedback, especially from those who work with and manage taxonomies. The README file details how to download and install it.
UPDATE: You can download a ZIP of Clade from here – pick the latest version.
I’ve uploaded a whitepaper I wrote a short while ago:
“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”
It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.
If you’re planning an enterprise search project and have no background in the technologies or principles involved, here are some tips to get you started. This isn’t going to be a definitive list so if you know more, please do comment.
There haven’t been a lot of books written on this area over the years, but more are appearing now (especially on open source options). Managing Gigabytes is a good, if slightly elderly, starting point on basic principles. For thoughts on search user interfaces try Peter Morville’s Search Patterns and for an application focus there’s the recent Search Based Applications. For those developing in the Lucene/Solr world there’s the classic (and recently updated) Lucene in Action and the related Solr 1.4 Enterprise Search Server and Building Search Applications: Lucene, LingPipe, and Gate.
Most people will (of course) start their research on the web, although sometimes it’s hard to find nuggets of real information amongst all the marketing. Wikipedia has a list of vendors, including open source solutions, and Avi Rappaport maintains the useful (although not completely up to date) Search Tools website. Some vendors and some open source projects provide FAQs and tutorials (for example the Lucene FAQ, Xapian and Sphinx documentation), which may also contain general information about search principles.
You might also consider joining discussion groups such as the popular LinkedIn Enterprise Search Engine Professionals or a local Meetup group. Training is another option – offered by some vendors and open source companies such as ourselves.
David Fishman of Lucid Imagination has blogged on how open source search is treated by the analyst community (you can even use his links to get hold of some of the reports mentioned for the usual price of your contact details). We can add to his list a report from the Real Story Group – and I hear Ovum will shortly release an updated report.
What I find most interesting about these analyst reports is how various vendors are subdivided: by target market, by size, or by how ‘complex’ their platform is. Open source solutions don’t always fit the categories. For example, Real Story Group list ‘Apache Project’ as a ‘specialised vendor’, which it really isn’t. Perhaps it’s time for some new categories in these analyst reports, such as a list of specialist open source integrators, linked with the available technologies such as Lucene, Xapian or Sphinx, combined with some data about likely costs.
If you’re considering a Lucene/Solr powered search solution, you may be interested in LucidWorks Enterprise, produced by our partners Lucid Imagination. They’ve taken Lucene/Solr and added a powerful admin GUI, ReST API, web spiders, file crawlers, database connectors, alerts, a clickthrough framework and more. All this comes with a range of excellent support options backed by the experts at Lucid.
If you’d like to know more, read this downloadable PDF or contact us for more information and a demo.
Peter Morville has created a Flickr collection of ‘search patterns’, showing the different kinds of search interfaces available. I can highly recommend you take a look if you’d like some good examples of clustering, faceted navigation, auto-suggest and interfaces for certain sectors such as e-commerce. We often find these concepts difficult to explain to customers without some real-world examples.