Posts Tagged ‘xapian’

Free file filters, search & taxonomy tools from our old Googlecode repository

Google’s Google Code service is closing down, in case you hadn’t heard, and I’ve just started the process of moving everything over to our GitHub account. This prompted me to take a look at what’s there, and there’s a surprising amount of open source code I’d forgotten about. So, here’s a quick rundown of the useful tools, examples and crazy ideas we’ve built over the years. Perhaps you’ll find some of it useful, but please do bear in mind that we’re not officially supporting most of it!

  • Flax Basic is a simple enterprise search application built using Python and the Xapian search library. You can install this on your own Unix or Windows system to index Microsoft Office, PDF, RTF and HTML files and it provides a simple web application to search the contents of the files. Although the UI is very basic, it proved surprisingly popular among small companies who don’t have the budget for a ‘grown up’ search system.
  • Clade is a proof-of-concept classification system with a built-in taxonomy editor. Each node in the taxonomy is defined by a collection of words: as a document is ingested, if it contains these words then it is attached to the node. We’ve written about Clade previously. Again this is a basic tool but has proved popular and we hope one day to extend and improve it.
  • Flax Filters are a set of Python programs for extracting plain text from a number of common file formats – which is useful for indexing these files for search. The filters use a number of external programs (such as Open Office in ‘headless’ mode) to extract the data.
  • The Lucene Redis Codec is a (slightly crazy) experiment in how the Lucene search engine could store indexed data not on disk, but in a database – our intention was to see if frequently-updated data could be changed without Lucene noticing. Here’s what we wrote at the time.
  • There’s also a tool for removing fields from a Lucene index, a prototype web service interface with a JSON API for the Xapian search engine and an early version of a searchable database for historians, but to be honest these are all pre-alpha and didn’t get much further.
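
As a rough illustration of the word-based classification idea described for Clade above, here is a minimal Python sketch. It is not Clade's actual code: the `TaxonomyNode` class, the `tokenize` and `classify` functions, and the rule that every one of a node's words must appear in the document are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """Hypothetical taxonomy node: a name plus the words that define it."""
    name: str
    words: set[str] = field(default_factory=set)


def tokenize(text: str) -> set[str]:
    """Lower-case the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def classify(text: str, nodes: list[TaxonomyNode]) -> list[str]:
    """Return the names of nodes whose defining words all occur in the text.

    Assumption: a node matches only if every one of its words is present.
    Nodes with no words never match.
    """
    tokens = tokenize(text)
    return [node.name for node in nodes if node.words and node.words <= tokens]


taxonomy = [
    TaxonomyNode("search", {"search", "index"}),
    TaxonomyNode("finance", {"bank", "loan"}),
]
print(classify("How to index PDF files for search", taxonomy))
```

A real system would probably want stemming and some way to weight or negate words, but the set-containment test captures the basic idea of attaching a document to a node.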

If you like any of these tools feel free to explore them further – but remember your hard hat and archeology tools!

Posted in Technical

March 19th, 2015

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model there are a few other examples of this approach: Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the Bayesian BM25. It’s important to remember though that Bayesian ranking is only one way to approach a search problem and in many cases, simply unnecessary.

It certainly isn’t magic.
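
To show how unmagical this kind of probabilistic ranking is, here is a minimal Python sketch of the widely published Okapi BM25 formula. It is not the code used by Xapian, Lucene or IDOL: the function name `bm25_scores`, the naive whitespace tokenisation, the smoothed IDF variant and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    Tokenisation is naive lower-cased whitespace splitting (an assumption).
    """
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "open source search with xapian",
    "closed source enterprise software",
    "xapian xapian ranking",
]
print(bm25_scores("xapian ranking", docs))
```

The score is just a sum over query terms of a rarity weight multiplied by a saturating term-frequency factor: arithmetic on word counts, with no understanding of meaning involved.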

Apache Lucene & Solr version 4.0 released, a giant leap forward for open source search

This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here are a few highlights:

  • Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache Zookeeper for distributed configuration management.
  • More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
  • A new web administration interface (including Solr Cloud features)
  • New spatial search features including polygon support
  • General performance improvements across the board (for example, fuzzy queries are 100-200 times faster!)
  • Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation. We’ve already been experimenting with storing updatable fields in a NoSQL database.
  • Lucene now has pluggable ranking models, so you can for example use BM25 Bayesian ranking, previously only available in search engines such as HP Autonomy and the open source Xapian.

The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!

An open day on open source search from Sirius & Flax

We spent Friday at the riverside offices of Sirius Corporation, our support partners, for the first and hopefully not the last of their Open Days on open source enterprise search. We were lucky to have Mike Davis, a very well-known and highly experienced analyst, to open the talks – despite suffering from flu he gave an engaging talk on why open source enterprise search software should be your first port of call, and how you should only consider closed source options when you need particular features they provide.

We then gave a quick Introduction to Open Source Search, detailing the various packages available (from Apache Lucene/Solr to Xapian and Sphinx) and showing a quick Solr-powered demo we’d built to search some pages from the BBC Music website. Using the programmer’s first choice for an example query (the ever reliable ‘foo*’) we discovered the wonderfully named Original Rabbit Foot Spasm Band – which interestingly you can’t find via the BBC’s own site search engine due to lack of wildcard support.

Andrew Savory, Sirius’ CTO and Apache Foundation member, then gave a presentation on what an Apache project actually is and how best to engage with an open source community – very useful for those considering open source for the first time. The morning finished with a delicious barbecue on the riverbank provided by Sirius. We thought the event went very well and we’d love to confirm the rumour that this will become a regular event. Thanks to all at Sirius for organising and hosting the day and we look forward to returning.

Searching for (and finding) open source in the UK Government

There have been some very encouraging noises recently about increased use of open source software by the UK Government: for example we’ve seen the creation of an Open Source Procurement Toolkit by the Cabinet Office, which lists Xapian and Apache Lucene/Solr as alternatives to the usual closed source options. The CESG, the “UK Government’s National Technical Authority for Information Assurance”, has clarified its position on open source software, which has led to the Cabinet Office dispelling some of the old myths about security and open source. We know that the Cabinet Office’s ‘skunkworks’, the Government Digital Service, are using Solr for several of their projects. Francis Maude MP was recently in the USA with some of the GDS team and visited, amongst others, our US partners Lucid Imagination.

The British Computer Society have helped organise a series of Awareness Events for civil servants and I’m glad to be speaking at the first of these next Tuesday 21st February on open source search – hopefully this will further increase the momentum and make it even more clear that a modern Government needs to consider this modern, flexible and economically scalable approach to software.

Posted in News, events

February 17th, 2012

Flax’s 10th birthday!

Today marks 10 years since we formed Flax (originally as Lemur Consulting Ltd.). We had an idea that search based on open source software was going to be increasingly important (indeed, our original business model was consultancy based on Xapian) and I think we’ve been proved right over the decade. Today, in the depths of a recession, we’re seeing significant growth in the business and some fascinating opportunities: the sector is still going through rapid change and it will be very interesting to see what the next few years bring.

Thanks to all of those who have worked with us and for us over the last decade – we look forward to the next ten years in this exciting field!

Posted in events

July 27th, 2011

Whitepaper – Why you should be considering open source search

I’ve uploaded a whitepaper I wrote a short while ago:

“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”

It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.

Open source search evening – ElasticSearch, Xapian and GSoC

Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, hierarchical ‘query language’ and automatically distributing the search load. Richard’s conclusion was that although elasticsearch is not as mature as Apache Solr, it is certainly a project to consider; however, development is rapid and documentation is not easy to find. We’ll watch this project with interest.

Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.

Thanks to Richard for organising the evening and to all who came.

ECIR 2011 overview

I spent part of last week at the 33rd European Conference on Information Retrieval in Dublin, as I had been asked to speak during the Industry Day (of which, more later – far too much useful information to cram into one blog post!). Arriving late afternoon on Wednesday I caught up with Olly Betts of Oligarchy, one of the core Xapian developers who’d travelled from New Zealand. Olly told me more about the Xapian projects running as part of Google’s Summer of Code – very exciting to hear that there were over 40 applicants this year for a limited number of slots.

We went on to the conference banquet at the Lyons Estate outside the city – which in some ways reminded me of Portmeirion – and caught up with people from Google Zurich amongst others. This was one of several fantastic venues organised by the Dublin team led by Cathal Gurrin (at Industry Day itself we were high above the city with a great view, and I heard good things about the Guinness Storehouse, the venue for the first day of the conference).

Thanks to all the team (especially Udo Kruschwitz and Tony Russell-Rose for organising Industry Day). I look forward to catching up with some of you at the next BCS IRSG Search Solutions event on November 16th.

Posted in events

April 26th, 2011

Open Source action in UK government

I’ve been reading the revised Open Source, Open Standards and ReUse: Government Action Plan – it’s surprising (and heartening) to see this has existed in one form or another since as far back as 2004.

The key changes for this version are:

  • suppliers have to show evidence they’ve considered open source options – hopefully this will be more than a quick trawl through SourceForge
  • ‘shadow license costs’ have to be shown in calculations to take account of previous purchases of ‘perpetual’ licenses – apparently in some cases this could make software license fees for a project appear as zero!
  • all purchases have to be on the basis of re-use across the government sector – so no need to pay again if a system moves to the cloud in the future

This all sounds great for the open source community; let’s also hope that increased openness in government means that we’ll be able to check the Action Plan is actually being followed!

By the way, a great example of open source in action on government data is They Work For You, which cleans up Hansard and makes it more accessible – search is powered by Xapian.

Posted in News

February 2nd, 2011
