Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.
Milad Shokouhi of Microsoft Research started us off with his work on query trend analysis for Bing: some queries are regular, some spike and go and some spike and remain – and these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human-centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues, which is worth a look.
Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” with usually do the same things, such as regular user testing and employing a specialist internal search team.
Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interoperability with other vocabularies – some great points on the need for thesauri to break down language barriers, to help retrieval in enterprise situations where techniques such as PageRank aren’t so useful, and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.
Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.
The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.
The diary is beginning to fill up – here are a few events we’ll be involved with over the next few months. Firstly we’re running another Cambridge Search Meetup on October 17th – this is an informal gathering of people interested in search. We have one great talk already on ‘Making search accessible to low cost apps’ and another to be confirmed, plus snacks, beer and even some live music afterwards. If you’re in Cambridge or nearby (it’s only an hour or so from London by train) do come along.
We’ll be briefly visiting the trade stands at FIBEP 2012 on October 4th in the historic town of Krakow, Poland – this is part of a major media monitoring event, the 45th FIBEP Congress. We’re looking forward to meeting companies in the media monitoring sector and talking about some of our projects in that area.
On November 29th we’re planning to attend Search Solutions 2012 at the BCS in Covent Garden, London – this is an excellent one-day event on all the technical aspects of search. You can read my review of last year’s event to find out more about what to expect.
There’s sure to be more to come!
We’re happy to announce we’ve just finished a successful project for a division of the Australian Associated Press to replace a closed source search engine with a considerably more powerful open source solution. You can read the press release here.
As our client had a large investment in stored searches, which represent a client’s interests and were defined in the query language of their previous search engine, we first had to build a modified version of Apache Lucene that replicated exactly this syntax. I’ve previously blogged about how we did this. However this wasn’t the only challenge: search engines are designed to be good at applying a few queries to a very large document collection, not necessarily at applying tens of thousands of stored queries to every single new document. For media monitoring applications this kind of performance is essential as there may be hundreds of thousands of news articles to monitor every day. The system we’ve built is capable of applying tens of thousands of stored queries every second.
With the rapid increase in the volume of content that media monitoring companies have to check for their clients – today’s news isn’t just in print, but online, in social media and indeed multimedia – it may be that open source software is the only way to build monitoring systems that are economically scalable, while remaining accurate and flexible enough to deliver the right results to clients.
We’ve been working on a client project where we needed to replace the dtSearch closed source search engine, which doesn’t perform that well at scale in this case. As the client has significant investment in stored queries (it’s for a monitoring application) they were keen that the new engine spoke exactly the same query language as the old – so we’ve built a version of Apache Lucene to replace dtSearch. There are a few other modifications we had to do as well, to return such things as positional information from deep within the Lucene code (this is particularly important in monitoring as you want to show clients where the keywords they were interested in appeared in an article – they may be checking their media coverage in detail, and position on the page is important).
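To illustrate what we mean by positional information, here is a minimal sketch in plain Python. It is not the Lucene code from the project: the function names `index_positions` and `find_keywords` are made up for this example, and the tokenisation (runs of word characters, lowercased) is an assumption. It records, for each keyword, the word index and character offset of every occurrence in an article.

```python
import re
from collections import defaultdict


def index_positions(text):
    """Map each lowercased word to a list of (word_index, char_offset) pairs.

    Hypothetical helper for illustration only; a real engine stores
    positions in its index rather than rescanning the text.
    """
    positions = defaultdict(list)
    for word_index, match in enumerate(re.finditer(r"\w+", text)):
        positions[match.group().lower()].append((word_index, match.start()))
    return positions


def find_keywords(text, keywords):
    """Return the positions of each keyword; an empty list means no match."""
    positions = index_positions(text)
    return {kw: positions.get(kw.lower(), []) for kw in keywords}


article = "Acme launches new car. Critics say the Acme car is fast."
hits = find_keywords(article, ["acme", "car", "bicycle"])
for keyword, found in hits.items():
    print(keyword, found)
```

With offsets like these a monitoring front end can highlight each keyword in place, or tell a client whether the mention was in the headline or buried near the end.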
First, we developed a new Lucene Analyzer that speaks the same syntax as dtSearch, allowing us to index text input. On the search side we have a Lucene QueryParser that shares this syntax. To make it easier to use we’ve wrapped the whole lot in a modified Solr server. As we needed some features of very recent Lucene code, our modifications are based on a patch to Lucene trunk (and so the source code isn’t for the faint hearted – if you need it let us know, but we’re not currently providing it for download).
We’re not sure if there’s anyone else out there who needs an open source alternative to dtSearch – but in case there is we’ve provided a downloadable WAR file with the latest Solr executables in our downloads area, including a brief README file.
More generally, what this project demonstrates is that even if you have significant investment in your existing search infrastructure it is entirely possible to move to an open source alternative, which may be faster and will almost certainly be more economically scalable. Does anyone else have a search engine they’d like to replace?
We’re working with a number of clients on media monitoring solutions, which are a special case of search application (we’ve worked on this previously for Durrants). In standard search, you apply a single query to a large number of documents, expecting to get a ranked list of documents that match your query as a result. However in media monitoring you need to search each incoming document (for example, a news article or blog post) with many queries representing what the end user wants to monitor – and you need to do this quickly as you may have tens or hundreds of thousands of articles to monitor in close to real time (Durrants have over 60,000 client queries to apply to half a million articles a day). This ‘backwards’ search isn’t really what search engines were designed to do, so performance could potentially be very poor.
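The ‘backwards’ shape of the problem can be sketched in a few lines of Python. This is a deliberately simplified illustration, not how any of our systems are implemented: it assumes each stored query is just a set of terms that must all appear in the article, and the names `tokenize`, `match_stored_queries` and the client identifiers are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return the set of distinct words in it."""
    return set(re.findall(r"\w+", text.lower()))


def match_stored_queries(article, stored_queries):
    """Return the ids of stored queries whose terms all occur in the article.

    stored_queries maps a query id to a set of required terms (assumed
    AND-only; real query languages also have phrases, proximity and so on).
    """
    tokens = tokenize(article)
    return [qid for qid, terms in stored_queries.items() if terms <= tokens]


stored_queries = {
    "client-a": {"electric", "car"},
    "client-b": {"fashion", "week"},
    "client-c": {"car"},
}
article = "New electric car unveiled in Geneva"
matched = match_stored_queries(article, stored_queries)
print(matched)
```

Note that the loop runs over the queries, not over a document collection: the article is tiny, and the cost grows with the number of stored queries.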
There are several ways around this problem: for example in most cases you don’t need to monitor every article for every client, as they will have told you they’re only interested in certain sources (for example, a car manufacturer might want to keep an eye on car magazines and the reviews in the back page of the Guardian Saturday magazine, but doesn’t care about the rest of the paper or fashion magazines). However, pre-filtering queries in this way can be complex especially when there are so many potential sources of data.
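As a rough sketch of this kind of pre-filtering, assume each stored query comes with a list of the sources the client cares about. The names here (`build_source_index`, `candidate_queries`, the query and source identifiers) are invented for the example. The idea is to build a lookup from source to query ids once, so that only the relevant queries are run against each incoming article.

```python
from collections import defaultdict


def build_source_index(subscriptions):
    """Invert {query_id: [sources]} into {source: {query_ids}}."""
    by_source = defaultdict(set)
    for query_id, sources in subscriptions.items():
        for source in sources:
            by_source[source].add(query_id)
    return by_source


def candidate_queries(by_source, article_source):
    """Return the sorted query ids that should run for this source."""
    return sorted(by_source.get(article_source, set()))


# Hypothetical subscriptions: a car manufacturer and a fashion retailer.
subscriptions = {
    "carmaker-brand": ["car-magazine", "guardian-saturday-magazine"],
    "retailer-brand": ["fashion-magazine", "guardian-saturday-magazine"],
}
by_source = build_source_index(subscriptions)
print(candidate_queries(by_source, "car-magazine"))
print(candidate_queries(by_source, "guardian-saturday-magazine"))
```

The complexity mentioned above shows up in keeping the subscription data accurate: every new source has to be mapped to the right clients before the filter can be trusted.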
We’ve recently managed to develop a method for searching incoming articles using a brute-force approach based on Apache Lucene which in early tests is performing very well – around 70,000 queries applied to a single article in around a second on a standard MacBook. On suitable server hardware this would be even faster – and of course you have all the other features of Lucene potentially available, such as phrase queries, wildcards and highlighting. We’re looking forward to being able to develop some powerful – and economically scalable – media monitoring solutions based on this core.
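For a back-of-envelope feel for what that figure means, here is a small capacity estimate. It is an illustration rather than a benchmark: it takes the roughly one second per article from our early test and the half a million articles a day mentioned above, and it assumes articles arrive evenly through the day and that each worker handles one article at a time. The function name `workers_needed` is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def workers_needed(articles_per_day, seconds_per_article):
    """Estimate how many single-threaded workers keep up with the daily load.

    Assumes evenly spread arrivals and no overhead beyond matching time.
    """
    busy_seconds = articles_per_day * seconds_per_article
    return math.ceil(busy_seconds / SECONDS_PER_DAY)


estimate = workers_needed(articles_per_day=500_000, seconds_per_article=1.0)
print(estimate)
```

Real traffic is bursty, so a production system would need headroom on top of any estimate like this.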
Media monitoring is not a traditional search application: for a start, instead of searching a large number of documents with a single query, a media monitoring application must search every incoming news story with potentially thousands of queries, searching for words and terms relevant to client requirements. This can be difficult to scale, especially when accuracy must be maintained – a client won’t be happy if their media monitors miss relevant stories or send them news that isn’t relevant.
We’ve been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant ‘false positives’, the new system is now 95% correct).
The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. The new system scales easily and cost effectively.
As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.