Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.
Milad Shokouhi of Microsoft Research started us off showing us how he’s worked on query trend analysis for Bing: he showed us how some queries are regular, some spike and go and some spike and remain – and how these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues which is worth a look.
Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” with usually do the same things, such as regular user testing and employing a specialist internal search team.
Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interoperability with other vocabularies – some great points on the need for thesauri to break down language barriers, help retrieval in enterprise situations where techniques such as PageRank aren’t so useful and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.
Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.
The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.
After a short break the Cambridge Search Meetup returned last night with our usual mix of presentations, questions, networking, beer and snacks. We had a few issues with the projector and cables (one of these is on the shopping list for next time) so thanks to both presenters and audience for their patience!
First up was Liang Shen with a description of Journal Selector, a system for helping those publishing academic papers to find the correct journals to approach. The system allows one to copy and paste a chunk of a paper to a website and find which journals best match the subject matter, based on what they have published in the past. Running on the Amazon EC2 cloud the service indexes journals from feeds, HTML webpages and other sources, processes and stores this data in Amazon’s Hadoop-compatible database, indexes it with Apache Solr and then presents the results via the Drupal CMS. The results are impressive, allowing users to see exactly on what basis the system has recommended a journal to approach. You can see the presentation slides here.
Next was Rich Marr, who bravely offered to live-code a demonstration of his low-cost prototyping methodology for startups needing both NoSQL data storage and search across this data. In only 20 lines or so of code he showed us how to use Node.js to build a simple server that could accept messages (over Telnet, although HTTP or even IMAP would be as easy), store them in a CouchDB database and index them for searching (using a different message) with Elasticsearch. Rich’s demo prompted a lively discussion of how commoditized and componentized search technology is becoming, with open source components that allow one to build a prototype search engine in minutes.
Thanks to both our speakers – and the Meetups continue, with Rich Marr’s own London Open Source Search Social meeting on Tuesday 23rd October, and in Cambridge the Data Insights Meetup where I’ll be talking on November 1st.
This morning the largest open source search project, Apache Lucene/Solr, released a new version with a raft of new features. We’ve been advising clients to consider version 4.0 for several months now, as the alpha and beta versions have become available, and we know of several already running this version on live sites. Here are a few highlights:
- Solr Cloud – a collection of new features for scalability and high availability (either on your own servers or on the Cloud), integrating Apache ZooKeeper for distributed configuration management.
- More NoSQL features in case you’re planning to use Solr as a primary data store, including a transaction log
- A new web administration interface (including Solr Cloud features)
- New spatial search features including polygon support
- General performance improvements across the board (for example, fuzzy queries are 1-200 times faster!)
- Lucene now has pluggable codecs for storing index data on disk – a potentially powerful technique for performance optimisation; we’ve already been experimenting with storing updatable fields in a NoSQL database
- Lucene now has pluggable ranking models, so you can for example use BM25 probabilistic ranking, previously only available in search engines such as HP Autonomy and the open source Xapian.
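
For readers who haven’t met BM25, here is a minimal sketch of the scoring function in plain Python. It is not Lucene’s implementation: the function name, the defaults of `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and documents are assumed to be already tokenised into lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalised by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The two parameters are the interesting part: `k1` controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly long documents are penalised.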
The new release has been several years in the making and is a considerable improvement on the previous 3.x version – related projects such as elasticsearch will also benefit. There’s also a new book, Solr in Action, just out to coincide with this release. Exciting times ahead!
Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).
Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview for example). The GSA is also not a particularly cheap option as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure license fees for reasonably sized collections of a few million documents – and that’s for two years, after which time you have to buy it again. Not surprisingly some people have migrated to other solutions.
However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.
It will be interesting to see if this version of the GSA is the last…
Last night our US partners Lucid Imagination announced that LucidWorks, their packaged and supported version of Apache Lucene/Solr, is available on Microsoft’s Azure cloud computing service. It seems like only a few weeks since Amazon announced their own CloudSearch system and no doubt other ‘search as a service’ providers are waiting in the wings (we’re going to need a new acronym as SaaS is already taken!). At first the combination of a search platform based on open source Java code with Microsoft hosting might seem strange, and it raises some interesting questions about the future of Microsoft’s own FAST Search technology – is this final proof that FAST will only ever be part of SharePoint and never a standalone product? However with search technology becoming more and more of a commodity this is a great option for customers looking for search over relatively small numbers of documents.
Lucid’s offering is considerably more flexible and full-featured than Amazon’s, which we hear is pretty basic with a lack of standard search features like contextual snippets and a number of bugs in the client software. You can see the latter in action at Runar Buvik’s excellent OpenTestSearch website. With prices for the Lucid service ranging from free for small indexes, this is certainly an option worth considering.
Amazon have just launched a cloud-based search service, which promises a ‘fully managed search service in the cloud’ – and it certainly looks impressive, with auto-scaling built in. You simply create a service, upload documents as JSON or XML and then perform searches. For cases where you need to search publicly available data this offers a great way to avoid having to install and integrate any search software – of course it won’t be so popular if you’re worried about where your data actually is, or other complications such as the Patriot Act.
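As a rough idea of what the upload step looks like, here is a small Python sketch that turns a dictionary of documents into a JSON batch of ‘add’ operations. The key names (`type`, `id`, `version`, `lang`, `fields`) are my recollection of the batch format at launch and should be treated as an assumption to check against Amazon’s documentation; the function name and the sample documents are invented for illustration, and nothing here actually talks to the service.

```python
import json


def build_add_batch(docs, version=1, lang="en"):
    """Return a JSON string with one "add" operation per document.

    `docs` maps a document id to a dictionary of field names and values.
    The key names below are assumed, not taken from Amazon's specification.
    """
    batch = []
    for doc_id, fields in docs.items():
        batch.append({
            "type": "add",       # operation to perform on this document
            "id": doc_id,        # unique document identifier
            "version": version,  # assumed: newer versions replace older ones
            "lang": lang,        # assumed: language code for text processing
            "fields": fields,    # the searchable content itself
        })
    return json.dumps(batch, indent=2)


if __name__ == "__main__":
    print(build_add_batch({
        "doc1": {"title": "Search Solutions", "body": "Notes from the day"},
        "doc2": {"title": "Lucene 4.0", "body": "Release highlights"},
    }))
```

The resulting string is what you would send to the service’s document upload endpoint, however you choose to make that request.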
As you might expect, some people are already offering services based around CloudSearch (we’d be happy to do the same - just ask!) and there’s a demo of searching Wikipedia available. I’m not sure who SmackBot is but I’m slightly wary of reading any Wikipedia articles it’s had something to do with…
Of course searching Wikipedia is nothing new and I sometimes wish for a different choice of source material for search demos.
One thing that seems clear is that with the rise of cloud-based search options (here’s another from our partners Lucid Imagination, based on Apache Lucene/Solr) the cost and complication of ‘simple’ search projects could fall dramatically, applying further pressure to those companies selling closed source search engines for frankly unrealistic prices. Amazon’s offering, with their huge experience in cloud-based services, has the potential to be a game changer for the search market.
We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company who take a pro-active approach to document management – instead of taking the default SharePoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon AWS Cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with a sub-second response. Their staff can now find letters, contracts, emails and designs quickly via a web interface.
C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.
I’ve been reading the revised Open Source, Open Standards and ReUse: Government Action Plan – it’s surprising (and heartening) to see this has existed in one form or another since as far back as 2004.
The key changes for this version are:
- suppliers have to show evidence they’ve considered open source options – hopefully this will be more than a quick trawl through SourceForge
- ‘shadow license costs’ have to be shown in calculations to take account of previous purchases of ‘perpetual’ licenses – apparently in some cases this could make software license fees for a project appear as zero!
- all purchases have to be on the basis of re-use across the government sector – so no need to pay again if a system moves to the cloud in the future
This all sounds great for the open source community; let’s also hope that increased openness in government means that we’ll be able to check the Action Plan is actually being followed!
By the way a great example of open source in action on government data is They Work For You, which cleans up Hansard and makes it more accessible – search is powered by Xapian.