Details of search events in 2012 are already beginning to appear; here are a few to start with:
- 1-5 April 2012 – European Conference on Information Retrieval (ECIR) in Barcelona, Spain. An academic conference featuring new developments in IR.
- 7-10 May 2012 – Lucid Imagination’s Lucene Revolution in Boston, USA. The largest conference on open source search – this event has a great buzz as the Lucene/Solr community continues to grow.
- 30-31 May 2012 – Enterprise Search Europe in London, after a successful first event last year. Great for those planning or working on enterprise search projects.
More to come as we hear about them – we’ll also be running another Cambridge Search Meetup soon.
On the twelfth day of (Search) Christmas my inbox brought to me:
Twelve users searching,
Eleven pages found,
Ten facets shown,
Nine Search Meetups,
Eight entity extractors,
Seven SOLR servers,
Six Xapian patches,
Five Open Source,
Four cloud apps,
Three Lucid partners,
Two big acquisitions,
And a Mike Lynch on board at HP.
Have a great Christmas and New Year from everyone at Flax.
James Alexander of the Open University talked first on the Access to Video Assets project, a prototype system that looked at preservation, digitisation and access to thousands of TV programmes originally broadcast by the BBC. James’ team have worked out an approach based on open source software – storing programme metadata and video assets in a Fedora Commons repository, indexing and searching using Apache Solr, authentication via Drupal – that is testament to the flexibility of these packages (some of which are being used in non-traditional ways – for example Drupal is used in a ‘nodeless’ fashion). He showed the search interface, which allowed you to find the exact points within a long video where particular words are mentioned and play the video directly in a pop-up window. I’d seen this talk before (here’s a video and slides from Lucene Eurocon) but what I hadn’t grasped is how Solr is used as a mediation layer between the user and what can be some very complex data around the video asset itself (subtitles, rights information, format information, scripts etc.). As he mentioned, search is being used as a gateway technology to effective re-use of this huge archive.
Udo Kruschwitz was next with a brief treatment of his ongoing work on automatically extracting domain knowledge and using this to improve search results (for example see the ‘Suggestions’ on the University of Essex website) – he showed us some of the various methods his team have tried to analyse query logs, including Ant Colony Optimisation (modelling ‘trails’ of queries that can be reinforced by repeat visits, or ‘fade’ over time as they are less used). I liked the concept of developing a ‘community’ search profile where individual search profiles are hard to obtain – and how this could be simply subdivided (so for example searchers from inside a university might have a different profile to those outside). The key idea here is that all these techniques are automatic, so the system is continually evolving to give better search suggestions and hints. Udo and his team are soon to release an open source adaptive search framework to be called “Sunny Aberdeen” which we look forward to hearing about.
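To make the ‘trails’ idea a little more concrete, here is a minimal Python sketch of how query reformulations might be reinforced and faded. It is only an illustration of the general principle, not the method Udo’s team uses: the `QueryTrails` class, its `deposit` and `evaporation` parameters and the sample queries are all assumptions made up for this example.

```python
from collections import defaultdict


class QueryTrails:
    """Toy pheromone-style model of query reformulations (illustrative only)."""

    def __init__(self, deposit=1.0, evaporation=0.1):
        self.deposit = deposit          # weight added per observed reformulation
        self.evaporation = evaporation  # fraction of weight lost per time step
        self.trails = defaultdict(lambda: defaultdict(float))

    def observe(self, session):
        """Reinforce the trail between each consecutive pair of queries."""
        for prev, nxt in zip(session, session[1:]):
            if prev != nxt:
                self.trails[prev][nxt] += self.deposit

    def evaporate(self):
        """Fade every trail, so unused reformulations lose weight over time."""
        for targets in self.trails.values():
            for nxt in targets:
                targets[nxt] *= 1.0 - self.evaporation

    def suggest(self, query, limit=3):
        """Return the strongest follow-up queries for `query`, strongest first."""
        targets = self.trails.get(query, {})
        ranked = sorted(targets.items(), key=lambda item: (-item[1], item[0]))
        return [nxt for nxt, _ in ranked[:limit]]


# Hypothetical query-log sessions, invented for this example.
trails = QueryTrails()
trails.observe(["library", "library opening hours"])
trails.observe(["library", "library opening hours"])
trails.observe(["library", "library catalogue"])
trails.evaporate()
print(trails.suggest("library"))
```

Running `evaporate()` on a schedule (say, once a day) would let rarely used suggestions drop away, while a separate `QueryTrails` instance per user group would be one simple way to approximate the subdivided ‘community’ profiles.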
The evening ended with networking and a pint or two in traditional fashion – thanks to both our speakers and to all who came, from as far afield as Milton Keynes, Essex and Luton. The group now has 70 members and we’re building an active and friendly local community of search enthusiasts.
Core search features are increasingly a commodity – you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP – this all used to be a lot harder than it is now (unfortunately some vendors would have you believe this is still the case, which is reflected in their hefty price tags).
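As a rough illustration of how little code the core idea needs, here is a toy inverted index in Python. It is a sketch only: a real engine such as Solr or Xapian adds ranking, stemming, persistence and much more, and the function names and sample documents below are invented for this example.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents containing every query token (AND)."""
    tokens = tokenize(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


# Hypothetical sample documents, for illustration only.
docs = {
    1: "Open source search with Apache Solr",
    2: "Enterprise search products and their price tags",
    3: "Indexing scripts for open source software",
}
index = build_index(docs)
print(search(index, "open source"))
print(search(index, "search"))
```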
However, we’re increasingly asked to develop features outside the traditional search stack, to make this standard search a lot more accurate/relevant or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique to extract entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part of Speech (POS) tagging tells you which words are nouns, verbs etc. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece – positive, negative or neutral for example, very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you the context a word is being used in (did you mean pen for writing or pen for livestock?).
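The ‘feed entities back in as metadata’ step can be sketched in a few lines of Python. Note the large simplification: this uses a fixed lookup table (a gazetteer) where a real system would use a trained NER model, and the table contents, field names and sample document are all hypothetical.

```python
import re

# Hypothetical gazetteer: surface form -> entity type. A real system would use
# a trained NER model rather than a fixed lookup table.
GAZETTEER = {
    "Oracle": "ORGANISATION",
    "Endeca": "ORGANISATION",
    "Barcelona": "LOCATION",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return sorted (entity, type) pairs whose surface form appears in the text."""
    found = set()
    for name, entity_type in gazetteer.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.add((name, entity_type))
    return sorted(found)


def enrich(doc):
    """Return a copy of the document with entities added as per-type metadata fields."""
    enriched = dict(doc)
    for name, entity_type in extract_entities(doc["text"]):
        field = "entities_" + entity_type.lower()  # assumed field naming scheme
        enriched.setdefault(field, []).append(name)
    return enriched


# Hypothetical document, invented for this example.
doc = {"id": "news-1", "text": "Oracle announced it had bought Endeca."}
print(enrich(doc))
```

The enriched document could then be sent to the indexer as usual, with the extra fields available for faceting or filtering.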
There are commercial offerings from companies such as Nstein and Lexalytics that offer some of these features. An increasing number of companies provide their services as APIs, where you pay-per-use – for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics. The Apache Mahout machine learning library allows for these techniques to be scaled to very large data sets.
These techniques can be used to build systems that don’t just provide powerful and enhanced search, but automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box that is!
We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company who take a pro-active approach to document management – instead of taking the default SharePoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon Web Services (AWS) cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with a sub-second response. Their staff can now find letters, contracts, emails and designs quickly via a web interface.
C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.
I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.
The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where for example a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.
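The unit mismatch in the patent example can be illustrated with a small Python sketch that normalises temperatures to Kelvin before comparing them. This is only an assumption about one possible approach, not a description of how the max.recall plugin works; the regular expression and function names are invented for this example.

```python
import re

# Matches values such as "200°C", "392 °F" or "473.15 K".
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°\s*C|°\s*F|K)\b")


def to_kelvin(value, unit):
    """Convert a temperature in Celsius, Fahrenheit or Kelvin to Kelvin."""
    unit = unit.replace("°", "").strip().upper()
    if unit == "C":
        return value + 273.15
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return value
    raise ValueError("unsupported unit: " + unit)


def temperatures_in_kelvin(text):
    """Extract every temperature in the text, normalised to Kelvin (2 d.p.)."""
    return [
        round(to_kelvin(float(number), unit), 2)
        for number, unit in TEMPERATURE.findall(text)
    ]


# Hypothetical query and patent text, invented for this example.
query = temperatures_in_kelvin("a plastic with a melting point of 200°C")
patent = temperatures_in_kelvin("the polymer melts at 392 °F")
print(query, patent)
```

If the normalised value were stored in its own numeric field at index time, a query given in any of the three units could be matched against it, ideally with a small tolerance rather than exact equality.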
Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.
After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.
The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently.) Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV programme recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.
It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.
The theme of Big Data continued at the next conference I attended, the first Enterprise Search Europe held in London. There was a good mix of presentations ranging from the academic to the practical, my favourite probably being the talk by Martin Belam and a colleague about using Solr to dynamically generate content for the new Guardian Books site. I was lucky enough to be able to talk about the real business benefits of open source search along with one of our customers, Stephen Wicks, CTO of Gorkana Group, which drew some interesting questions. We also ran a joint Meetup on the Monday evening, bringing together Enterprise Search Cambridge and Enterprise Search London.
There did seem to be a rather negative spin on search from many presenters – saying that search technology is misunderstood, more costly than expected, rarely works and hasn’t seen much recent innovation. Some of this is true – but I see this as an opportunity rather than a problem. There is more focus on the world of search now than before due to some high-profile acquisitions; people are questioning the value and capability of search technology. Those of us working at the cutting edge, delivering real working solutions, should perhaps take this opportunity to say that yes, it can be done, at a sensible cost, and it can deliver real business benefit. Perhaps as we move further into the world of Big Data we’ll realise the true value of effective search.
It’s been an interesting and busy few weeks this autumn – starting with Lucene Eurocon in Barcelona. ‘Big Data’ was a main theme, with some great presentations including the keynote from Grant Ingersoll and the talk from Eric Baldeschwieler of Hortonworks, showing how Lucene fits with other Apache projects such as Hadoop, Mahout and HBase. I also enjoyed the presentations from Andrzej Bialecki on a portable index format for Lucene, Jan Høydahl of Cominvent AS on the Solr Update Chain and James Alexander of the Open University on building a Solr-powered search of their video archives. Luckily this year the presentations were videoed – so I can catch up on the presentations I missed – you’ll also be able to see me talk about our recent work with Reed Specialist Recruitment.
Of course, one of the major reasons for attending an event like this is the networking and talks outside the main event, and it was great to catch up with others in the field – one meeting between a number of us with an interest in pipelining and data conditioning led to the creation of an informal group to discuss how we might better share ideas, code and best practices.
While we were at the conference the announcement came that search vendor Endeca had been bought by Oracle – and yes, this is also probably about Big Data. These are fascinating times – is search becoming the enabling technology for a revolution in how we deal with digital information?
We’re pleased to announce our work with Reed Specialist Recruitment, one of the UK’s largest recruitment companies, where we helped them implement an Apache Solr-powered application to allow their 3000+ staff to search for and match candidates to jobs. We built an innovative indexing framework, a configuration tool and a performance monitoring system for Reed, and the system launched on time and under budget, a great testament to the flexibility and power of this open source software. The new system responds in under a second – a massive improvement on the previous response time of several minutes. You can read the press release here.
If you’d like to hear more I’ll be giving a presentation on the project at Lucene Eurocon in Barcelona tomorrow – Wednesday 19th October at 1.30 p.m. – slides and a video will be online after the event.
If you can’t make it to Barcelona I’ll also be talking in London, on the business benefits of open source search, at around 10am on Tuesday 25th October with our client Stephen Wicks, CTO of Gorkana Group as part of Enterprise Search Europe – there are still tickets available and you can even get a 20% discount if you join the Cambridge or London Enterprise Search Meetups, who are hosting a joint event on the Monday evening of the conference.
Our customer Cambridge Intellectual Property announced yesterday their new API for a collection of 55 million patents – 48 million more than Google Patents. It’s great to see a Cambridge company innovating in this space, especially as the service is powered by Apache Solr (we’ve given them some small assistance with configuring and tuning this software over the last few months).
The API, available on the Boliven website, offers a REST-based service and returns patent data in JSON or XML – so users can easily integrate patent data with their own applications. It can also return PDFs or summaries of the selected patents. In addition, the API will allow users to search and query Boliven’s database of 45+ million science literature documents including journal publications and medical device trials. That’s around 100 million items in total.
Like the Guardian’s Open Platform which I wrote about previously, this is a great example of open source search technology as a platform for new delivery methods – showing how effective (and economical) it can be at this large scale.
It didn’t take me long to find my own small contribution to the patent landscape.