Last week I attended one of my favourite annual search events, Search Solutions, held at the British Computer Society’s base in Covent Garden. As usual it was a great chance to see what’s new in the linked worlds of web, intranet and enterprise search, and this year several of the presenters focused on semantic search.
Peter Mika of Yahoo! started us off with a brief history of semantic search, including how misplaced expectations have led to a general lack of adoption. However, the large web search companies have made significant progress over the years, leading to shared standards for semantically marking up web content and some large collections of knowledge, which allow them to display rich content for certain queries, e.g. an actor’s biography shown to the right of the usual search results. He suggested the next step is to better understand queries, as most of the work to date has been on understanding documents. Christopher Semturs of Google followed with a description of their efforts in this space: Google’s Knowledge Graph contains 40 billion facts about 530 million entities, built in part by converting web pages directly (he noted that some badly structured websites can contain the most interesting and rare knowledge). He reminded us of the importance of context and showed some great examples of queries that are still hard to answer correctly. Katja Hofmann of Microsoft then described some ways in which search engines might learn directly from user interactions, including some wonderfully named methodologies such as Counterfactual Reasoning and the Contextual Bandit. She also mentioned their continuing work on Learning to Rank with the open source Lerot software.
Next up was our own Tom Mortimer, presenting our study comparing the performance of Apache Solr and Elasticsearch – you can see his slides here. While there are few functional differences between the two, Tom found that Solr can support three times the query rate of Elasticsearch. Iadh Ounis of the University of Glasgow followed, describing another open source engine, Terrier, which, although mainly focused on academic research, now contains some cutting-edge features including the aforementioned Learning to Rank and near real-time search.
The next session featured Dan Jackson of UCL describing the challenges of building website search across a complex set of websites and data, a talk similar to one he gave at an earlier event this year. Next was our ex-colleague Richard Boulton, describing how the Gov.uk team use metrics to tune their search capability (based on Elasticsearch). Interestingly, most of their metric data is drawn from Google Analytics, as heavy use of caching means they have few useful query logs.
Jussi Karlgren of Gavagai then described how they have built a ‘living lexicon’ of text in several languages, allowing for the representation of the huge volume of new terms that appear on social media every week. They have also worked on multi-dimensional sentiment analysis and visualisations: I’ll be following these developments with interest, as they echo some of the work we have done in media monitoring. Richard Ranft of the British Library then showed us some of the ways search is used to access the BL’s collection of 6 million audio tracks, including very early wax cylinder recordings – they have so much content it would take you 115 years to listen to it all! The last presentation of the day was by Jochen Leidner of Thomson Reuters, who showed some of the R&D projects he has worked on, including work with legal content and mining Twitter for trading signals.
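Incidentally, that 115-year figure checks out as a back-of-the-envelope calculation – here’s a quick sketch, assuming an average track length of about ten minutes (my assumption, not a number from the talk):

```python
# Rough sanity check of the "115 years to listen to it all" claim.
# Assumption (not from the talk): average track length of ~10 minutes.
tracks = 6_000_000
avg_minutes_per_track = 10

total_minutes = tracks * avg_minutes_per_track
years = total_minutes / (60 * 24 * 365.25)  # minutes -> hours -> days -> years
print(f"{years:.0f} years")  # roughly 114 years of continuous listening
```

At ten minutes per track that comes out at around 114 years of non-stop listening, so the figure is entirely plausible.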
After a quick fishbowl discussion and a glass of wine the event ended for me, but I’d like to thank the BCS IRSG for a fascinating day and for inviting us to speak – see you next year!