Posts Tagged ‘ranking’

London Lucene/Solr Usergroup – Search Relevancy & Hacking Lucene with Doug Turnbull

Last week Doug Turnbull of US-based Open Source Connections visited the UK and spoke at our Meetup. His first talk was on Search Relevancy, an area we often deal with at Flax: how to tune a search engine to give results that our clients deem relevant, without affecting the results for other queries. Using a client project as an example, Doug talked about how he created a tool to record relevance judgements for a set of queries (or a ‘case’). The underlying Solr search engine could then be adjusted and the tool used to re-run the queries, showing any change in the position of the scored results. Slides and video of the talk are available – thanks to our hosts SkillsMatter for these.

The tool, Quepid, is a great way to allow non-developers to score search results – in most cases we have seen, if this kind of testing is done at all it is recorded using spreadsheets. The tests then need to be re-run manually and the scores updated, which can make the tuning process take far too long. This whole area is in need of some rigour and best practice, and to that end Doug is writing a book on Relevant Search, which we’re very much looking forward to.
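As a sketch of what such a tool automates, here is a minimal judgement-based scoring loop in Python. The data, ratings scale and function name are invented for illustration – this is not Quepid's actual code or API:

```python
# Toy sketch of spreadsheet-style relevance testing: judgements are
# recorded once, then compared against the engine's current results
# after each tuning change. All data here is invented.

def precision_at_k(results, judgements, k=10):
    """Fraction of the top-k results judged relevant for one query."""
    top = results[:k]
    if not top:
        return 0.0
    relevant = sum(1 for doc_id in top if judgements.get(doc_id, 0) > 0)
    return relevant / len(top)

# Judgements recorded by non-developers: doc id -> rating (0 to 3).
case = {
    "cheap flights": {"doc1": 3, "doc4": 2, "doc9": 0},
    "hotel paris":   {"doc2": 3, "doc7": 1},
}

# Results returned by the engine after a tuning change.
current_results = {
    "cheap flights": ["doc1", "doc9", "doc4"],
    "hotel paris":   ["doc7", "doc3"],
}

for query, judgements in case.items():
    score = precision_at_k(current_results[query], judgements, k=3)
    print(f"{query}: P@3 = {score:.2f}")
```

Re-running this after every configuration change gives the fast feedback loop that manual spreadsheet updates make so slow.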

Doug’s second talk was on Hacking Lucene for custom search results, during which he dissected how Lucene queries actually work and how custom scoring algorithms can be used to change search ranking. Although highly technical in parts – and as Doug said, one of the hardest ways to write Lucene code to influence ranking and thus relevance – it was a great window on Lucene’s low level behaviour. Again, slides and video are available.

Thanks to all who came and especially Doug for coming so far to present his talks!


Posted in Technical, events

June 11th, 2015


Search Solutions 2013, a review

Yesterday was the always interesting Search Solutions one day conference held by the BCS IRSG in London, a mix of talks on different aspects of search. The first presentation was by Behshad Behzadi of Google on Conversational Search, where he showed a speech-capable search interface that allowed a ‘conversation’ with the search engine – context being preserved – so the query “where are Italian restaurants in Chelsea” followed by “no I prefer Chinese” would correctly return results about Chinese restaurants. The demo was impressive and we can expect to see more of this kind of technology as smartphone adoption rises. Wim Nijmeijer of Coveo followed with details of how their own custom connectors to a multitude of repositories could enable Complex enterprise search delivered in a day. This of course assumes that no complex mapping of fields or schemas from the source to the search engine index is necessary, which I suspect it often is – I’m not alone in being slightly suspicious of the supposed timescale. Nikolaos Nanas from Thessaly in Greece then presented on Adaptive Information Filtering: from theory to practice, which I found particularly interesting as it described filtering documents against a user’s interest, with the latter modelled by an adaptive, weighted network – he showed the Noowit personalised magazine application as an example. With over 1000 features per user and no language-specific requirements this is a powerful idea.

After a short break we continued with a talk by Henning Rode on CV Search at TextKernel. He described a simple yet powerful UI for searching CVs (resumes) with autosuggest and automatic field recognition (type in “Jav” and the system suggests “Java” and knows this is a programming language or skill). He is also working on systems to autogenerate queries from job vacancies using heuristics. We’ve worked in the recruitment space ourselves so it was interesting to hear about their approach, although the technical detail was light. Following Henning was Dermot Frost talking about Information Preservation and Access at the Digital Repository of Ireland, and their use of open source technology including Solr and Blacklight to build a search engine covering the huge variety of content types, file formats and metadata standards across the items they are trying to digitally preserve. Currently this is a relatively small collection of data but they are planning to scale up over the next few years: this talk reminded me a little of last year’s talk by Emma Bayne of the UK’s National Archives.

After lunch we began a session named Understanding the User, beginning with Filip Radlinski of Microsoft Research. He discussed Sensitive Online Search Evaluation and how interleaving the results of two rankers is a powerful technique for avoiding bias. Next was Mounia Lalmas of Yahoo! Labs on what makes An Engaging Click (although unfortunately I had to pop out for a short while so I missed most of what I am sure was a fascinating talk!). Mags Hanley was next on Understanding users’ search intent, with examples drawn from her work at TimeOut – the three main lessons being to know the content in context, the time of year and the users’ mental model in context. Interestingly she showed how the most popular facets differed across TimeOut’s various international sites – in Paris the top facet was perhaps unsurprisingly ‘cuisine’, while in London it was ‘date’.
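Team-draft interleaving is one common form of the interleaving technique Filip described. A toy Python sketch (not his actual implementation – real evaluations also need click logging and statistical tests):

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Merge two rankings team-draft style: a coin toss decides which
    ranker picks first each round, and each ranker contributes its
    highest-ranked result not already in the merged list. Clicks on
    the merged list are credited to the team that picked each result,
    so neither ranker gains a positional advantage."""
    rng = random.Random(seed)
    merged, team = [], []
    rankings = {"A": ranking_a, "B": ranking_b}
    while any(doc not in merged for r in rankings.values() for doc in r):
        first = "A" if rng.random() < 0.5 else "B"
        for side in (first, "B" if first == "A" else "A"):
            pick = next((d for d in rankings[side] if d not in merged), None)
            if pick is not None:
                merged.append(pick)
                team.append(side)
    return merged, team

merged, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4"])
print(list(zip(merged, team)))
```

Counting which team's picks attract more clicks then tells you which ranker users actually prefer, without ever showing a deliberately degraded results page.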

After another short break we continued with Helen Lippell’s talk on Enterprise Search – how to triage problems quickly and prescribe the right medicine – her five main points being analyse user needs, fix broken content, focus on quick wins in the search UI, make sure you are able to tweak the search engine itself in a documentable fashion, and remember the importance of people and process. Her last point, ‘if search is a political football, get an outsider perspective’, is of course something we would agree with! Next was Peter Wallqvist of Ravn Systems on Universal Search and Social Networking, where he focussed on how to allow users to interact directly with enterprise content items by tagging, sharing and commenting – so as to derive a ‘knowledge graph’ showing how people are connected by their relationships to content. We’ve built systems in the past that have allowed users to tag items in the search result screen itself, so we can agree on the value of this approach. Our last presenter was Kristian Norling of Findwise on Reflections on the 2013 Enterprise Search Survey – some more positive news this year, with budgets for search increasing and 79% of respondents indicating that finding information is of high importance for their organisation. Although most respondents still have less than one full time staff member working on search, Kristian made the very good point that recruiting just one extra person would thus give them a competitive advantage. Perhaps, as he says, we’ve now reached a tipping point for the adoption of properly funded enterprise search, regarded as an ongoing journey rather than a ‘fire and forget’ project.

The day finished with a ‘fishbowl’ session, during which there was a lot of discussion of how to foster links between the academic IR community and industry, then the BCS IRSG AGM and finally a drinks reception – thanks to all the organisers for a very interesting and enlightening day and we look forward to next year!

Autonomy & HP – a technology viewpoint

I’m not going to comment on the various financial aspects of the recent news about HP’s write-down of the value of its Autonomy acquisition – others are able to do this far better than me – but I would urge anyone interested to re-read the documents Oracle released earlier this year. However, I am going to write about the IDOL technology itself (I’d also recommend Tony Byrne’s excellent post).

Autonomy’s ability to market its technology has never been in doubt: aggressive and fearless, it painted IDOL as unique and magical, able to understand the meaning of data in multiple forms. However, this has never been true; computers simply don’t understand ‘meaning’ like we do. IDOL’s foundation was just a search engine using Bayesian probabilistic ranking; although most other search technologies use the vector space model, there are a few other examples of this approach. Muscat, a company founded a few years before and literally across the hall from Autonomy in a Cambridge incubator, grew to a £30m business with customers including Fujitsu and the Daily Telegraph newspaper. Sadly Muscat was a casualty of the dot-com years, but it is where the founders of Flax first met and worked together on a project to build a half-billion-page web search engine.

Another even less well-known example is OmniQ, eventually acquired and subsequently shelved by Sybase. Digging in the archives reveals some familiar-sounding phrases such as “automatically capture and retrieve information based on concepts”.

Originally developed at Muscat, the open source library Xapian also uses Bayesian ranking and we’ve used this successfully to build systems for the Financial Times, Newspaper Licensing Agency and Tait Electronics. Recently, Apache Lucene/Solr version 4.0 has introduced the idea of ‘pluggable’ ranking models, with one option being the probabilistic BM25. It’s important to remember, though, that Bayesian ranking is only one way to approach a search problem, and in many cases it is simply unnecessary.

It certainly isn’t magic.
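To underline the point, the core of BM25 fits in a dozen lines of Python. This is a toy sketch – the example corpus is invented, and real engines such as Lucene and Xapian add many refinements (positional data, field weighting, efficient inverted indexes):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25: probabilistic term weighting with document-length
    normalisation. `corpus` is a list of tokenised documents and `doc`
    is one of them. Rare terms get higher idf; term frequency saturates
    via k1; b penalises longer-than-average documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = tf[term]
        norm = 1 - b + b * len(doc) / avgdl
        score += idf * f * (k1 + 1) / (f + k1 * norm)
    return score

corpus = [["quick", "brown", "fox"], ["lazy", "dog"], ["brown", "fox", "jumps"]]
print(bm25_score(["fox"], corpus[0], corpus))
```

No understanding of ‘meaning’ anywhere – just term statistics and careful normalisation.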

Open source search evening – ElasticSearch, Xapian and GSoC

Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, hierarchical ‘query language’ and automatically distributing the search load. Richard’s conclusion was that although elasticsearch is not yet as mature as Apache Solr it is certainly a project to consider: development is rapid, though documentation is not easy to find. We’ll watch this project with interest.
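For readers who haven't seen it, that ‘query language’ is a JSON DSL: queries are nested objects sent to a REST endpoint. A minimal sketch in Python – the index and field names are invented, and the exact DSL has evolved across elasticsearch versions:

```python
import json

# A simple full-text query against a hypothetical 'articles' index.
# Nesting deeper objects (bool, filters, aggregations) under "query"
# is what makes the DSL hierarchical.
query = {
    "query": {
        "match": {"title": "open source search"}
    },
    "size": 10,
}

# This body would be POSTed to something like
# http://localhost:9200/articles/_search; here we just serialise it.
body = json.dumps(query)
print(body)
```

The same structure-by-nesting approach also covers filtering and faceting, which is much of what makes elasticsearch quick to prototype with.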

Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.

Thanks to Richard for organising the evening and to all who came.

ECIR 2011 Industry Day – part 1 of 2

As promised here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results using gradient descent. One application he suggested was merging lists of results from different, though related, queries: for example, in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide the best overall set. Some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.
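To give a flavour of the pairwise approach to learning to rank by gradient descent – much simpler than, but in the spirit of, the methods Martin described – here is a toy Python sketch. The features, judgement pairs and learning rate are all invented:

```python
import math

def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Learn linear feature weights from (better, worse) document pairs
    by gradient descent on the logistic pairwise loss
    log(1 + exp(-(s_better - s_worse))), so that documents judged
    better end up scoring higher."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [bi - wi for bi, wi in zip(better, worse)]
            margin = sum(wj * dj for wj, dj in zip(w, diff))
            grad_coeff = -1.0 / (1.0 + math.exp(margin))  # dloss/dmargin
            for j in range(n_features):
                w[j] -= lr * grad_coeff * diff[j]
    return w

# Two invented features per document, e.g. (text match score, click rate).
pairs = [((0.9, 0.2), (0.3, 0.1)),
         ((0.8, 0.5), (0.4, 0.4))]
weights = train_pairwise(pairs, n_features=2)

def score(features):
    return sum(wj * fj for wj, fj in zip(weights, features))
```

Production systems replace the linear model with ensembles or neural networks and train on millions of judgements, but the core idea – descend the gradient of a pairwise preference loss – is the same.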

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve the problem of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.
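Solr users get just such an ‘explain view’ built in: adding debugQuery=true to a select request returns a per-document breakdown of how each score was computed. A small Python sketch of building such a request – the host, collection name and query are illustrative:

```python
from urllib.parse import urlencode

def explain_query_url(base_url, query, rows=10):
    """Build a Solr select URL that asks for score explanations."""
    params = {
        "q": query,
        "rows": rows,
        "fl": "id,score",       # return ids and scores
        "debugQuery": "true",   # include the per-document explain tree
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

url = explain_query_url("http://localhost:8983/solr/artworks", "leonardo")
print(url)
# The response's debug section then shows, per document, the term
# weights and boosts that produced its position in the results.
```

Reading these explain trees side by side for a misbehaving query is often the quickest way to see why one document outranks another.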

After a short break we had Tyler Tate, who had also spoken recently in Cambridge, so I won’t repeat his slides here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.

Perspectives on learning at Search Meetup Cambridge

Last night was the second Cambridge search meetup, held in a (rather noisy as it turned out) pub close to the river. It was great to see so many new faces from a wide range of backgrounds including bioinformatics, rare books and academic publishing.

First of the talks was from Tyler Tate of TwigKit, who described the typical search process as a ‘funnel’, narrowing the available options to an eventual conclusion. He told us how the original definition of search removed the user from the picture, and how to improve things we should make it easy to organise, annotate and compare search results to allow both the user and the system itself to learn. His slides are available here.

After a short break we heard from Mike Taylor of Microsoft Research, who led us through the history of ranking models, from the classic BM25 through to ‘black box’ systems using machine learning methods including gradient descent and neural networks. He mentioned LambdaRank, which was unfamiliar to most of us (some papers by Burges et al are available on the Microsoft site). Interestingly, it seems that the focus at Microsoft has shifted back to probabilistic models, and Mike showed examples including a system for predicting ‘real’ clicks on online adverts (as opposed to automatic clicks by web robots).

Thanks to our speakers and everyone who came and we hope to continue what is proving to be a popular series of events. Next is a gathering of those involved in open source search on Tuesday 3rd May – hope to see some of you there.