ECIR 2011 Industry day – part 2 of 2

Here’s the second writeup.

We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on the former, with a decoupled approach – i.e. the cache doesn’t automatically know what’s changed in the index. The approach is to give data in the cache a ‘time to live’ (TTL), after which it is refreshed – an acceptable approach as search engines don’t have a ‘perfect’ view of the web at any one point in time. As he mentioned, this method is less useful for ‘real-time’ data such as news.
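To make the TTL idea concrete, here is a minimal sketch in Python of a decoupled query-result cache. It is my own illustration, not code from the talk: the class name TTLCache, the fetch callback and the injectable clock are all assumptions made for the example.

```python
import time


class TTLCache:
    """Query-result cache whose entries expire after a fixed time to live.

    Hypothetical illustration: the cache never hears about index changes,
    it simply treats an entry as stale once it is older than the TTL.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the behaviour can be tested
        self._entries = {}      # query -> (results, time stored)

    def get(self, query, fetch):
        """Return cached results, calling fetch(query) if missing or expired."""
        now = self.clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]     # still within its time to live
        results = fetch(query)  # go back to the search engine
        self._entries[query] = (results, now)
        return results
```

A short TTL keeps results fresher at the cost of more trips to the index, which is why a single fixed value suits slowly changing pages better than news.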

Francesco Calabrese followed, talking about his work in the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices his group has looked at ‘digital footprints’ and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say a football match) from the points of origin of the attendees. This talk wasn’t really about search, although the data gathered would be useful in search applications with geolocation features.

Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.

I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved on to a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).

Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.

ECIR 2011 Industry Day – part 1 of 2

As promised here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related, queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.
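As a rough illustration of merging result lists from related queries, here is a short Python sketch. Note that it uses reciprocal rank fusion, a simple unsupervised technique I have swapped in for the example; it is not the learned approach Martin described, and the function name, the document ids and the constant k=60 are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by several queries rise to the top.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists for "jupiter's mass" and "mass of jupiter".
merged = reciprocal_rank_fusion([
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_c", "doc_a"],
    ["doc_b", "doc_a"],
])
```

A learned ranker would replace the fixed 1 / (k + rank) score with a function trained on relevance judgements.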

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.

After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.

ECIR 2011 overview

I spent part of last week at the 33rd European Conference on Information Retrieval in Dublin, as I had been asked to speak during the Industry Day (of which, more later – far too much useful information to cram into one blog post!). Arriving late afternoon on Wednesday I caught up with Olly Betts of Oligarchy, one of the core Xapian developers who’d travelled from New Zealand. Olly told me more about the Xapian projects running as part of Google’s Summer of Code – very exciting to hear that there were over 40 applicants this year for a limited number of slots.

We went on to the conference banquet at the Lyons Estate outside the city – which in some ways reminded me of Portmeirion – and caught up with people from Google Zurich amongst others. This was one of several fantastic venues organised by the Dublin team led by Cathal Gurrin (at Industry Day itself we were high above the city with a great view, and I heard good things about the Guinness Storehouse, the venue for the first day of the conference).

Thanks to all the team (especially Udo Kruschwitz and Tony Russell-Rose for organising Industry Day). I look forward to catching up with some of you at the next BCS IRSG Search Solutions event on November 16th.

April 26th, 2011

London Enterprise Search Meetup – Databases vs. Search and Taxonomies

Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.

Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.

Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. Some interesting points here including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.

Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.

Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.

April 14th, 2011

Perspectives on learning at Search Meetup Cambridge

Last night was the second Cambridge search meetup, held in a (rather noisy as it turned out) pub close to the river. It was great to see so many new faces from a wide range of backgrounds including bioinformatics, rare books and academic publishing.

First of the talks was from Tyler Tate of TwigKit, who described the typical search process as a ‘funnel’, narrowing the available options to an eventual conclusion. He told us how the original definition of search removed the user from the picture, and how, to improve things, we should make it easy to organise, annotate and compare search results so that both the user and the system itself can learn. His slides are available here.

After a short break we heard from Mike Taylor of Microsoft Research who led us through the history of ranking models, from the classic BM25, through ‘black box’ systems using machine learning methods including gradient descent and neural networks. He mentioned LambdaRank which was unfamiliar to most of us (some papers by Burges et al are available on the Microsoft site). Interestingly it seems that the focus at Microsoft has shifted back to probabilistic models and Mike showed examples including a system for predicting ‘real’ clicks on online adverts (as opposed to automatic clicks by web robots).
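For anyone who hasn’t met BM25, here is a minimal Python sketch of one common variant of the scoring function. It isn’t taken from Mike’s talk: the function name, the toy documents and the parameter defaults (k1=1.2, b=0.75) are my own assumptions, and it expects a non-empty list of pre-tokenised documents.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document in docs against the query terms."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, purely for illustration.
docs = [["mass", "of", "jupiter"], ["jupiter", "moons"], ["saturn", "rings"]]
scores = bm25_scores(["jupiter", "mass"], docs)
```

The k1 parameter controls how quickly repeated terms stop adding to the score, and b controls how strongly long documents are penalised. A learned ranker can take a score like this as just one of its input features.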

Thanks to our speakers and everyone who came; we hope to continue what is proving to be a popular series of events. Next is a gathering of those involved in open source search on Tuesday 3rd May – hope to see some of you there.

Events: Open source for government and search in Cambridge

We’ll be attending the Guardian’s Public Procurement Show on June 14th & 15th as part of the Open Government stand – with the recent release by the UK government Cabinet Office of a new IT strategy (here are some industry reactions) it will be interesting to see whether anyone still believes the FUD about open source in the face of the evidence.

We’re also organising another search meetup in Cambridge on April 5th, this time featuring two perspectives on learning, and will also be at a more informal gathering of open source search people on May 3rd.

April 1st, 2011

