Archive for the ‘events’ Category
We’ve been helping to organise a new conference to be held in London this October, Enterprise Search Europe. This two-day event promises to give a ‘European perspective on the technology, selection, implementation and optimisation of enterprise-scale search’ and features speakers from 3i plc, Logica, The Guardian and a number of search providers such as Findwise, Funnelback and ourselves (I’ll be talking on ‘Building a Strong Business Foundation with Open Source Search’ on the second day).
It’s going to be a busy time as I’m also chairing a panel on the first day and helping run the evening reception, which is co-hosted by the London and Cambridge Search Meetups – this is likely to be one of the largest Search Meetups ever and is sure to be a fascinating evening, featuring speakers from the conference in an informal setting (i.e., a pub!).
Hope to see some of you there.
The Cambridge Enterprise Search Meetup last night featured Francis Rowland of the European Bioinformatics Institute and Rob Stacey of TrueKnowledge, in a newly refurbished venue. Thanks to all those who came and it was good to meet some new faces.
Francis talked about how search user interfaces should try not to restrict the user’s ‘flow’ of activity, as search is after all only a means to an end. Among the wealth of material he mentioned was the Endeca User Interface Design Pattern Library and what is sure to be a very useful upcoming book, Search Analytics for Your Site.
Rob told us about how TrueKnowledge provides a semantic question answering system – trying to understand the goal(s) of someone asking the system a question such as “is Madonna single?”. He also mentioned how this kind of technology might be applied to an enterprise environment, for example to answer questions like “has the invoice for last Thursday’s job been paid?”. Rob’s talk sparked off a very active Q&A session, with the audience raising issues such as how TrueKnowledge’s method might be applied to languages other than English and how to model the trustworthiness of their sources, which include Wikipedia.
Francis’ slides are now online – with some great sketchnotes of Rob’s talk as well! Thanks to both our speakers.
I spent yesterday evening at the British Computer Society on the panel of an event organised by the Open Source Specialist Group, nominally discussing the skills required to build Content Management Systems (CMS) using open source software (OSS). We heard a lot about the features and advantages of CMS such as Joomla, Drupal and Plone and the document management system Alfresco, and I contributed some details of Apache Lucene/Solr and Xapian, which can be used in concert with all of these systems (and are usually available as plug-in modules).
We also considered how best to encourage the further use of OSS within the UK government, and I’ve tried to list some of the suggestions that were made – this is in no way a complete list, but it’s a start.
- Look at what has been done with OSS in other countries in the government sector – e.g. the PloneGov initiative. A lot of this knowledge and expertise should be transferable.
- Publicise current use within government – we all know that OSS is already being used on government websites and intranets, but if this can be more widely known it will encourage further use of OSS within the sector. We hear that there are already ‘skunkworks’ teams in government using open source and open standards – make sure we hear more about what they build.
- Support the open source projects themselves – this could be by contributing code developed within government back to OSS projects, or by supporting the open source community in other ways – for example, funding the creation of better documentation, or making it easier to run open source conferences (perhaps with the help of local government).
- Improve the procurement process to better understand open source as a viable alternative and to ease its adoption (for example, many open source companies are smaller than closed source vendors and thus less able to engage in lengthy and expensive procurement rounds).
- Understand that comparing OSS to a closed source product is often like comparing apples to oranges – OSS provides a highly flexible toolkit where the user chooses what features they want, as opposed to a closed source product where feature sets are fixed by the vendor. During procurement, simple ‘check box’ lists of required features are thus not always applicable.
- Listen more to OSS experts and bring them into government to help educate and inform.
We’ve recently been forging links with the UK’s larger open source software community and have joined the Open Source Consortium. Another interesting organisation is Guildfoss, who have asked us to speak at an event on 9th June at the British Computer Society’s offices in London, discussing the skills necessary for building content management systems (search being an important part of this).
Guildfoss are also organising the ‘open government’ stand at the SmartGov Live show on June 14th-15th (part of the Guardian’s Public Procurement Show), where we’ll be talking about and demonstrating a range of solutions based on open source search, including LucidWorks Enterprise. Do let us know if you’re attending the show and would like to meet up.
We’re also helping with a new search event to be held in London in October – Enterprise Search Europe. One of the major themes of this event will be open source enterprise search and there are some fascinating presentations and workshops lined up.
Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, hierarchical ‘query language’ and automatically distributing the search load. Richard’s conclusions were that although elasticsearch is not as mature as Apache Solr it is certainly a project to consider; however, development is rapid and documentation is not easy to find. We’ll watch this project with interest.
Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.
Thanks to Richard for organising the evening and to all who came.
Here’s the second writeup.
We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on the former, with a decoupled approach – i.e. the cache doesn’t automatically know what’s changed in the index. The approach is to give data in the cache a ‘time to live’ (TTL), after which it is refreshed – an acceptable approach as search engines don’t have a ‘perfect’ view of the web at any one point in time. As he mentioned, this method is less useful for ‘real-time’ data such as news.
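To make the TTL idea concrete, here is a minimal sketch of a decoupled query-result cache. It is not the system Flavio described: the `TTLCache` class, the `search` callback and the injectable `clock` are all hypothetical names chosen for illustration, and a real web-scale cache would also need eviction and size limits.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Query-result cache whose entries expire after a fixed time to live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be checked without sleeping
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, query: str, search: Callable[[str], List[str]]) -> List[str]:
        now = self.clock()
        entry = self._entries.get(query)
        # The cache never asks the index what changed (decoupled):
        # it trusts an entry until its time to live has passed.
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]
        results = search(query)  # missing or expired: query the index again
        self._entries[query] = (now, results)
        return results
```

A short TTL keeps results fresher at the cost of more index queries, which is why this trade-off suits slowly changing pages better than news.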
Francesco Calabrese followed, talking about his work in the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices his group has looked at ‘digital footprints’ and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say a football match) from the points of origin of the attendees. This talk wasn’t really about search, although the data gathered would be useful in search applications with geolocation features.
Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.
I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved onto a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).
Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.
As promised here’s a writeup of the day itself. I’ve split this into two parts.
The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here although it may be a while before we see practical applications in enterprise search.
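As a rough illustration of learning to rank by gradient descent, the sketch below fits a linear scoring function from pairs of documents where one is known to be better than the other. This is a generic pairwise logistic-loss approach and an assumption on my part, not the recipe Martin presented; the function names, features and hyperparameters are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (better doc, worse doc) features


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear relevance score: dot product of weights and document features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(pairs: List[Pair], n_features: int,
                   learning_rate: float = 0.1, epochs: int = 200) -> List[float]:
    """Stochastic gradient descent on the loss log(1 + exp(-(s_better - s_worse)))."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = score(weights, better) - score(weights, worse)
            # Derivative of the loss with respect to diff.
            grad = -1.0 / (1.0 + math.exp(diff))
            for i in range(n_features):
                weights[i] -= learning_rate * grad * (better[i] - worse[i])
    return weights
```

Once trained, sorting candidate results by `score` gives the learned ordering, which is one way results from several related queries could be merged into a single list.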
Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.
Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.
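An ‘explain view’ can be as simple as keeping each term’s contribution alongside the total score. The sketch below assumes a toy scoring scheme (term count multiplied by a per-term weight); it is not Artfinder’s ranking, and the function names and data shapes are hypothetical.

```python
from typing import Dict, List, Tuple


def score_with_explanation(query_terms: List[str],
                           doc_term_counts: Dict[str, int],
                           term_weights: Dict[str, float]
                           ) -> Tuple[float, Dict[str, float]]:
    """Return the total score and the contribution of each query term."""
    contributions: Dict[str, float] = {}
    for term in query_terms:
        count = doc_term_counts.get(term, 0)
        weight = term_weights.get(term, 1.0)  # unweighted terms count once
        contributions[term] = count * weight
    return sum(contributions.values()), contributions


def explain_view(query_terms: List[str],
                 docs: Dict[str, Dict[str, int]],
                 term_weights: Dict[str, float]) -> List[str]:
    """Rank documents and describe how each position was calculated."""
    rows = []
    for doc_id, counts in docs.items():
        total, parts = score_with_explanation(query_terms, counts, term_weights)
        rows.append((total, doc_id, parts))
    rows.sort(key=lambda r: (-r[0], r[1]))  # best score first, ties by id
    lines = []
    for position, (total, doc_id, parts) in enumerate(rows, start=1):
        detail = " + ".join(f"{t}={v:.2f}" for t, v in parts.items())
        lines.append(f"{position}. {doc_id} score={total:.2f} ({detail})")
    return lines
```

Seeing the per-term breakdown next to each position makes it much easier to spot which weight is pushing an unexpected result up the list.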
After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.
We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.
I spent part of last week at the 33rd European Conference on Information Retrieval in Dublin, as I had been asked to speak during the Industry Day (of which, more later – far too much useful information to cram into one blog post!). Arriving late afternoon on Wednesday I caught up with Olly Betts of Oligarchy, one of the core Xapian developers who’d travelled from New Zealand. Olly told me more about the Xapian projects running as part of Google’s Summer of Code – very exciting to hear that there were over 40 applicants this year for a limited number of slots.
We went on to the conference banquet at the Lyons Estate outside the city – which in some ways reminded me of Portmeirion – and caught up with people from Google Zurich amongst others. This was one of several fantastic venues organised by the Dublin team led by Cathal Gurrin (at Industry Day itself we were high above the city with a great view, and I heard good things about the Guinness Storehouse, the venue for the first day of the conference).
Thanks to all the team (especially Udo Kruschwitz and Tony Russell-Rose for organising Industry Day). I look forward to catching up with some of you at the next BCS IRSG Search Solutions event on November 16th.
Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.
Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practice one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.
Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. Some interesting points here including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.
Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.
Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.
Last night was the second Cambridge search meetup, held in a (rather noisy as it turned out) pub close to the river. It was great to see so many new faces from a wide range of backgrounds including bioinformatics, rare books and academic publishing.
First of the talks was from Tyler Tate of TwigKit, who described the typical search process as a ‘funnel’, narrowing the available options to an eventual conclusion. He told us how the original definition of search removed the user from the picture, and how to improve things we should make it easy to organise, annotate and compare search results to allow both the user and the system itself to learn. His slides are available here.
After a short break we heard from Mike Taylor of Microsoft Research, who led us through the history of ranking models, from the classic BM25 through to ‘black box’ systems using machine learning methods including gradient descent and neural networks. He mentioned LambdaRank, which was unfamiliar to most of us (some papers by Burges et al. are available on the Microsoft site). Interestingly it seems that the focus at Microsoft has shifted back to probabilistic models and Mike showed examples including a system for predicting ‘real’ clicks on online adverts (as opposed to automatic clicks by web robots).
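For readers who haven’t met BM25, here is a small sketch of how it scores documents against a query. It assumes pre-tokenised documents, the commonly used defaults k1 = 1.2 and b = 0.75, and the non-negative IDF variant used by Lucene; none of these details come from Mike’s talk, and `bm25_scores` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document in a non-empty collection against the query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        total = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
            # Term frequency saturates, and longer documents are penalised.
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            total += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(total)
    return scores
```

The two parameters are the only knobs: k1 controls how quickly repeated terms stop adding to the score, and b controls how strongly document length is normalised.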
Thanks to our speakers and everyone who came and we hope to continue what is proving to be a popular series of events. Next is a gathering of those involved in open source search on Tuesday 3rd May – hope to see some of you there.