The Cambridge Enterprise Search Meetup last night featured Francis Rowland of the European Bioinformatics Institute and Rob Stacey of TrueKnowledge, in a newly refurbished venue. Thanks to all those who came and it was good to meet some new faces.
Francis talked about how search user interfaces should try not to restrict the user’s ‘flow’ of activity, as search is after all only a means to an end. Among the wealth of material he mentioned was the Endeca User Interface Design Pattern Library and what is sure to be a very useful upcoming book, Search Analytics for Your Site.
Rob told us about how TrueKnowledge provides a semantic question answering system – trying to understand the goal(s) of someone asking the system a question such as “is Madonna single?”. He also mentioned how this kind of technology might be applied to an enterprise environment, for example to answer questions like “has the invoice for last Thursday’s job been paid?”. Rob’s talk sparked off a very active Q&A session, with the audience raising issues such as how TrueKnowledge’s method might be applied to languages other than English and how to model the trustworthiness of their sources, which include Wikipedia.
Francis’ slides are now online – with some great sketchnotes of Rob’s talk as well! Thanks to both our speakers.
Here’s the second writeup.
We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on dynamic caching, with a decoupled approach – i.e. the cache doesn’t automatically know what’s changed in the index. The approach is to give data in the cache a ‘time to live’ (TTL), after which it is refreshed – an acceptable approach as search engines don’t have a ‘perfect’ view of the web at any one point in time. As he mentioned, this method is less useful for ‘real-time’ data such as news.
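To make the TTL idea concrete, here is a minimal Python sketch of a decoupled query-result cache. It is my own illustration rather than anything shown in the talk: the class name TTLCache, the fetch callback and the injectable clock are all assumptions made for the example.

```python
import time


class TTLCache:
    """Query-result cache whose entries expire after a fixed time to live.

    Hypothetical sketch: the cache never learns about index changes, it
    simply treats an entry as stale once its TTL has passed.
    """

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # query -> (results, expiry time)

    def get(self, query, fetch):
        """Return cached results, calling fetch(query) on a miss or expiry."""
        now = self.clock()
        entry = self._store.get(query)
        if entry is not None and entry[1] > now:
            return entry[0]         # still fresh: serve from the cache
        results = fetch(query)      # missing or expired: go back to the index
        self._store[query] = (results, now + self.ttl)
        return results
```

A short TTL keeps results fresher at the cost of more trips to the index, which is presumably why this approach suits general web search better than news.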
Francesco Calabrese followed, talking about his work in the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices his group has looked at ‘digital footprints’ and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say a football match) from the points of origin of the attendees. This talk wasn’t really about search, although the data gathered would be useful in search applications with geolocation features.
Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.
I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved onto a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).
Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.
Another excellent evening as part of the Enterprise Search London Meetup series; very busy as usual.
Amir Dotan started us off with details of his work in designing user interfaces for the financial services sector, describing some of the challenges involved in designing for a high-pressure and highly regulated environment. Although he didn’t talk about search specifically we heard a lot about how to design useful interfaces. Two quotes stood out: “The right user interface can help make billions”, and as a way to get feedback “find someone nice in the business and never let them go”.
Gregory Grefenstette of Exalead was next, talking about his new book on Search Based Applications. He explained how SBAs have advantages over traditional databases in the three areas of agility, usability and performance and went on to show some examples, before an unfortunate combination of a broken slide deck and a failing laptop battery brought him to a halt: in retrospect a great advertisement for a physical book over a computer!
Upayavira of Sourcesense was next with details of a new search built for online news aggregator Moreover. This dealt with scaling Lucene/Solr to cope with indexing 2 million new documents a day, for a rolling 2 month index. He showed how some initial memory and performance problems had been solved with a combination of pre-warming caches, tweaks to the JVM and Java garbage collector and eventually profiling of their custom code. Particularly interesting was how they had developed a system for spinning up a complete copy of the searchable database (for load balancing purposes) on the Amazon EC2 cloud – from a standing start they can allocate servers, install software and copy across searchable indexes in around 40 minutes. This was a great demonstration of the power of the open source model – no more licenses to buy! Search performance over this large collection is pretty good as well, with faceted queries returning in a second or two and unfaceted in half a second.
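For a rough sense of scale, the figures quoted above can be turned into a back-of-envelope calculation. This is only a sketch: the 60-day reading of ‘2 months’ and the function names are my assumptions, not numbers from the talk.

```python
SECONDS_PER_DAY = 86_400


def rolling_index_size(docs_per_day, retention_days):
    """Steady-state document count for an index that keeps a rolling window."""
    return docs_per_day * retention_days


def average_docs_per_second(docs_per_day):
    """Average indexing rate if documents arrived evenly through the day."""
    return docs_per_day / SECONDS_PER_DAY


# Assumption: '2 months' is treated as 60 days.
total_docs = rolling_index_size(2_000_000, 60)
rate = average_docs_per_second(2_000_000)
```

Under those assumptions the index holds around 120 million documents at steady state, with an average of roughly 23 new documents a second (real traffic will of course be burstier than that).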
We also heard from Martin White about an exciting new search related conference to be held in October this year in London in association with Information Today, Inc., and I managed a quick plug for our inaugural Cambridge Enterprise Search Meetup on Wednesday 16th February.
Last night I went to another excellent Enterprise Search London Meetup, at Skinkers near London Bridge. I’d been at the Online show all day, which was rather tiring, so it was great to sit down with beer and nibbles and hear some excellent speakers.
Max Wilson kicked off with a talk on exploratory search and ‘searching for leisure’. His Search Interface Inspector looks like a fascinating resource, and we heard about how he and his team have been constructing a taxonomy for the different kinds of search people do, using Twitter as a data source.
Martina Schell was next with details of Travel Match, a holiday search engine that’s trying to do for holidays what our customer Mydeco is doing for interior design: scrape/feed/gather as much holiday data as you can, put it all into a powerful search engine and build innovative interfaces on top. They’ve tried various interfaces including a ‘visual search’, but after much user testing have reined back their ambitions somewhat – however they’re still unique in allowing some very complex queries of their data. Interestingly, one challenge they identified is how to inform users that one choice (say, airport to fly from) may affect the available range of other choices (say, destinations) – apparently users often click repeatedly on ‘greyed-out’ options, unsure as to why they’re not working…
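The dependency between choices is easy to sketch in code. The following Python example is purely illustrative and has nothing to do with how Travel Match is actually built: the sample data, field names and function name are all made up. It works out which values of one facet are still reachable given the user’s other selections, which is the information an interface needs in order to explain a greyed-out option.

```python
# Hypothetical sample data: each holiday has a departure airport and a destination.
HOLIDAYS = [
    {"airport": "Stansted", "destination": "Malaga"},
    {"airport": "Stansted", "destination": "Rome"},
    {"airport": "Gatwick", "destination": "Rome"},
]


def available_options(holidays, selected, facet):
    """Return the values of `facet` still reachable given the other selections.

    Any current selection on `facet` itself is ignored, so the user can
    still see what they could switch to.
    """
    matching = [
        h for h in holidays
        if all(h[k] == v for k, v in selected.items() if k != facet)
    ]
    return {h[facet] for h in matching}
```

With this data, choosing Gatwick leaves only Rome as a destination, so Malaga would be greyed out; an interface could use the same result to say ‘no flights to Malaga from Gatwick’ rather than leaving the user to click in vain.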
The inimitable Stephen Arnold concluded the evening with a realistic treatment of the current fashion for ‘real-time’ search. His point was that unless you’re Google, with their fibre-connected, hardware-accelerated gigascale architecture, you’re not going to be able to do real-time web search or anything close to it; on a smaller scale, for financial trading, military and other serious applications you again need to rely on the hardware – so for proper real-time (that means very close to zero latency), your engineering capability, not your software capability, is what counts. I’m inclined to agree – I trained as an electronic engineer and worked on digital audio, back when this was also only possible with clever hardware design. Of course, eventually the commodity hardware gets fast enough to move away from specialised devices, and at this point even the laziest coder can create responsive systems, but we’re far away from that point. Perhaps the marketing departments of some search companies should take note – if you say you can do real-time indexing, we’re not going to believe you.
Thanks again to Tyler Tate and all at TwigKit for continuing to organise and support this excellent event.
I spent yesterday at Search Solutions 2010, hosted by the British Computer Society. They’d been kind enough to ask me to speak (Update: my slides are available here, the rest are available at the event website above), but there were plenty of other people to listen to as well. There’s a great blow-by-blow account from Tyler Tate already, but here are some personal highlights:
Google’s Behshad Behzadi spoke about freshness for web content and how Google’s usual ranking strategy favours older results over new ones – as the new ones don’t have so many links. Vishwa Vinay from Microsoft talked on what to do with click data in enterprise search – he listed lots of papers on the subject, so hopefully his slides will be published and we can follow them up. He made the point that any ‘adaptive’ ranking based on click data must still work well out of the box, before any clicks have happened. This section of the event finished with Vivian Lin Dufour of Yahoo!, talking about some ways of guiding searchers from within the UI, with auto-suggest and similar techniques. Apparently the research the Yahoo team are doing on trending has let them spot news stories 12-24 hours before they hit the papers. I wondered afterwards, is this current fad for ‘trendspotting’ turning search engines into just a media channel? I don’t care much about the X-Factor TV show myself, so why should this current trend influence my search results?
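One simple way to get click-based ranking that still behaves sensibly before any clicks exist is to smooth the observed click-through rate towards a prior. The Python sketch below is my own illustration, not a method from Vishwa Vinay’s talk: the function name, the prior values and the multiplicative blend are all assumptions.

```python
def adaptive_score(base_score, clicks, impressions,
                   prior_ctr=0.1, prior_weight=20.0):
    """Scale a base relevance score by a smoothed click-through rate.

    Hypothetical sketch: with no click data the smoothed rate equals the
    prior, so the boost is 1.0 and the out-of-the-box ranking is unchanged.
    """
    smoothed_ctr = (clicks + prior_weight * prior_ctr) / (impressions + prior_weight)
    boost = smoothed_ctr / prior_ctr
    return base_score * boost
```

A result with no impressions keeps its base score, one clicked more often than the prior suggests is boosted, and one that is shown repeatedly but rarely clicked drifts downwards. The prior_weight parameter controls how many impressions it takes before the click data starts to dominate.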
Nick Patience started the next session talking about trends in the Enterprise Search market: he acknowledged the rapid rise of open source solutions and talked about how search-based applications will become increasingly important, with a huge market for ‘information governance’ solutions opening up. Chirag Ghandhi of Mphasis, a search integrator, talked about how customers are disillusioned with enterprise search, and how difficult it is to build solutions that cope with data from a range of different sources and in different languages. Dusan Rnic of Endeca stressed the importance of being able to handle the ‘long tail’ of search results (the ones that aren’t the most popular) and showed us his favourite website – strangely enough, an Endeca customer.
Greg Lyndahl talked about how Blekko have built an innovative web crawling/indexing framework, which has enabled them to build up a 3 billion page index very efficiently – looking forward to seeing more of this. As he said, what they’re doing isn’t necessarily better than Google, but it’s certainly different. My talk on open source search for news content followed, and then Roberto Cornacchia showed us Spinque’s approach to building search platforms – encapsulating search expert knowledge into logical ‘blocks’ that can be combined by domain experts into the solutions they actually need.
The last session began with Till Kinstler of GBV Common Library Network, a self-described ‘library hacker’, on building a search system using the open source engine Solr over 25 million library records – they’re now aiming for 120 million, taken from 400 different libraries, in source formats going all the way back to tape and paper library cards! We then heard about the Information Retrieval Facility, an open IR research institution – I liked their three principles of ‘open science, open source, open market’. The talks finished with Rob Stacey on True Knowledge’s ways of checking the veracity of facts gathered from the internet.
We then moved on to an open panel – some great themes here including the rise of search as a platform for new applications, what exciting (or scary) things Facebook might bring to the world of search, and how we should all work harder to bring good information retrieval mechanisms to those who cannot currently access them due to poverty, language barriers or disability.
Thanks to the BCS IRSG and in particular to Udo Kruschwitz for a very interesting and enlightening day.
Peter Morville has created a Flickr collection of ‘search patterns’, showing the different kinds of search interfaces available. I can highly recommend you take a look if you’d like some good examples of clustering, faceted navigation, auto-suggest and interfaces for certain sectors such as e-commerce. We often find these concepts difficult to explain to customers without some real-world examples.