Posts Tagged ‘networking’
Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, hierarchical ‘query language’ and automatically distributing the search load. Richard’s conclusion was that although elasticsearch is not as mature as Apache Solr it is certainly a project to consider, with the caveats that development is rapid and documentation is not easy to find. We’ll watch this project with interest.
Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.
Thanks to Richard for organising the evening and to all who came.
Here’s the second writeup.
We started after lunch with a talk from Flavio Junqueira of Yahoo! on web search engine caching. He talked both about the various things that can be cached (query results, term lists and document data) and the pros and cons of dynamic versus static caching. His work has focused on the former, with a decoupled approach – i.e. the cache doesn’t automatically know what’s changed in the index. The approach is to give data in the cache a ‘time to live’ (TTL), after which it is refreshed – an acceptable approach as search engines don’t have a ‘perfect’ view of the web at any one point in time. As he mentioned, this method is less useful for ‘real-time’ data such as news.
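To make the TTL idea concrete, here is a minimal sketch in Python. It is not taken from the talk: the class name `TTLCache`, the `fetch` callback and the injectable `clock` are all assumptions chosen for illustration. The cache is decoupled in the sense described above, as it never asks the index what has changed and simply treats an entry as stale once its time to live has passed.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Hypothetical query-result cache whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, query: str, fetch: Callable[[str], List[str]]) -> List[str]:
        """Return cached results, or call fetch() if the entry is missing or stale."""
        now = self.clock()
        entry = self._store.get(query)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # still fresh: serve from cache
        results = fetch(query)  # missing or expired: go back to the index
        self._store[query] = (now, results)
        return results
```

A short TTL keeps results closer to the live index at the cost of more refreshes, which is one way to see why this method suits slowly changing web pages better than news.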
Francesco Calabrese followed, talking about his work in the IBM Smarter Cities Technology Centre in Dublin itself. Using data from mobile devices his group has looked at ‘digital footprints’ and how they might be used to better understand such things as public transport provision. An interesting effect they have noticed is that they can predict the type of an event (say a football match) from the points of origin of the attendees. This talk wasn’t really about search, although the data gathered would be useful in search applications with geolocation features.
Gery Ducatel from BT was next, with a description of a search application for their mobile workforce, allowing searches over a job database as well as reference and health & safety information. This had some interesting aspects, not least with the user interface – you can’t type long strings wearing heavy gloves while halfway up a telegraph pole! The system uses various NLP features such as a part-of-speech tagger to break down a query and provide easy-to-use dropdown options for potential results. The user interface, while not the prettiest I’ve seen, also made good use of geolocation to show where other engineers had carried out nearby jobs.
I followed with my talk on Unexpected Search, which I’ll detail in a future blog post. We then moved on to a panel discussion on the IBM Watson project – suffice it to say that although I’ve been asked about this a lot in the last few months, it seems to me that this was a great PR coup for IBM rather than a huge leap forward in the technology (which by the way includes the open source Lucene search engine).
Thanks again to Udo and Tony for organising the day, and for inviting me to speak – there was a fascinating range of speakers and topics, and it was great to catch up with others working in the industry.
As promised here’s a writeup of the day itself. I’ve split this into two parts.
The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related, queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.
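For readers who haven’t met learning to rank before, the sketch below shows one common form of it in Python: a linear scoring function trained by gradient descent on pairs of results, where one result in each pair is known to be better than the other. This is a generic pairwise logistic loss and an assumption on my part, not the recipe from the talk; the function names and the toy feature vectors in the usage note are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (better result, worse result)


def score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear relevance score: higher means ranked earlier."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise_ranker(pairs: List[Pair], n_features: int,
                          learning_rate: float = 0.1,
                          epochs: int = 200) -> List[float]:
    """Learn weights so that each 'better' result outscores its 'worse' one."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = score(weights, diff)
            # Loss is log(1 + exp(-margin)); its derivative with respect to
            # the margin is -1 / (1 + exp(margin)).
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            # Gradient descent step on the weights.
            weights = [wt - learning_rate * grad_scale * d
                       for wt, d in zip(weights, diff)]
    return weights
```

With made-up features such as `[title_match, freshness]`, calling `train_pairwise_ranker([([1.0, 0.2], [0.0, 0.5])], n_features=2)` returns weights that can then be used with `score` to sort a merged list of results from several related queries.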
Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.
Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.
After a short break we had Tyler Tate, who spoke recently in Cambridge, so I won’t repeat his slides here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.
We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.
I spent part of last week at the 33rd European Conference on Information Retrieval in Dublin, as I had been asked to speak during the Industry Day (of which, more later – far too much useful information to cram into one blog post!). Arriving late afternoon on Wednesday I caught up with Olly Betts of Oligarchy, one of the core Xapian developers who’d travelled from New Zealand. Olly told me more about the Xapian projects running as part of Google’s Summer of Code – very exciting to hear that there were over 40 applicants this year for a limited number of slots.
We went on to the conference banquet at the Lyons Estate outside the city – which in some ways reminded me of Portmeirion – and caught up with people from Google Zurich amongst others. This was one of several fantastic venues organised by the Dublin team led by Cathal Gurrin (at Industry Day itself we were high above the city with a great view, and I heard good things about the Guinness Storehouse, the venue for the first day of the conference).
Thanks to all the team (especially Udo Kruschwitz and Tony Russell-Rose for organising Industry Day). I look forward to catching up with some of you at the next BCS IRSG Search Solutions event on November 16th.
Last night we held the first of what we hope will be a series of Meetups in our home town of Cambridge, U.K. Attending were researchers, developers and entrepreneurs in the field of search – as is the norm in Cambridge many had cycled to the venue, and there was a friendly and informal feel to the group.
We started with my presentation on “Searching news media with open source software”, where I talked about our work for the NLA, Financial Times and others. We followed with John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”. John showed a Grapeshot project for Virgin where different media assets can be automatically grouped together even if they have different metadata – for example an episode of the TV show “Heroes” is basically the same object whether it is broadcast, video-on-demand or a repeat, but differs from the Bowie album of the same name.
We then broke up for discussion (and beer) – great to catch up with some ex-colleagues and meet others for the first time. Downstairs there was live music and one of our colleagues even joined the band for a spell on drums! From the feedback we received there’s definitely interest in repeating the event; if you’d like to attend next time please join the Meetup group.
A quick reminder that our first Cambridge Enterprise Search Meetup is tomorrow, February 16th from 6.30pm. More details in my previous post. We now have two talks, one from myself on “Open Source Search for News” and one from John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”.
If you’re able to come please let us know using the Meetup website so we can organise enough refreshments!
Another excellent evening as part of the Enterprise Search London Meetup series; very busy as usual.
Amir Dotan started us off with details of his work in designing user interfaces for the financial services sector, describing some of the challenges involved in designing for a high-pressure and highly regulated environment. Although he didn’t talk about search specifically we heard a lot about how to design useful interfaces. Two quotes stood out: “The right user interface can help make billions”, and as a way to get feedback “find someone nice in the business and never let them go”.
Gregory Grefenstette of Exalead was next, talking about his new book on Search Based Applications. He explained how SBAs have advantages over traditional databases in the three areas of agility, usability and performance and went on to show some examples, before an unfortunate combination of a broken slide deck and a failing laptop battery brought him to a halt: in retrospect a great advertisement for a physical book over a computer!
Upayavira of Sourcesense was next with details of a new search built for online news aggregator Moreover. This dealt with scaling Lucene/Solr to cope with indexing 2 million new documents a day, for a rolling 2 month index. He showed how some initial memory and performance problems had been solved with a combination of pre-warming caches, tweaks to the JVM and Java garbage collector and eventually profiling of their custom code. Particularly interesting was how they had developed a system for spinning up a complete copy of the searchable database (for load balancing purposes) on the Amazon EC2 cloud – from a standing start they can allocate servers, install software and copy across searchable indexes in around 40 minutes. This was a great demonstration of the power of the open source model – no more licenses to buy! Search performance over this large collection is pretty good as well, with faceted queries returning in a second or two and unfaceted in half a second.
We also heard from Martin White about an exciting new search related conference to be held in October this year in London in association with Information Today, Inc., and I managed a quick plug for our inaugural Cambridge Enterprise Search Meetup on Wednesday 16th February.
If you’re planning an enterprise search project and have no background in the technologies or principles involved, here are some tips to get you started. This isn’t going to be a definitive list so if you know more, please do comment.
There haven’t been a lot of books written on this area over the years, but more are appearing now (especially on open source options). Managing Gigabytes is a good, if slightly elderly, starting point on basic principles. For thoughts on search user interfaces try Peter Morville’s Search Patterns and for an application focus there’s the recent Search Based Applications. For those developing in the Lucene/Solr world there’s the classic (and recently updated) Lucene in Action and the related Solr 1.4 Enterprise Search Server and Building Search Applications: Lucene, LingPipe, and Gate.
Most people will (of course) start their research on the web, although sometimes it’s hard to find nuggets of real information amongst all the marketing. Wikipedia has a list of vendors, including open source solutions, and Avi Rappaport maintains the useful (although not completely up to date) Search Tools website. Some vendors and some open source projects provide FAQs and tutorials (for example the Lucene FAQ, Xapian and Sphinx documentation), which may also contain general information about search principles.
You might also consider joining discussion groups such as the popular LinkedIn Enterprise Search Engine Professionals or a local Meetup group. Training is another option – offered by some vendors and open source companies such as ourselves.
Cambridge, U.K. has a long history of hosting search experts and businesses. Back in the 1980s two firms arose – Cambridge CD Publishing, founded by Martin Porter and John Snyder, grew into Muscat, and Cambridge Neurodynamics became Autonomy. We believe Smartlogic still have a small office here. Stephen Robertson, co-author of the probabilistic theory of information retrieval (which Xapian uses for ranking) is based here at Microsoft Research.
Today, the city is still home to innovative search companies, including True Knowledge, Grapeshot and of course ourselves. We know of many more ‘under the radar’ developing search technologies both to complement existing systems and as completely new approaches to information retrieval, including visual search.
To encourage networking and to help keep the city at the forefront of search developments, we’ve created the Enterprise Search Cambridge Meetup group and our first meeting is on February 16th – all are welcome, whether currently working with search and related technologies or simply interested in the possibilities. Hope to meet you there!
I attended an interesting talk by Udo Kruschwitz on Adaptive Intranet Search last night as part of the Enterprise Search London Meetup. Udo has built a search engine for the University of Essex and has been investigating how to help users to refine their query using techniques such as suggesting related terms (there’s a similar feature in Xapian called ‘top terms’ – here’s an example). As part of this he’s done a great deal of analysis of query and session logs, and is building up expertise on automatically maintained domain knowledge – moving away from the traditional model of manually maintained networks relating one word or phrase to another. For example, his system is learning automatically that when users type “map” into the search box, they really want to search for “campus map”. The number of documents in his test collection is small and the volume of searches is low; it will be interesting to see how these ideas scale to larger collections and groups of users.
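As a rough illustration of how refinements like “map” to “campus map” might be learned from session logs, here is a small Python sketch. It is not Udo’s method, which wasn’t described at this level of detail: the idea of counting follow-up queries that add terms to the previous query, and the names `learn_refinements` and `suggest`, are assumptions for the sake of example.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def learn_refinements(sessions: List[List[str]]) -> Dict[str, Counter]:
    """Count how often a query is immediately followed by a more specific one."""
    follow_ups: Dict[str, Counter] = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            first_terms = set(first.lower().split())
            second_terms = set(second.lower().split())
            # Treat the second query as a refinement only if it keeps every
            # term of the first and adds at least one more.
            if first_terms < second_terms:
                follow_ups[first.lower()][second.lower()] += 1
    return follow_ups


def suggest(follow_ups: Dict[str, Counter], query: str,
            limit: int = 3) -> List[str]:
    """Return the most frequently observed refinements for a query."""
    counts = follow_ups.get(query.lower(), Counter())
    return [refined for refined, _ in counts.most_common(limit)]
```

Because the counts are rebuilt from the logs, the suggestions follow what users actually do rather than relying on a manually maintained network of related terms, although a real system would need far more data and some protection against noisy sessions.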
The group was small, informal and seemed to consist mainly of those with expertise in implementing search solutions – no sales or marketing here, just a group of people discussing how best to get the job done.