Posts Tagged ‘microsoft’
Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.
Milad Shokouhi of Microsoft Research started us off showing us how he’s worked on query trend analysis for Bing: he showed us how some queries are regular, some spike and go and some spike and remain – and how these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues which is worth a look.
Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems that users are “very satisfied” with usually do the same things, such as regular user testing and employing a specialist internal search team.
Stella Dextre Clark talked next about a new ISO standard for thesauri, taxonomies and their interoperability with other vocabularies – some great points on the need for thesauri to break down language barriers, help retrieval in enterprise situations where techniques such as PageRank aren’t so useful and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine both for KDE Linux and currently the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.
Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.
The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.
Last night our US partners Lucid Imagination announced that LucidWorks, their packaged and supported version of Apache Lucene/Solr, is available on Microsoft’s Azure cloud computing service. It seems like only a few weeks since Amazon announced their own CloudSearch system and no doubt other ‘search as a service’ providers are waiting in the wings (we’re going to need a new acronym as SaaS is already taken!). At first the combination of a search platform based on open source Java code with Microsoft hosting might seem strange, and it raises some interesting questions about the future of Microsoft’s own FAST Search technology – is this final proof that FAST will only ever be part of Sharepoint and never a standalone product? However with search technology becoming more and more of a commodity this is a great option for customers looking for search over relatively small numbers of documents.
Lucid’s offering is considerably more flexible and full-featured than Amazon’s, which we hear is pretty basic with a lack of standard search features like contextual snippets and a number of bugs in the client software. You can see the latter in action at Runar Buvik’s excellent OpenTestSearch website. With prices for the Lucid service ranging from free for small indexes, this is certainly an option worth considering.
Another Cambridge Search Meetup this week, with two speakers on unstructured data, plus the usual networking, beer and snacks. We started with Dean Yearsley of Pingar talking and bravely attempting a live demo of their API, which amongst other things has facilities for entity extraction in multiple languages including English, Chinese and Japanese. The Pingar system is written in .Net and thus unsurprisingly plays well with Sharepoint: Dean demonstrated it automatically providing extra metadata for Sharepoint items, especially useful if a new column has been added to a Sharepoint store, as it would be tedious for operators to have to add data for this column to each item manually.
Jordan Hrycaj of 7Safe, recently acquired by PA Consulting, was up next to talk about what he described as ‘ad-hoc’ search – for use in digital forensics or digital discovery applications. The application he described can be used to search the hard disks of suspect PCs or servers for information such as credit card numbers extremely quickly, working at a low level to avoid leaving any impression on the data (i.e., no file timestamps are altered) and usually working on live systems. This system is command line based, tiny in size and portable across operating systems and is an impressive way to cut down the likely candidates for a data security breach. It was fascinating to hear about a way to search that doesn’t depend on indexing, and the compromises made for performance reasons (e.g., regular expressions can be used but without wildcards).
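To give a flavour of index-free searching, here is a minimal Python sketch that scans a buffer of raw bytes for candidate card numbers and keeps only those that pass the Luhn checksum. It makes no claim about how the 7Safe tool works: the names `CARD_PATTERN`, `luhn_valid` and `scan_bytes` are hypothetical, the pattern uses quantifiers that the tool described may well not support, and the sketch reads an in-memory buffer rather than a disk at a low level.

```python
import re

# Candidate card numbers: 13 to 19 ASCII digits that are not part of a
# longer run of digits. (Hypothetical pattern, chosen for illustration.)
CARD_PATTERN = re.compile(rb"(?<!\d)\d{13,19}(?!\d)")


def luhn_valid(digits: bytes) -> bool:
    """Return True if the ASCII digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit; every second digit is doubled.
    for position, char in enumerate(reversed(digits)):
        value = char - ord("0")
        if position % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scan_bytes(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, digits) for each Luhn-valid candidate in the buffer."""
    return [
        (match.start(), match.group())
        for match in CARD_PATTERN.finditer(data)
        if luhn_valid(match.group())
    ]


if __name__ == "__main__":
    sample = b"junk 4111111111111111 more 4111111111111112 end"
    for offset, digits in scan_bytes(sample):
        print(offset, digits.decode("ascii"))
```

The checksum step is what keeps the candidate list short: most random runs of digits fail it, so only plausible card numbers are reported.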
Thanks to both speakers and to all who came to hear them. We already have some more talks lined up so we expect the next Meetup to be sooner rather than later!
I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.
The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where for example a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.
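To make the unit problem from John Tait’s talk concrete, here is a minimal Python sketch that normalises temperatures to Kelvin before comparing them, so that 200°C and 392°F are treated as the same value. This is only an illustration and makes no claim about how the max.recall plugin works; the names `TEMPERATURE`, `to_kelvin`, `extract_kelvin` and `same_temperature`, and the tolerance of 0.5 K, are my own assumptions.

```python
import re

# Matches values such as "200°C", "392 °F" or "473.15 K".
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*(°C|°F|K)\b")


def to_kelvin(value: float, unit: str) -> float:
    """Convert a temperature in °C, °F or K to Kelvin."""
    if unit == "°C":
        return value + 273.15
    if unit == "°F":
        return (value - 32.0) * 5.0 / 9.0 + 273.15
    if unit == "K":
        return value
    raise ValueError(f"unsupported unit: {unit}")


def extract_kelvin(text: str) -> list[float]:
    """Find every temperature mentioned in the text, expressed in Kelvin."""
    return [
        to_kelvin(float(number), unit)
        for number, unit in TEMPERATURE.findall(text)
    ]


def same_temperature(query: str, document: str, tolerance: float = 0.5) -> bool:
    """True if any temperature in the query matches one in the document."""
    return any(
        abs(q - d) <= tolerance
        for q in extract_kelvin(query)
        for d in extract_kelvin(document)
    )


if __name__ == "__main__":
    print(same_temperature("melting point of 200°C", "melts at 392 °F"))
```

In a real search engine the normalised value would more likely be stored in a numeric field at index time, so that range queries work across units as well.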
Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.
After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.
The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently). Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV program recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.
It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.
I spent yesterday morning at Ovum’s briefing on Enterprise Search, and they kindly invited me to sit on a discussion panel. One of the more controversial topics raised by analyst Mike Davis was ‘Is Enterprise Search dead?’ which provoked some lively discussion. We also heard from Tyler Tate of Twigkit on Search UX, Exalead on Search Based Applications and Search Technologies on data conditioning and why metadata is so important.
One can’t deny that the search market is going through some huge changes at the moment. Larger vendors are being acquired which can lead to some major (and not always welcome) changes in the product, pricing and service. Smaller vendors are finding it increasingly hard to compete with the plethora of powerful open source solutions (we’ve heard rumours of prices of closed source solutions being dropped radically to attempt to secure new business). There are also some interesting moves towards more comprehensive Business Intelligence and Unified Access solutions, such as Attivio.
I don’t think enterprise search is dying as a market or an offering, simply changing – and hopefully for the better, into an era of more realistic pricing, solutions that actually work (rather than promising ‘magic’) and more openness in terms of the technology and capability.
The blogotweetosphere has been positively buzzing since last night’s announcement that Hewlett Packard will be buying Autonomy for £7.1bn, while divesting itself of its PC business. Many commentators have put a positive spin on this, pointing to Autonomy’s meteoric rise from a small office in Cambridge to the behemoth it is today. It’s undoubtedly good news for Autonomy’s shareholders. Dave Kellogg correctly identifies Autonomy as a “finance company dressed in (meaning-based) technology company clothing” with a “happy ending”.
However the reaction isn’t all positive – the FT implies this deal is at the “lunatic end of the valuation spectrum”. Law Technology News says “Autonomy’s e-discovery revenue stream is high-end but unsustainable” and quotes users of the system with problems: “We had a lot of issues with the applications crashing, the documents tending not to get checked in” … “[Autonomy sales staff] were pricey, arrogant, and they couldn’t care less about us. … It cannot get any worse.”
HP will have to work hard to integrate Autonomy into both its corporate culture and software frameworks – a problem currently faced by Microsoft since its acquisition of FAST a short while ago. Stephen Arnold thinks this process will be “risky”. What it means for the rest of the search sector is harder to guess, although Martin White of Intranet Focus says this deal indicates HP can see a “future in search applications” and, interestingly, “A number of privately-held search vendors are probably working out what their valuation would be”.
My view is that this is just the latest of huge shifts in the enterprise search market, partly spurred on by the rise of open source options and the gradual realisation that the huge license fees charged by some vendors may be unsustainable. I don’t think Autonomy will be the last company looking for a safe haven in the years to come.
I’ve uploaded a whitepaper I wrote a short while ago:
“In these rapidly changing times we don’t know what we will need to search tomorrow – so it’s important to be adaptable, flexible and able to cope with data volumes that may not scale linearly. Maintaining control over the future of your search software is also key. Open source search has come of age and every modern business should be aware of its advantages.”
It’s available in our downloads area, together with several case studies on open source search projects we’ve carried out for clients.
As promised here’s a writeup of the day itself. I’ve split this into two parts.
The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here although it may be a while before we see practical applications in enterprise search.
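For readers who haven’t met learning to rank before, here is a minimal Python sketch of one common flavour: learning linear weights from pairs of results by gradient descent on a pairwise logistic loss. It is not Martin’s recipe, just a generic illustration; the function names, the two made-up features and the training settings are all assumptions of mine.

```python
import math


def score(weights, features):
    """Linear ranking score: higher means the result should rank higher."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(pairs, n_features, learning_rate=0.1, epochs=200):
    """Learn weights so that score(better) > score(worse) for each pair.

    pairs is a list of (better_features, worse_features) tuples. Each step
    follows the gradient of the loss log(1 + exp(-(s_better - s_worse))).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - w for b, w in zip(better, worse)]
            margin = score(weights, diff)
            # Derivative of log(1 + exp(-margin)) with respect to margin.
            slope = -1.0 / (1.0 + math.exp(margin))
            weights = [
                w - learning_rate * slope * d for w, d in zip(weights, diff)
            ]
    return weights


if __name__ == "__main__":
    # Hypothetical features: [text match strength, document age].
    training_pairs = [
        ([1.0, 0.2], [0.0, 0.9]),
        ([0.8, 0.1], [0.1, 0.5]),
        ([0.9, 0.7], [0.2, 0.6]),
    ]
    learned = train_pairwise(training_pairs, n_features=2)
    print(learned)
```

Once the weights are learned, results gathered from several related queries can be scored with the same `score` function and sorted into a single merged list.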
Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.
Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.
After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.
We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.
Last night was the second Cambridge search meetup, held in a (rather noisy as it turned out) pub close to the river. It was great to see so many new faces from a wide range of backgrounds including bioinformatics, rare books and academic publishing.
First of the talks was from Tyler Tate of TwigKit, who described the typical search process as a ‘funnel’, narrowing the available options to an eventual conclusion. He told us how the original definition of search removed the user from the picture, and how to improve things we should make it easy to organise, annotate and compare search results to allow both the user and the system itself to learn. His slides are available here.
After a short break we heard from Mike Taylor of Microsoft Research who led us through the history of ranking models, from the classic BM25, through ‘black box’ systems using machine learning methods including gradient descent and neural networks. He mentioned LambdaRank which was unfamiliar to most of us (some papers by Burges et al are available on the Microsoft site). Interestingly it seems that the focus at Microsoft has shifted back to probabilistic models and Mike showed examples including a system for predicting ‘real’ clicks on online adverts (as opposed to automatic clicks by web robots).
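For anyone who hasn’t seen BM25 written down, here is a minimal Python sketch of the scoring function. It uses the common non-negative IDF variant (the one Lucene uses) with the usual default parameters `k1=1.2` and `b=0.75`; it isn’t taken from Mike’s slides, the function name `bm25_scores` is my own, and it assumes a non-empty list of already tokenised documents.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms with BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        total = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            # Term frequency saturates, and long documents are penalised.
            norm = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
            total += idf * tf * (k1 + 1.0) / norm
        scores.append(total)
    return scores


if __name__ == "__main__":
    docs = [
        ["mass", "of", "jupiter"],
        ["jupiter", "moons", "and", "rings"],
        ["saturn", "rings"],
    ]
    print(bm25_scores(["jupiter", "mass"], docs))
```

The two parameters are the usual tuning knobs: `k1` controls how quickly repeated occurrences of a term stop adding to the score, and `b` controls how strongly document length is taken into account.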
Thanks to our speakers and everyone who came and we hope to continue what is proving to be a popular series of events. Next is a gathering of those involved in open source search on Tuesday 3rd May – hope to see some of you there.
I was recently interviewed by Mitchell Pronschinske for the DZone website on the subject of open source search: you can download the podcast here. It’s part of a large resource they have on open source search, well worth a browse. We discussed how open source enterprise search has reached parity with closed source solutions, the various options available and what future developments might be.
You can also hear me talk at the European Conference on Information Retrieval (ECIR) in Dublin, as part of Industry Day on Thursday 21st April alongside speakers from Microsoft, Google, Yahoo and IBM amongst others. Do get in touch if you’re attending and would like to meet up for a chat about search over a pint of Guinness!