Posts Tagged ‘google’
There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with Powerpoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.
Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem - “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.”. I look forward to a brain-compatible indexing tool! There’s also the misassumption that what works for the wider consumer-focused Web will work for the enterprise – Amazon.com, Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.
Unfortunately in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything, for example your bosses’ salary) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard and few projects get beyond basic indexing of filestores and databases, let along adding in more people-focused features.
I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this, preserving a consistent metadata model, coping with legacy formats, preserving full security and coping with updates then search should be relatively simple to implement on the resulting central store; however the devil is as ever in the detail.
Google have launched a new version of their search appliance this week – this is the GSA of course, not the Google Mini which was canned in summer 2012 (someone hasn’t told Google UK it seems – try buying one though).
Although there’s a raft of new features, most of them have been introduced by the GSA’s competitors over the last few years or are available as open source (entity recognition or document preview for example). The GSA is also not a particularly cheap option as commentators including Stephen Arnold have noticed: we’ve had clients tell us of six-figure license fees for reasonably sized collections of a few millions of documents – and that’s for two years, after which time you have to buy it again. Not surprisingly some people have migrated to other solutions.
However there’s another question that seems to have been missed by Google’s strategists: how a physical appliance can compete with cloud-based search. I can’t think of a single prospective client over the last year or so who hasn’t considered this latter option on both cost and scalability grounds (and we’ll shortly be able to talk about a very large client who have chosen this route). Although there may well be a hard core of GSA customers who want a real box in reassuring Google yellow, one wonders why Google haven’t considered a ‘virtual’ GSA to compete with Amazon’s CloudSearch amongst others.
It will be interesting to see if this version of the GSA is the last…
I visited Enterprise Search Europe for the first day only last week, and caught a number of the presentations as well as giving one of my own (which I won’t discuss here but you’ll hear more about over the next few weeks). First up was Paul Doscher of Lucid Imagination with a lively presentation discussing whether search is either dead or now a commodity, or whether search on Hadoop is the new killer app for the emerging world of Big Data. We then had Kristian Norling from Findwise with some initial results from their survey on enterprise search – some interesting numbers here such as ‘18.5% of users are mostly/very satisfied with search’ and only ‘6% have a search strategy although 46% are planning one’ – we hear that Kristian is hoping to make the survey an annual one, which will be a great resource for anyone in the industry.
Matt Mullen, fuelled by diet cola, gave an introduction to search with a key point – that enterprise search usually performs a role within a workflow or task – a fact often ignored. Runar Buvik of Searchdaimon talked about a great resource he has developed comparing search engines, which can give some often amusing contrasts between different technologies, with some insisting there are no results for a particular query while others find thousands. I also enjoyed Emma Bayne and Donald Phillips polished presentation on the search facilities at the National Archives – interestingly although Autonomy is currently powering their search they are considering open source alternatives.
The day concluded with a presentation from Matt Eichner of Google, who turned up with their own film crew. You can read much of what he said at Computer World. I’m afraid I didn’t enjoy this presentation very much – it talked down to the audience and contained a lot of FUD around open source (surprising when Google uses and supports so much of it) – complete with sympathy-garnering pictures of babies in incubators and silly analogies about how one should prefer to fly in the airplane that cost the most. I hadn’t realised until his talk that the Google Search Appliance appears to be made of cheese!
It was great to network and catch up, and I hope next year to be able to attend the whole event. Thanks to all the organisers especially Martin White of Intranet Focus.
Our customer Cambridge Intellectual Property announced yesterday their new API for a collection of 55 million patents – 48 million more than Google Patents. It’s great to see a Cambridge company innovating in this space, especially as the service is powered by Apache Solr (we’ve given them some small assistance with configuring and tuning this software over the last few months).
The API, available on the Boliven website, offers a REST based service and returns patent data in JSON or XML – so users can easily integrate patent data with their own applications. It can also return PDFs or summaries of the selected patents. In addition, the API will allow users to search and query Boliven’s database of 45+ million science literature documents including journal publications and medical device trials. That’s around 100 million items in total.
Like the Guardian’s Open Platform which I wrote about previously, this is a great example of open source search technology as a platform for new delivery methods – showing how effective (and economical) it can be at this large scale.
It didn’t take me long to find my own small contribution to the patent landscape.
Last night there was a small gathering in Cambridge of open source search engine developers and enthusiasts. Richard Boulton hosted the event and began with an introduction to elasticsearch, which is an “Open Source (Apache 2), Distributed, RESTful, Search Engine built on top of Lucene”. Richard told us about how this system attempts to make prototyping and building search systems easier by automatically guessing data schemas, offering a powerful, heirarchical ‘query language’ and automatically distributing the search load. Richard’s conclusions were that although elasticsearch is not as mature as Apache Solr it is certainly a project to consider: however development is rapid and documentation is not easy to find. We’ll watch this project with interest.
Olly Betts next told us about various Xapian projects running as part of this year’s Google Summer of Code; this led into a discussion of Learning to Rank and how this might be implemented in practical terms. It’s great to see these cutting-edge features being added to an open source project.
Thanks to Richard for organising the evening and to all who came.
As promised here’s a writeup of the day itself. I’ve split this into two parts.
The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related queries: for example in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a best result. Some fascinating ideas here although it may be a while before we see practical applications in enterprise search.
Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.
Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encounted, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.
After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t repeat his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270, sadly very few companies manage more than 50 points.
We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.
I spent part of last week at the 33rd European Conference on Information Retrieval in Dublin, as I had been asked to speak during the Industry Day (of which, more later – far too much useful information to cram into one blog post!). Arriving late afternoon on Wednesday I caught up with Olly Betts of Oligarchy, one of the core Xapian developers who’d travelled from New Zealand. Olly told me more about the Xapian projects running as part of Google’s Summer of Code – very exciting to hear that there were over 40 applicants this year for a limited number of slots.
We went on to the conference banquet at the Lyons Estate outside the city – which in some ways reminded me of Portmeirion – and caught up with people from Google Zurich amongst others. This was one of several fantastic venues organised by the Dublin team led by Cathal Gurrin (at Industry Day itself we were high above the city with great view, and I heard good things about the Guinness Storehouse, the venue for the first day of the conference).
Thanks to all the team (especially Udo Kruschwitz and Tony Russell-Rose for organising Industry Day). I look forward to catching up with some of you at the next BCS IRSG Search Solutions event on November 16th.
Back to London for the next Enterprise Search Meetup, this time featuring Stefan Olafsson of TwigKit and Jeremy Bentley of Smartlogic.
Stefan started off with a brief look at relational databases and search engines, and whether the latter can ever supersede the former. He talked about how modern search technologies such as Apache Solr share many of the same features as the new generation of NoSQL databases, but how in practise one often seems to end up with a combination of search engine and relational database – an experience we share, although we have a small number of customers who have entirely moved away from databases in favour of a search engine.
Jeremy’s talk was an in-depth look at Smartlogic’s products, which include taxonomy creation and management tools, and are designed to complement search engines such as Solr or the GSA. Some interesting points here including the assertion that ‘we trust our content to systems that know nothing about our content’ – i.e. word processors, content storage and management systems – and that we rely on users to add consistent metadata. Smartlogic’s products promise to automate this metadata creation and he had some interesting examples such as the NHS Choices website.
Some interesting discussions followed on the value of taxonomies. Our view is that open taxonomy resources such as Freebase are better than those developed and kept private within organisations, as this can prevent duplication and promote cooperation and the sharing of information. Also, taxonomies often seem to be introduced as a way to fix a broken search experience – maybe fixing the search should be a higher priority.
Thanks to Tyler Tate for organising the event – the tenth in this series of Meetups, and now a regular and much anticipated event in the calendar.
I was recently interviewed by Mitchell Pronschinske for the DZone website on the subjects of open source search: you can download the podcast here. It’s part of a large resource they have on open source search, well worth a browse. We discussed how open source enterprise search has reached parity with closed source solutions, the various options available and what future developments might be.
You can also hear me talk at the European Conference on Information Retrieval (ECIR) in Dublin, as part of Industry Day on Thursday 21st April alongside speakers from Microsoft, Google, Yahoo and IBM amongst others. Do get in touch if you’re attending and would like to meet up for a chat about search over a pint of Guinness!
We dropped in to the Online 2010 event at Olympia this week, and were immediately struck by how quiet the event was: yes, there’s been some terrible weather recently in the UK but there were fewer stalls than last year, a smaller exhibition space and very few exhibitors in the enterprise search space – no Autonomy, Google, Vivisimo or Endeca for example. Unlike previous years there was no dedicated ’search’ area on the exhibition floor, and we did see a few unmanned stands from mid afternoon. Is this is a sign of difficult times or of an event that needs a rethink about its focus?
We didn’t attend the conference that runs next to the exhibition hall this year. This report on the closing panel shows that one question to the panel was about the rise of open source search – not surprisingly, the panel members (all being from closed source companies) weren’t very enthusiastic about this. According to Autonomy open source is only for the commodity end of the market, which is the smallest part. I’m not sure Twitter (1 billion queries a day), LinkedIn (30 million users), The Guardian (innovative open platform) or the Financial Times would agree…