Posts Tagged ‘enterprise search’

The four types of open source search project

As I’m currently writing content for our new Flax website (which is taking far longer than anticipated for various reasons I won’t bore you with) I’ve been thinking about the sort of projects we encounter at Flax. You might find this useful if you’re planning or starting a search project with Solr or Elasticsearch. Note that not everything we do fits cleanly into these four categories!

The search idea

So you’ve got this idea and you’re convinced that you need search as part of the puzzle, but you’re not sure where it fits, whether it will be performant or how to gather and transform your data so it’s ready for searching. Perhaps you’re from a startup, or maybe part of a skunkworks projects in a larger organisation. What you need is someone who really understands search software and what can be done with it to sit with you for a day or two, validate your technical choices, help you understand how to shape your data, even play with some basic indexing.

The proof of concept

You’re a little further along – you know what technology you’ll be using and you have some data all ready for indexing. However, before your funders or boss will release more budget you need to build something they can see (and search) – you’ll need an indexer and a basic search application. You could do it yourself but time is limited and you’ve not built a search application before. You’re expecting to spend a week or two developing something to show others, that lets them search real data and see real results. You might also want to experiment with scale – see what happens to performance when you add a few million items to the index, even if the schema isn’t quite right yet.

The big one

You’re building the big one – indexing complex data or many millions of items, and/or for a huge user base. You need to be very sure your indexing pipeline is fast, scales well, copes with updates and can transform data from many sources. You need to develop the very best search schema. Your search architecture must be resilient, cope with heavy load, failover cleanly and give the correct results. You’re assembling a team to build it but you need specialist help from people who have built this kind of system at scale before.

The migration

Finally you’ve secured budget to move away from the slow and innacurate search engine that everyone hates! Search really does suck, but you now have a chance to make it better. However, although you know how to keep the old engine running you don’t have much experience of open source search. Even though the old engine isn’t great, you’re doing a lot of business with it and you want to be confident that relevance is as good (and hopefully better) with the new engine – maybe you want to develop a testing framework?

We’re also increasingly delivering training (both for business users who want to know the capabilities of open source search and for technical users who want to improve their knowledge – we can tailor this to your requirements) and ongoing support – but everything starts with a search project of some kind!

When bad search hurts: finding that elusive ROI

One thing I’ve noticed from many years attending search conferences is that the return on investment (ROI) in search technology is hard to calculate: this is particularly difficult when considering intranet and/or enterprise search, as users can usually find another way to answer their question. In most cases, some slightly tired numbers are trotted out from years-old studies, based on how much employee time a better search engine might save. I’ve never really believed this to be a sensible metric; however if you’re trying to sell a search solution (especially an overpriced, closed source, magic black box that ‘understands your content’) it may be all you can rely on.

We’ve been working recently with a couple of clients who sell significant volumes of products via their websites: in both cases the current search solution is underperforming. In this situation it’s far easier to justify an investment in better search: if customers can’t find products they simply won’t be buying them. Simple changes to the search algorithms can have huge impacts, costing or saving the company millions in revenue. However, installing a new search engine is just the beginning – it’s vital to consider a solid test strategy. To start with, the new engine should be at least as good as the old one in terms of relevance. The search logs will show what terms your users have typed into the search box, and these can be used to construct a set of queries that can be tested against both engines, with the results being scored by content experts, beta testers and even small groups of real customers. The results of this scoring can be used to inform how the new engine can be tuned – and of course, this should be an ongoing process, as search should never be built as a ‘fire and forget’ project.

There’s sadly little available on the subject of real-world relevance testing, although there’s a forthcoming book by Doug Turnbull and and John Berryman. Based on our work with the clients above, we hope later this year to be able to talk further about relevance testing and tuning – and how to do it right for e-commerce, avoiding significant financial risk.

UPDATE: Doug has been kind enough to give our readers a discount code for his book – “39turnbull” – not sure how long this will last but it gives you 39% off the price so worth a try!

Tags: , , ,

Posted in Uncategorized

April 29th, 2015

1 Comment »

IntraTeam 2015 – a brief visit

Last week I dropped in on the IntraTeam 2015 conference in Copenhagen, an event focused on intranets with some content on enterprise search. After a rather pleasant evening of Thai food and networking I attended the last day of the event. The keynote speaker was Dave Snowden, who has an amusing and rather curmudgeonly style of presentation, making sure to note the previous presenters he’d disagreed with for their over-reliance on simplistic concepts of knowledge and how the brain works. His talk was however very interesting and introduced the Cynevin framework (a Welsh word which apparently refers to homing sheep!). He also discussed how the rush to digitisation has had a cost in terms of human cognition, how the concept of an intranet will soon disappear (a brave assertion at an intranet conference) and how future systems should perhaps use storytelling metaphors – with some great examples of how collecting these micro-narratives from employees and others can produce extremely rapid feedback on the health of a business.

Andreas Hallgren of Chalmers University showed the evolution of their site-wide search facility, now based on Apache Solr. Unsurprisingly one of the main problems was determining who ‘owns’ search in their organisation: at least now they have a staff member who dedicates 25% of their time to improving search. He had some interesting points about the seasonality of academic searches and how analytics can be used to ‘measure more, guess less’. I was up next talking about Search Turned Upside Down, using a similar set of slides to this one: thanks to all who came and asked some great questions.

Next was Helen Lippell who I have heard speak before on how to get Enterprise Search right – Helen had some great anecdotes and guidance for an attentive audience. Ed Dale followed with five tips for great search: index the right content, optimise this content, measure search, make a great UI and listen to your users – I can only agree! He also characterised the different kinds of content including the worrying ‘content we think we have but we don’t’. The last presentation I attended was by Anders Quitzau of IBM on their fascinating Watson technology: sadly this was a rather marketing-heavy set of slides, with plenty of newly minted buzzwords such as Cognitive Computing and very little useful detail.

Thanks to Kurt Kragh Sorenson and Kristian Norling for inviting me to speak and attend the conference, next time I hope to see a little more of the event!

Search Solutions 2013, a review

Yesterday was the always interesting Search Solutions one day conference held by the BCS IRSG in London, a mix of talks on different aspects of search. The first presentation was by Behshad Behzadi of Google on Conversational Search, where he showed a speech-capable search interface that allowed a ‘conversation’ with the search engine – context being preserved – so the query “where are Italian restaurants in Chelsea” followed by “no I prefer Chinese” would correctly return results about Chinese restaurants. The demo was impressive and we can expect to see more of this kind of technology as smartphone adoption rises. Wim Nijmeijer of Coveo followed with details of how their own custom connectors to a multitude of repositories could enable Complex enterprise search delivered in a day. This of course assumes that no complex mapping of fields or schemas from the source to the search engine index is necessary, which I suspect it often is – I’m not alone in being slightly suspicious of the supposed timescale. Nikolaos Nanas from Thessaly in Greece then presented on Adaptive Information Filtering: from theory to practise which I found particularly interesting as it described filtering documents against a user’s interest with the latter modelled by an adaptive, weighted network – he showed the Noowit personalised magazine application as an example. With over 1000 features per user and no language specific requirements this is a powerful idea.

After a short break we continued with a talk by Henning Rode on CV Search at TextKernel. He described a simple yet powerful UI for searching CVs (resumes) with autosuggest and automatic field recognition (type in “Jav” and the system suggests “Java” and knows this is a programming language or skill). He is also working on systems to autogenerate queries from job vacancies using heuristics. We’ve worked in the recruitment space ourselves so it was interesting to hear about their approach, although the technical detail was light. Following Henning was Dermot Frost talking about Information Preservation and Access at the Digital Repository of Ireland and their use of open source technology including Solr and Blacklight to build a search engine with a huge variety of content types, file formats and metadata standards across the items they are trying to digitally preserve. Currently this is a relatively small collection of data but they are planning to scale up over the next few years: this talk reminded me a little of last year’s by Emma Bayne of the UK’s National Archive.

After lunch we began a session named Understanding the User, beginning with Filip Radlinski of Microsoft Research. He discussed Sensitive Online Search Evaluation (with arXiv.org as a test collection) and how interleaved results is a powerful technique for avoiding bias. Next was Mounia Lalmas of Yahoo! Labs on what makes An Engaging Click (although unfortunately I had to pop out for a short while so I missed most of what I am sure was a fascinating talk!). Mags Hanley was next on Understanding users search intent with examples drawn from her work at TimeOut – the three main lessons being to know the content in context, the time of year and the users’ mental model in context. Interestingly she showed how the most popular facets used differed across TimeOut’s various international sites – in Paris the top facet was perhaps unsurprisingly ‘cuisine’, in London it was ‘date’.

After another short break we continued with Helen Lippell’s talk on Enterprise Search – how to triage problems quickly and prescribe the right medicine – her five main points being analyze user needs, fix broken content, focus on quick wins in the search UI, make sure you are able to tweak the search engine itself in a documentable fashion and remember the importance of people and process. Her last point ‘if search is a political football, get an outsider perspective’ is of course something we would agree with! Next was Peter Wallqvist of Ravn Systems on Universal Search and Social Networking where he focussed on how to allow users to interact directly with enterprise content items by tagging, sharing and commenting – so as to derive a ‘knowledge graph’ showing how people are connected by their relationships to content. We’ve built systems in the past that have allowed users to tag items in the search result screen itself so we can agree on the value of this approach. Our last presenter with Kristian Norling of Findwise on Reflections on the 2013 Enterprise Search Survey – some more positive news this year, with budgets for search increasing and 79% of respondents indicating that finding information is of high importance for their organisation. Although most respondents still have less than one full time staff member working on search, Kristian made the very good point that recruiting just one extra person would thus give them a competitive advantage. Perhaps as he says we’ve now reached a tipping point for the adoption of properly funded enterprise search regarded as an ongoing journey rather than a ‘fire and forget’ project.

The day finished with a ‘fishbowl’ session, during which there was a lot of discussion of how to foster links between the academic IR community and industry, then the BCS IRSG AGM and finally a drinks reception – thanks to all the organisers for a very interesting and enlightening day and we look forward to next year!

Search Solutions 2011 review

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where for example a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently). Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV program recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.