Posts Tagged ‘intranet’

Search Solutions 2013, a review

Yesterday was the always interesting Search Solutions one day conference held by the BCS IRSG in London, a mix of talks on different aspects of search. The first presentation was by Behshad Behzadi of Google on Conversational Search, where he showed a speech-capable search interface that allowed a ‘conversation’ with the search engine – context being preserved – so the query “where are Italian restaurants in Chelsea” followed by “no I prefer Chinese” would correctly return results about Chinese restaurants. The demo was impressive and we can expect to see more of this kind of technology as smartphone adoption rises. Wim Nijmeijer of Coveo followed with details of how their own custom connectors to a multitude of repositories could enable Complex enterprise search delivered in a day. This of course assumes that no complex mapping of fields or schemas from the source to the search engine index is necessary, which I suspect it often is – I’m not alone in being slightly suspicious of the supposed timescale. Nikolaos Nanas from Thessaly in Greece then presented on Adaptive Information Filtering: from theory to practice, which I found particularly interesting as it described filtering documents against a user’s interest, with the latter modelled by an adaptive, weighted network – he showed the Noowit personalised magazine application as an example. With over 1000 features per user and no language-specific requirements this is a powerful idea.
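
The context-preserving behaviour of the restaurant demo can be imagined as simple slot filling. The sketch below is entirely my own invention – the slot names and cuisine vocabulary are made up, and Google's actual system is certainly far more sophisticated:

```python
import re

# A toy vocabulary; a real system would use large-scale entity recognition.
CUISINES = {"italian", "chinese", "indian", "thai"}

def refine_query(context, utterance):
    """Carry the previous query's slots forward, overriding only the
    slots the follow-up utterance actually mentions."""
    for word in re.findall(r"[a-z]+", utterance.lower()):
        if word in CUISINES:
            context = {**context, "cuisine": word}
    return context

# "where are Italian restaurants in Chelsea" fills both slots;
# the follow-up only overrides the cuisine, keeping the location.
ctx = {"cuisine": "italian", "location": "chelsea"}
print(refine_query(ctx, "no I prefer Chinese"))
# {'cuisine': 'chinese', 'location': 'chelsea'}
```

The key point is that the follow-up utterance is interpreted against the previous query's state rather than in isolation.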

After a short break we continued with a talk by Henning Rode on CV Search at TextKernel. He described a simple yet powerful UI for searching CVs (resumes) with autosuggest and automatic field recognition (type in “Jav” and the system suggests “Java” and knows this is a programming language or skill). He is also working on systems to autogenerate queries from job vacancies using heuristics. We’ve worked in the recruitment space ourselves so it was interesting to hear about their approach, although the technical detail was light. Following Henning was Dermot Frost talking about Information Preservation and Access at the Digital Repository of Ireland, and their use of open source technology including Solr and Blacklight to build a search engine across a huge variety of content types, file formats and metadata standards among the items they are trying to digitally preserve. Currently this is a relatively small collection of data but they are planning to scale up over the next few years: this talk reminded me a little of last year’s by Emma Bayne of the UK’s National Archives.

After lunch we began a session named Understanding the User, beginning with Filip Radlinski of Microsoft Research. He discussed Sensitive Online Search Evaluation (with arXiv.org as a test collection) and how interleaving results is a powerful technique for avoiding bias. Next was Mounia Lalmas of Yahoo! Labs on what makes An Engaging Click (although unfortunately I had to pop out for a short while so I missed most of what I am sure was a fascinating talk!). Mags Hanley was next on Understanding users’ search intent with examples drawn from her work at TimeOut – the three main lessons being to know the content in context, the time of year and the users’ mental model in context. Interestingly she showed how the most popular facets differed across TimeOut’s various international sites – in Paris the top facet was perhaps unsurprisingly ‘cuisine’, in London it was ‘date’.
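
For readers unfamiliar with interleaved evaluation, the idea is to merge results from two rankers into one list shown to real users, then credit clicks to whichever ranker contributed the clicked result. One well-known scheme is team-draft interleaving – I'm not claiming this exact variant was the one presented, and this is only a minimal sketch:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=0):
    """Team-draft interleaving: the two rankers take turns contributing
    their highest-ranked not-yet-used result, a coin flip breaking ties
    in turn order. Clicks on the merged list are later credited to the
    'team' that contributed the clicked result."""
    rng = random.Random(seed)
    merged, team_of = [], {}
    picks_a = picks_b = 0
    while True:
        next_a = next((d for d in ranking_a if d not in team_of), None)
        next_b = next((d for d in ranking_b if d not in team_of), None)
        if next_a is None and next_b is None:
            break
        a_turn = picks_a < picks_b or (picks_a == picks_b and rng.random() < 0.5)
        if (a_turn and next_a is not None) or next_b is None:
            merged.append(next_a)
            team_of[next_a] = "A"
            picks_a += 1
        else:
            merged.append(next_b)
            team_of[next_b] = "B"
            picks_b += 1
    return merged, team_of

merged, team_of = team_draft_interleave([1, 2, 3], [2, 3, 4])
print(merged)  # each document appears once, teams roughly balanced
```

Because both rankers contribute on equal terms, the comparison avoids the position bias that plagues naive side-by-side click counting.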

After another short break we continued with Helen Lippell’s talk on Enterprise Search – how to triage problems quickly and prescribe the right medicine – her five main points being: analyse user needs, fix broken content, focus on quick wins in the search UI, make sure you are able to tweak the search engine itself in a documentable fashion, and remember the importance of people and process. Her last point – ‘if search is a political football, get an outsider perspective’ – is of course something we would agree with! Next was Peter Wallqvist of Ravn Systems on Universal Search and Social Networking, where he focussed on how to allow users to interact directly with enterprise content items by tagging, sharing and commenting – so as to derive a ‘knowledge graph’ showing how people are connected by their relationships to content. We’ve built systems in the past that have allowed users to tag items in the search result screen itself, so we can agree on the value of this approach. Our last presenter was Kristian Norling of Findwise on Reflections on the 2013 Enterprise Search Survey – some more positive news this year, with budgets for search increasing and 79% of respondents indicating that finding information is of high importance for their organisation. Although most respondents still have less than one full-time staff member working on search, Kristian made the very good point that recruiting just one extra person would thus give them a competitive advantage. Perhaps, as he says, we’ve now reached a tipping point for the adoption of properly funded enterprise search, regarded as an ongoing journey rather than a ‘fire and forget’ project.

The day finished with a ‘fishbowl’ session, during which there was a lot of discussion of how to foster links between the academic IR community and industry, then the BCS IRSG AGM and finally a drinks reception – thanks to all the organisers for a very interesting and enlightening day and we look forward to next year!

A belated report on Enterprise Search Europe 2013

Earlier this month I attended the third Enterprise Search Europe conference, this time not to speak but to run workshops, panels, tracks and social events. On Tuesday a colleague and I gave a workshop on Getting the Best from Open Source Search, which I hope was useful to attendees. One thing I did take away is how widely the level of experience with open source, and indeed with search technology itself, can vary: some attendees had already experimented extensively with Apache Lucene/Solr while others simply wanted to expand their knowledge of the associated risks and benefits of this approach.

The first day of the conference started with Ed Dale of Ernst & Young talking about implementing enterprise search for a truly global organisation. E&Y’s search is over a surprisingly small number of documents (only 2 million or so) but they are lucky enough to have a relatively large and experienced team running their search as an ongoing operation – no ‘fire and forget’ here (an approach often taken, and seldom successful). We moved on to hear from Kristian Norling on the second year of Findwise’s Enterprise Search Survey (some interesting numbers, with the full results available soon) and then a fascinating and amusing talk from Joe Lamantia on the Language of Discovery, backed up by a second talk from Tyler Tate – it seems Discovery might be a better term for what we call Search, at least from a usability perspective. The morning ended with Steven Arnold’s provocative take on how the performance of search technology hasn’t improved measurably in many decades due to processing limitations, and how the rise of Big Data is only going to compound the problem.

The afternoon began with a panel session on the future of open source search – my personal thanks to Daniel Lee of Artirix, Eric Pugh of Open Source Connections and René Kriegler for leading a lively discussion on the seemingly inexorable rise of open source search and what may happen next. There were some interesting points raised on how significant investment in open source search may change the picture. We continued in the open source theme with talks on open source solutions for the City of Antibes and Shopping24, before a drinks reception and then a move to the pub across the road for the combined London and Cambridge Search Meetup. Our theme was ‘The Nightmare before Search’ – some great (and unbloggable!) war stories about crazy search implementations were followed by networking late into the night.

The next day began with a session on search implementation from speakers including Dan Foster of Legal & General, followed by a track on Big Data during which we heard from Eric Pugh on building a very large scale system using open source software – sadly I had to drop out at this point for meetings and only returned for the closing plenary sessions. I particularly enjoyed Kara Pernice’s insights on how to build usable intranet search and Valentin Richter’s session on migrating to a new search technology (a topic on many minds, especially for those using FAST ESP, which goes out of mainstream support in a couple of months). Lynda Moulton did her best to sum up what we had learnt over the last few days – a very hard job when the event covered so many aspects of search & discovery.

Many thanks to Information Today and chair Martin White as ever for organising the event – although it was an intense few days it was great to catch up with everyone and to talk search. We’re looking forward to next year – did I hear a rumour that the Europe in the title might be more emphasized next time? We shall see!


Posted in events

May 28th, 2013


Search Solutions 2012 – a review

Last Thursday I spent the day at the British Computer Society’s Search Solutions event, run by their Information Retrieval Specialist Group. Unlike some events I could mention, this isn’t a forum for sales pitches, over-inflated claims or business speak – just some great presentations on all aspects of search and some lively networking or discussion. It’s one of my favourite events of the year.

Milad Shokouhi of Microsoft Research started us off with his work on query trend analysis for Bing: some queries are regular, some spike and go, and some spike and remain – and these trends can be modelled in various ways. Alex Jaimes of Yahoo! Barcelona talked about a human-centred approach to search – I agree with his assertion that “we’re great at adapting to bad technology” – still sadly true for many search interfaces! Some of the demographic approaches have led to projects such as Yahoo! Clues, which is worth a look.
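
To make the three trend shapes concrete, here is a toy heuristic of my own – nothing to do with Bing's actual models, which are far more sophisticated – that classifies a series of daily query counts:

```python
def classify_trend(daily_counts, spike_factor=3.0):
    """Toy classifier for three query-trend shapes: 'regular' if no day
    reaches spike_factor times the pre-peak average; otherwise
    'spike and remain' if volume stays elevated after the peak,
    else 'spike and go'."""
    peak = max(daily_counts)
    peak_i = daily_counts.index(peak)
    before = daily_counts[:peak_i] or [peak]   # guard: peak on day one
    after = daily_counts[peak_i + 1:] or [peak]  # guard: peak on last day
    base = sum(before) / len(before)
    if peak < spike_factor * base:
        return "regular"
    tail = sum(after) / len(after)
    return "spike and remain" if tail > 2 * base else "spike and go"

print(classify_trend([10, 10, 50, 10, 10]))  # spike and go
print(classify_trend([10, 10, 50, 40, 40]))  # spike and remain
```

Real trend modelling would of course account for seasonality and noise rather than a single peak.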

Martin White of Intranet Focus was up next with some analysis of recent surveys and research, leading to some rather doom-laden conclusions about just how few companies are investing sufficiently in search. Again some great quotes: “Information Architects think they’ve failed if users still need a search engine” and a plea for search vendors (and open source exponents) to come clean about what search can and can’t do. Emma Bayne of the National Archives was next with a description of their new Discovery catalogue, a similar presentation to the one she gave earlier in the year at Enterprise Search Europe. Kristian Norling of Findwise finished with a laconic and amusing treatment of the results from Findwise’s survey on enterprise search – indicating that those who produce systems with which users are “very satisfied” usually do the same things, such as regular user testing and employing a specialist internal search team.

Stella Dextre Clarke talked next about a new ISO standard for thesauri, taxonomies and their interoperability with other vocabularies – some great points on the need for thesauri to break down language barriers, to help retrieval in enterprise situations where techniques such as PageRank aren’t so useful, and to access data from decades past. Leo Sauermann was next with what was my personal favourite presentation of the day, about a project to develop a truly semantic search engine, originally for KDE Linux and now for the Cloud. This system, if more widely adopted, promises a true revolution in search, as relationships between data objects are stored directly by the underlying operating system. I spoke next about our Clade taxonomy/classification system and our Flax Media Monitor, which I hope was interesting.

Nicholas Kemp of DSTL was up next exploring how they research new technologies and approaches which might be of interest to the defence sector, followed by Richard Morgan of Funnelback on how to empower intranet searchers with ways to improve relevance. He showed how Funnelback’s own intranet allows users to adjust multiple factors that affect relevance – of course it’s debatable how these may be best applied to customer situations.

The day ended with a ‘fishbowl’ discussion during which a major topic was of course the Autonomy/HP debacle – there seemed to be a collective sense of relief that perhaps now marketing and hype wouldn’t dominate the search market as much as it had previously…but perhaps also that’s just my wishful thinking! All in all this was as ever an interesting and fun day and my thanks to the IRSG organisers for inviting me to speak. Most of the presentations should be available online soon.

The death of enterprise search is reported, again

There’s no doubt that the search market has been in turmoil for many months now: traditional, closed source vendors are either frantically repositioning to avoid the ‘juggernaut that is Apache’s Solr/Lucene project’ or attempting to bore customers to death with PowerPoint. Our sources tell us that in the UK at least, sales of most closed source search engines have flatlined – not at all surprising when freely available alternatives exist. Luckily there are some parts of the sector with some energy: Attivio (with $34m of new funding to spend) and Lucidworks are still working hard on their search products, but even these rely heavily on an open source core.

Enter a company without any history or experience in the search market, Huddle, with a tired message about the death of Enterprise Search. I’m not entirely sure what the point of this article is, but apparently the lack of contextual information is the problem – “You have to do research in 50 places — email, Web, C-drives, the cloud, even inside people’s heads.” I look forward to a brain-compatible indexing tool! There’s also the mistaken assumption that what works for the wider consumer-focused Web will work for the enterprise – Amazon.com, Google and the iPad/iPhone are all namechecked. Enterprise data simply isn’t like web or consumer data – it’s characterised by rarity and unconnectedness rather than popularity and context.

Unfortunately in most enterprises simply sprinkling on social or collaborative features will not fix the most common search problems: a mishmash of unconnected legacy systems, unreliable and inconsistent metadata, a complex and untested security model (at least within the context of being able to search for everything – your boss’s salary, for example) and usually the lack of a dedicated team responsible for search. Enterprise Search is hard, and few projects get beyond basic indexing of filestores and databases, let alone adding in more people-focused features.

I couldn’t find much about search on Huddle’s website, but what I did find implied that information must first be extracted from existing legacy systems and stored centrally. If you can manage this – preserving a consistent metadata model, maintaining full security, and coping with legacy formats and updates – then search should be relatively simple to implement on the resulting central store; however, the devil is as ever in the detail.


Posted in News

October 25th, 2012


Building bridges in the Cloud with open source search

We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company who take a pro-active approach to document management – instead of taking the default Sharepoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon AWS Cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with a sub-second response. Their staff can now find letters, contracts, emails and designs quickly via a web interface.

C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.

ECIR 2011 Industry Day – part 1 of 2

As promised here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related, queries: for example, in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results into a single best list. Some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.
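
Martin's recipe learns the merge via gradient descent, which is well beyond a blog post; as a much simpler (and unlearned) illustration of combining ranked lists from related queries, here is reciprocal rank fusion, a different technique entirely but one that shows the basic idea:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Score each document 1/(k + rank) in every list it appears in,
    then order by total score. k=60 is the constant from the original
    RRF paper; it damps the influence of any single list's top ranks."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for two phrasings of the same question:
merged = reciprocal_rank_fusion([
    ["doc_jupiter", "doc_mass", "doc_orbit"],  # "jupiter's mass"
    ["doc_mass", "doc_orbit", "doc_moons"],    # "mass of jupiter"
])
print(merged[0])  # doc_mass: ranked highly by both phrasings
```

Documents that appear near the top of several lists accumulate the most score, so agreement between the query variants is rewarded.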

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve the problem of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.

After a short break we had Tyler Tate, who again spoke recently in Cambridge, so I won’t cover his slides again here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
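
Conceptually, the security filtering works like the sketch below. The field names and group model here are illustrative only, not our actual schema – and in production the filter is applied inside the search engine itself (against indexed ACL fields) rather than by post-filtering results, which would break pagination and facet counts:

```python
def visible_to(user_groups, results):
    """Keep only results whose read ACL intersects the user's groups
    (looked up at login time, e.g. from LDAP). A real deployment pushes
    this check into the engine as a filter on an indexed ACL field."""
    return [r for r in results if user_groups & set(r["acl_read"])]

results = [
    {"path": "/docs/contract.pdf", "acl_read": ["legal", "board"]},
    {"path": "/docs/newsletter.txt", "acl_read": ["staff"]},
]
print(visible_to({"staff"}, results))  # only the newsletter is visible
```

This is "early binding" security: permissions are captured at index time, so a result a user cannot read is never shown to them at all.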

You can read more about the project in a case study (PDF) and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Intranet search event

Intranet Search was the theme for a small gathering last night at the (rather imposing) Ministry of Justice in London. We heard from Luke Oatham on intranet search at the Ministry itself, powered by Google over a reasonably small set of static and hand-published HTML. Simon Thompson continued with a neat way of enhancing Sharepoint search, using JQuery to create an auto-complete tool for his company intranet, which interestingly displayed both ‘people’ and ‘page’ results in the same drop-down menu. Tyler Tate couldn’t make it to the event due to bad weather, but bravely volunteered to present over Skype on a (surprisingly good) 3G connection, and talked about handling diverse data (video, slides). Next up was our very own Tom Mortimer talking about indexing security information (of which more later) and we finished up with a quick talk from Rangi Robinson on the intranet at Framestore, with search powered by the open source Sphinx project.

Thanks to Simon Thompson and Angel Brown for organising the event and inviting us to speak.


Posted in events

December 3rd, 2010


Find out how we build in document security for open source search

My colleague Tom Mortimer will be talking at the London Intranet Show & Tell on 2nd December, about how to implement document-level security for search: his presentation is titled “Implementing ACLs in an open source search solution”.

There are still a few tickets left for this small event, which will be of value to those working on intranet search.


Posted in events

November 12th, 2010


Log analysis and adaptive search

I attended an interesting talk by Udo Kruschwitz on Adaptive Intranet Search last night as part of the Enterprise Search London Meetup. Udo has built a search engine for the University of Essex and has been investigating how to help users to refine their query using techniques such as suggesting related terms (there’s a similar feature in Xapian called ‘top terms’ – here’s an example). As part of this he’s done a great deal of analysis of query and session logs, and is building up expertise on automatically maintained domain knowledge – moving away from the traditional model of manually maintained networks relating one word or phrase to another. For example, his system is learning automatically that when users type “map” into the search box, they really want to search for “campus map”. The number of documents in his test collection is small and the volume of searches is low; it will be interesting to see how these ideas scale to larger collections and groups of users.
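
The "map" to "campus map" example can be learnt from session logs with surprisingly little machinery. This is my own toy version of the idea, not Udo's algorithm – a real system would also handle non-substring reformulations, weight by clicks, and decay old data:

```python
from collections import Counter, defaultdict

def reformulation_suggestions(sessions, min_count=2):
    """Count how often a query is immediately refined into a longer
    query containing it within the same session; pairs seen at least
    min_count times become suggestions (e.g. 'map' -> 'campus map')."""
    counts = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second and first in second:
                counts[first][second] += 1
    return {
        q: followups.most_common(1)[0][0]
        for q, followups in counts.items()
        if followups.most_common(1)[0][1] >= min_count
    }

sessions = [
    ["map", "campus map"],
    ["map", "campus map", "campus map pdf"],
    ["library", "library hours"],
]
print(reformulation_suggestions(sessions))  # {'map': 'campus map'}
```

As the post notes, the interesting question is whether signals this simple stay reliable as the collection and user base grow.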

The group was small, informal and seemed to consist mainly of those with expertise in implementing search solutions – no sales or marketing here, just a group of people discussing how best to get the job done.


Posted in Technical, events

July 30th, 2010
