Posts Tagged ‘intranet’

Building bridges in the Cloud with open source search

We’ve just published a case study on our work for C Spencer Ltd., a UK-based civil engineering company that takes a proactive approach to document management. Instead of taking the default SharePoint route or buying another product off the shelf, they decided to create their own in-house system based on open source components, hosted on the Amazon Web Services (AWS) cloud. We’ve helped them integrate Apache Solr to provide full text search across the millions of items held in the document management system, with sub-second response times. Their staff can now find letters, contracts, emails and designs quickly via a web interface.

C Spencer are known for their innovative and modern approach – they’re even building their own green power station on a brownfield site in Hull. It’s thus not surprising that they chose cutting-edge open source technology for search: tracking and managing documents correctly is extremely important to their business.

ECIR 2011 Industry Day – part 1 of 2

As promised, here’s a writeup of the day itself. I’ve split this into two parts.

The first presentation was from Martin Szummer of Microsoft Research on ‘Learning to Rank’. I’d seen some of the content before, presented by Mike Taylor at our own Cambridge Search Meetup, but Martin had the chance to go into more detail about a ‘recipe’ for learning to rank a set of results, using gradient descent. One application he suggested was merging lists of results from different, although related, queries: for example, in a situation where users don’t know how best to phrase a query, the engine can suggest alternatives (“jupiter’s mass” / “mass of jupiter”), carry out several searches and merge the results to provide a single best list. There were some fascinating ideas here, although it may be a while before we see practical applications in enterprise search.
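
To make the merging idea concrete, here’s a minimal Python sketch. It uses reciprocal rank fusion, a simple and widely used merging heuristic, rather than the learned approach Martin described; the document ids, the function name and the constant k=60 are all assumptions for illustration.

```python
from collections import defaultdict


def merge_ranked_lists(result_lists, k=60):
    """Merge several ranked lists of document ids with reciprocal rank fusion.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked highly by several queries rise to the top.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for two phrasings of the same information need.
jupiters_mass = ["doc_a", "doc_b", "doc_c"]
mass_of_jupiter = ["doc_b", "doc_d", "doc_a"]

merged = merge_ranked_lists([jupiters_mass, mass_of_jupiter])
print(merged)
```

Documents that appear near the top of both lists end up ahead of those returned by only one of the queries, which is roughly the behaviour you’d want from a merged list.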

Doug Aberdeen of Google was next with a description of the Gmail Priority Inbox. The system looks at 300 features of email to attempt to predict what is ‘important’ email for each user – starting with a default global model (so it works ‘out of the box’) and then adjusting slightly over time. Some huge challenges here due to the scale of the problem (we all know how big Gmail is!) and also due to the fact that the team can’t debug with ‘real’ data – as everyone’s email is private. Luckily various Googlers have allowed their email accounts to be used for testing.

Richard Boulton of Celestial Navigation followed with a discussion of some practical search problems he’s encountered, in particular when working on the Artfinder website. Some good lessons here: “search is a commodity”, “a search system is never finished” and “search providers have different aims to users”. He discussed how he developed an ‘ArtistRank’ to solve problems of what exactly to return for the query ‘Leonardo’, and how eventually a four-way classification system was developed for the site. One good tip he had for debugging ranking was an ‘explain view’ showing exactly how positions in a list of results are calculated.

After a short break we had Tyler Tate, who also spoke recently in Cambridge, so I won’t repeat his slides here. Next was Martin White of Intranet Focus, introducing his method for benchmarking search results within organisations. He suggested that search within enterprises is often in a pretty bad state – which our experience at Flax bears out – and showed a detailed checklist approach to evaluating and improving the search experience. His checklist has a theoretical maximum score of 270; sadly, very few companies manage more than 50 points.

We then moved to lunch – I’ll write about the afternoon sessions in a subsequent post.

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
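
As a rough illustration of the filtering step (not the actual Flax code), the Python sketch below stores the users and groups allowed to read each document alongside its text, and drops any hit the searching user can’t read. The class, function names and sample data are hypothetical, and the user’s groups are assumed to have already been looked up, for example from LDAP.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """A document as stored in a toy index, with its read permissions."""
    path: str
    text: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def can_read(doc, user, groups):
    """True if the user is named directly or shares a group with the document."""
    return user in doc.allowed_users or bool(doc.allowed_groups & set(groups))


def search(index, query, user, groups):
    """Naive substring search that returns only documents the user may read."""
    query = query.lower()
    return [
        doc.path
        for doc in index
        if query in doc.text.lower() and can_read(doc, user, groups)
    ]


# Hypothetical index contents and users.
index = [
    IndexedDoc("/specs/antenna.txt", "Antenna specification draft", allowed_users={"alice"}),
    IndexedDoc("/designs/antenna.txt", "Antenna design notes", allowed_groups={"engineering"}),
    IndexedDoc("/hr/antenna-team.txt", "Antenna team salaries", allowed_groups={"hr"}),
]

alice_hits = search(index, "antenna", user="alice", groups=["engineering"])
bob_hits = search(index, "antenna", user="bob", groups=["hr"])
print(alice_hits, bob_hits)
```

A real search engine would apply the permission check as a filter inside the query itself rather than post-filtering a result list, so that result counts and facets also reflect only what the user is allowed to see.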

You can read more about the project in a case study (PDF) and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Intranet search event

Intranet Search was the theme for a small gathering last night at the (rather imposing) Ministry of Justice in London. We heard from Luke Oatham on intranet search at the Ministry itself, powered by Google over a reasonably small set of static and hand-published HTML. Simon Thompson continued with a neat way of enhancing SharePoint search, using jQuery to create an auto-complete tool for his company intranet, which interestingly displayed both ‘people’ and ‘page’ results in the same drop-down menu. Tyler Tate couldn’t make it to the event due to bad weather, but bravely volunteered to present over Skype on a (surprisingly good) 3G connection, and talked about handling diverse data (video, slides). Next up was our very own Tom Mortimer talking about indexing security information (of which more later) and we finished up with a quick talk from Rangi Robinson on the intranet at Framestore, with search powered by the open source Sphinx project.

Thanks to Simon Thompson and Angel Brown for organising the event and inviting us to speak.

Posted in events

December 3rd, 2010

Find out how we build in document security for open source search

My colleague Tom Mortimer will be talking at the London Intranet Show & Tell on 2nd December, about how to implement document-level security for search: his presentation is titled “Implementing ACLs in an open source search solution”.

There are still a few tickets left for this small event, which will be of value to those working on intranet search.

Posted in events

November 12th, 2010

Log analysis and adaptive search

I attended an interesting talk by Udo Kruschwitz on Adaptive Intranet Search last night as part of the Enterprise Search London Meetup. Udo has built a search engine for the University of Essex and has been investigating how to help users to refine their query using techniques such as suggesting related terms (there’s a similar feature in Xapian called ‘top terms’ – here’s an example). As part of this he’s done a great deal of analysis of query and session logs, and is building up expertise on automatically maintained domain knowledge – moving away from the traditional model of manually maintained networks relating one word or phrase to another. For example, his system is learning automatically that when users type “map” into the search box, they really want to search for “campus map”. The number of documents in his test collection is small and the volume of searches is low; it will be interesting to see how these ideas scale to larger collections and groups of users.
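
A toy version of this kind of log analysis might look like the following Python sketch. It is our own simplification rather than Udo’s method: it counts how often a query is immediately followed, in the same session, by a longer query containing all the same words, and keeps the most common follow-up. The session data, function name and threshold are invented for illustration.

```python
from collections import Counter, defaultdict


def learn_refinements(sessions, min_count=2):
    """Learn 'query -> most common refinement' pairs from session logs.

    `sessions` is a list of sessions, each a list of query strings in the
    order the user typed them.
    """
    follow_ups = defaultdict(Counter)
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            first, second = first.strip().lower(), second.strip().lower()
            # Count only follow-ups that add words to the original query.
            if set(first.split()) < set(second.split()):
                follow_ups[first][second] += 1

    suggestions = {}
    for query, counts in follow_ups.items():
        refinement, count = counts.most_common(1)[0]
        # Ignore refinements seen too rarely to be trusted.
        if count >= min_count:
            suggestions[query] = refinement
    return suggestions


# Hypothetical session logs.
sessions = [
    ["map", "campus map"],
    ["map", "campus map", "library opening hours"],
    ["map", "bus map"],
    ["timetable"],
]

suggestions = learn_refinements(sessions)
print(suggestions)
```

With so few sessions the threshold matters a great deal, which echoes the question above about how well such techniques hold up as collections and user groups grow.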

The group was small, informal and seemed to consist mainly of those with expertise in implementing search solutions – no sales or marketing here, just a group of people discussing how best to get the job done.

Posted in Technical, events

July 30th, 2010
