Open source intranet search over millions of documents with full security
Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.
Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters, and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server, and we use this information, together with the indexed permissions, to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
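To make the idea of permission-aware results concrete, here is a minimal sketch in Python. It is not the Flax implementation: the names `Doc`, `read_terms`, `user_terms`, `build_index` and `search` are hypothetical, the "index" is just an in-memory list, and the user's group list is assumed to have been fetched already (for example from an LDAP lookup). It shows one common approach: store a set of "who may read this" terms with each document at index time, then keep only results whose terms overlap with the terms for the logged-in user.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """A crawled file plus the security metadata captured with it (hypothetical)."""
    path: str
    text: str
    owner: str
    group: str
    mode: int                           # Unix permission bits, e.g. 0o640
    acl_users: frozenset = frozenset()  # extra users granted read by an ACL
    acl_groups: frozenset = frozenset() # extra groups granted read by an ACL


def read_terms(doc):
    """Return the security terms describing who may read the document.

    Simplified on purpose: real Unix and POSIX ACL evaluation has extra
    rules (owner-before-group precedence, ACL masks) that are ignored here.
    """
    terms = set()
    if doc.mode & 0o400:                # owner read bit
        terms.add("u:" + doc.owner)
    if doc.mode & 0o040:                # group read bit
        terms.add("g:" + doc.group)
    if doc.mode & 0o004:                # world read bit
        terms.add("world")
    terms.update("u:" + user for user in doc.acl_users)
    terms.update("g:" + group for group in doc.acl_groups)
    return terms


def user_terms(username, groups):
    """Security terms a user matches: themselves, their groups, and 'world'."""
    return {"u:" + username, "world"} | {"g:" + group for group in groups}


def build_index(docs):
    """Index time: pair every document with its precomputed security terms."""
    return [(doc, read_terms(doc)) for doc in docs]


def search(index, query, username, groups):
    """Query time: naive substring match, filtered by the user's terms."""
    allowed = user_terms(username, groups)
    needle = query.lower()
    return [
        doc.path
        for doc, terms in index
        if needle in doc.text.lower() and terms & allowed
    ]
```

In a real search engine the security terms would typically be indexed alongside the document text and applied as a query filter, so documents the user cannot read are never retrieved in the first place.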
You can read more about the project in a case study (PDF), and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.
Tags: faceted search, file format, flax, indexing, intranet, open source, security, xapian
This entry was posted on Wednesday, January 26th, 2011 at 11:03 am and is filed under News, Technical.
5 Responses to “Open source intranet search over millions of documents with full security”
Jack Parsons 27th January, 2011
How did you handle user access to things like spelling suggestions? You can’t suggest the term “Project Thunderclap” if the user can’t see any files on that topic. “Sorry, nothing matches Project Thunderclap”.
Otis Gospodnetic 28th January, 2011
Yeah, I’m curious, too.
charlie 28th January, 2011
We don’t currently do anything to restrict the spelling suggestions. In theory it is possible that a user could be shown a suggestion from a document they don’t have access to. For this client we’re pretty sure this isn’t a problem.
Tom 28th January, 2011
I feel that single word spelling corrections are not nearly as risky as phrase completion, which this system doesn’t do. Yes, you could find whether “GlobalMegaCorp” was in the index, but you wouldn’t know anything about the context (apart from whether you have access to docs containing it). For this installation, this is an appropriate level of security.