Archive for January, 2011

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters, and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server, and we use the authenticated identity to show only the results that particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
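To make the idea of security-filtered results concrete, here is a minimal sketch in Python. It is not the Flax code or API: the names IndexedDoc, principals_for and search are hypothetical, the in-memory list stands in for a real search index, and the user's groups are passed in directly where a real system would look them up in LDAP. It also simplifies Unix permissions and ACLs down to a flat set of allowed users and groups stored with each document at index time.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """Hypothetical index entry: text plus security and tag metadata."""
    path: str
    text: str
    # Principals allowed to read the file, captured at crawl time,
    # e.g. "user:alice" or "group:eng" (a simplification of real ACLs).
    allowed: frozenset
    tags: set = field(default_factory=set)


def principals_for(user, groups):
    """Build the set of principals a logged-in user can act as."""
    return {f"user:{user}"} | {f"group:{g}" for g in groups}


def search(index, query, user, groups, required_tag=None):
    """Return paths of documents that match every query term, that the
    user may read, and that carry required_tag if one is given."""
    who = principals_for(user, groups)
    terms = query.lower().split()
    hits = []
    for doc in index:
        # Security filter: skip anything the user has no right to see.
        if not (doc.allowed & who):
            continue
        # Optional tag filter, e.g. only documents marked 'current'.
        if required_tag is not None and required_tag not in doc.tags:
            continue
        words = doc.text.lower().split()
        if all(term in words for term in terms):
            hits.append(doc.path)
    return hits


# Toy data, invented for illustration only.
INDEX = [
    IndexedDoc("/specs/antenna.txt", "antenna design spec",
               frozenset({"group:eng"}), {"current"}),
    IndexedDoc("/hr/pay.txt", "pay review spec",
               frozenset({"user:carol", "group:hr"})),
    IndexedDoc("/specs/old_antenna.txt", "antenna design spec draft",
               frozenset({"group:eng", "user:bob"})),
]

if __name__ == "__main__":
    print(search(INDEX, "spec", "alice", ["eng"]))
    print(search(INDEX, "spec", "alice", ["eng"], required_tag="current"))
```

Storing the allowed principals alongside each document means the permission check happens inside the search itself, so a user never receives a hit they cannot open, and result counts and facets stay consistent with what they can actually see.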

You can read more about the project in a case study (PDF), and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Ovum says – why bother with closed source search?

Analysts Ovum have released a report on enterprise search – it’s not clear where to obtain it yet, although Report Linker may have it available. According to one report it may also be called “Enterprise Search and Retrieval: Exploiting all of the Organisation’s Information Assets”.

Interestingly, most of the press coverage around the release is focussing on statements about open source solutions by the author, Mike Davis, in particular: “…in fact, companies should only go to the big proprietary players if open source can’t deliver what they need.” He also states that “there are mere nuances between those ranked”, and this includes the open source option Solr 1.4.

This is the clearest statement yet from an analyst that enterprise search engines are all pretty much the same thing, if you strip away the marketing – but more importantly, that open source should be the first option to consider.

Posted in News

January 21st, 2011

Background resources for Enterprise Search

If you’re planning an enterprise search project and have no background in the technologies or principles involved, here are some tips to get you started. This isn’t going to be a definitive list so if you know more, please do comment.

There haven’t been a lot of books written on this area over the years, but more are appearing now (especially on open source options). Managing Gigabytes is a good, if slightly elderly, starting point on basic principles. For thoughts on search user interfaces try Peter Morville’s Search Patterns and for an application focus there’s the recent Search Based Applications. For those developing in the Lucene/Solr world there’s the classic (and recently updated) Lucene in Action and the related Solr 1.4 Enterprise Search Server and Building Search Applications: Lucene, LingPipe, and Gate.

Most people will (of course) start their research on the web, although sometimes it’s hard to find nuggets of real information amongst all the marketing. Wikipedia has a list of vendors, including open source solutions, and Avi Rappaport maintains the useful (although not completely up to date) Search Tools website. Some vendors and some open source projects provide FAQs and tutorials (for example the Lucene FAQ, Xapian and Sphinx documentation), which may also contain general information about search principles.

You might also consider joining discussion groups such as the popular LinkedIn Enterprise Search Engine Professionals or a local Meetup group. Training is another option – offered by some vendors and open source companies such as ourselves.

Networking in a great city for enterprise search

Cambridge, U.K. has a long history of hosting search experts and businesses. Back in the 1980s two firms arose: Cambridge CD Publishing, founded by Martin Porter and John Snyder, grew into Muscat, and Cambridge Neurodynamics became Autonomy. We believe Smartlogic still have a small office here. Stephen Robertson, co-author of the probabilistic theory of information retrieval (which Xapian uses for ranking), is based here at Microsoft Research.

Today, the city is still home to innovative search companies, including True Knowledge, Grapeshot and of course ourselves. We know of many more working ‘under the radar’ on search technologies, both to complement existing systems and as completely new approaches to information retrieval, including visual search.

To encourage networking and to help keep the city at the forefront of search developments, we’ve created the Enterprise Search Cambridge Meetup group and our first meeting is on February 16th – all are welcome, whether currently working with search and related technologies or simply interested in the possibilities. Hope to meet you there!

Posted in Uncategorized, events

January 14th, 2011
