Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.
Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, converted them to plain text using our own open source Flax Filters, and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server, and we combine the authenticated identity with the stored permissions to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
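To give a flavour of what capturing permissions and filtering on them involves, here is a minimal Python sketch. It is not the code from the project: the names FilePermissions, capture_permissions, can_read and visible_results are made up for illustration, it only looks at the classic owner/group/other read bits (ACLs are ignored), and it filters a list of hits after the search rather than inside the index.

```python
import os
import stat
from dataclasses import dataclass


@dataclass(frozen=True)
class FilePermissions:
    """Ownership and mode bits recorded for one crawled file (hypothetical record)."""
    path: str
    owner_uid: int
    group_gid: int
    mode: int


def capture_permissions(path: str) -> FilePermissions:
    """Read owner, group and permission bits so they can be stored with the indexed text."""
    info = os.stat(path)
    return FilePermissions(path, info.st_uid, info.st_gid, stat.S_IMODE(info.st_mode))


def can_read(perms: FilePermissions, uid: int, gids: set[int]) -> bool:
    """Classic Unix check: the owner bits apply first, then group, then other."""
    if uid == perms.owner_uid:
        return bool(perms.mode & stat.S_IRUSR)
    if perms.group_gid in gids:
        return bool(perms.mode & stat.S_IRGRP)
    return bool(perms.mode & stat.S_IROTH)


def visible_results(results: list[FilePermissions], uid: int, gids: set[int]) -> list[FilePermissions]:
    """Keep only the hits the authenticated user is allowed to read."""
    return [hit for hit in results if can_read(hit, uid, gids)]
```

In a sketch like this, the uid and group ids would come from the directory lookup made at login. A real system would more likely push the check into the query itself, so that result counts and facets only reflect documents the user can see.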
You can read more about the project in a case study (PDF), and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.
We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and PowerPoint, the equivalent OpenOffice formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.
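For readers who haven’t met the idea of a filter before, the sketch below shows the general shape in Python: pick a converter by file extension and get plain text back. This is not the Flax Filters API. The names FILTERS, extract_text, html_to_text and plain_to_text are invented for this example, it only covers HTML and plain text using the standard library, and it assumes the input is UTF-8.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(raw: bytes) -> str:
    """Strip markup from an HTML document (assumed to be UTF-8)."""
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    return parser.text()


def plain_to_text(raw: bytes) -> str:
    """Plain text only needs decoding (assumed to be UTF-8)."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical registry: file extension -> converter function.
FILTERS = {".html": html_to_text, ".htm": html_to_text, ".txt": plain_to_text}


def extract_text(filename: str, raw: bytes) -> str:
    """Choose a converter by extension and return plain text ready for indexing."""
    suffix = Path(filename).suffix.lower()
    if suffix not in FILTERS:
        raise ValueError(f"no filter registered for {suffix!r}")
    return FILTERS[suffix](raw)
```

Supporting another format in a design like this means adding one more entry to the registry, which is roughly why third-party contributions are easy to slot in.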
We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.
Feedback would of course be very welcome.
When we’re contacted by potential clients, we have to gather as much information as possible about how and why they need search technology. This takes the form of either a face-to-face or telephone meeting with much scribbling in notebooks, or a long exchange of emails. In all cases there are some important questions that must be answered, and I thought it might be useful to list the most common ones here:
How many items do you need to search?
The number of items to search varies widely, from a few thousand to hundreds of millions. This number impacts both the eventual size of a searchable index and how fast it can be built, and will thus inform the eventual system design, both in hardware and software terms. It’s usually possible to search from 5 to 50 million items on a single server – but this also depends on the answer to the next questions:
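As a back-of-the-envelope illustration of how the item count feeds into hardware planning, the Python sketch below turns the 5 to 50 million figure into a rough range of server counts. The function name servers_needed is made up for this example, and the calculation ignores replication, query load and document size, all of which matter in practice.

```python
def servers_needed(total_items: int, items_per_server: int) -> int:
    """Round up to whole servers; even a tiny index needs one."""
    if total_items < 0 or items_per_server <= 0:
        raise ValueError("item counts must be non-negative and capacity positive")
    # Ceiling division using integers only, so large counts stay exact.
    return max(1, -(-total_items // items_per_server))


def server_range(total_items: int) -> tuple[int, int]:
    """Best and worst case, using the 50 million and 5 million per-server figures."""
    return servers_needed(total_items, 50_000_000), servers_needed(total_items, 5_000_000)
```

For 200 million items, server_range(200_000_000) gives somewhere between 4 and 40 servers, which shows why the follow-up questions about item size and complexity are needed to narrow things down.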
How big/complex are the items to be searched?
This includes both the size of each item and what data it contains: for example, does each item contain a price, or a characteristic like an author’s name or a colour? The item may be part of a group of items, have user tags applied, or be restricted to a certain group of users. The searchable index we build has to take account of all this information in the correct way, so we can search it effectively.
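One way to picture this is to write down what a single item looks like and how each piece would be treated at indexing time. The Python sketch below is illustrative only: the Item class, its field names and the index_fields function are assumptions made for this example, not a schema from any particular search engine.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    """One searchable item with the kinds of data mentioned above (hypothetical schema)."""
    item_id: str
    body: str
    price: Optional[float] = None
    author: Optional[str] = None
    colour: Optional[str] = None
    collection: Optional[str] = None  # the group of items this one belongs to
    tags: set[str] = field(default_factory=set)
    allowed_groups: set[str] = field(default_factory=set)


def index_fields(item: Item) -> dict:
    """Split an item into free text, numeric values and exact-match filter terms."""
    filters = []
    for name, value in (("author", item.author), ("colour", item.colour),
                        ("collection", item.collection)):
        if value:
            filters.append(f"{name}:{value.lower()}")
    # Sorting keeps the output stable regardless of set ordering.
    filters += [f"tag:{tag.lower()}" for tag in sorted(item.tags)]
    filters += [f"group:{group.lower()}" for group in sorted(item.allowed_groups)]
    values = {} if item.price is None else {"price": item.price}
    return {"text": item.body, "values": values, "filters": filters}
```

The split matters because each kind of data is searched differently: body text is matched by relevance, a price is sorted or range-filtered, and authors, tags and user groups are matched exactly.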
What other systems must the search engine work with?
Sometimes search engines will have to fit into an existing infrastructure – say an intranet or web application framework – and sometimes they will have to extract information from another system, such as a relational database. The engine may also have to take account of existing security systems, which can impact how each search result is delivered. It may have to deliver search results as a web page, or as a report, or as an email. There’s obviously a huge variety of possible systems to interact with, not least the operating system or platform.
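As a small example of the extraction side, the Python sketch below pulls rows out of a relational database and turns each one into a dictionary ready to hand to an indexer. It uses SQLite from the standard library purely so the example is self-contained. The function name rows_as_documents, and the products table in the usage note, are invented for illustration.

```python
import sqlite3
from typing import Iterator


def rows_as_documents(conn: sqlite3.Connection, query: str) -> Iterator[dict]:
    """Run a query and yield one {column: value} dictionary per row."""
    cursor = conn.execute(query)
    # cursor.description holds one entry per result column; the name comes first.
    columns = [column[0] for column in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))
```

Given a connection with a products table, something like list(rows_as_documents(conn, "SELECT id, name, price FROM products")) would produce one dictionary per product. Yielding rows one at a time rather than loading the whole table keeps memory use flat, which matters once the source system holds millions of records.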
What’s your schedule for delivering a search solution?
This is another key point – it can be relatively quick to build a simple search application, but if the system is going to be very large or very complex, or if a staged delivery based on user feedback is required, then it’s important to know what the expectations are. We’ve installed systems in a couple of days, and built more complex ones over years.
In all cases it’s important to realise that every client will have different requirements and expectations. The more information we can gather at the start of the process, the more likely it is that everyone ends up satisfied with the end result.
One of the challenges we often come up against is indexing data held in other proprietary or open source systems, such as databases or content management systems. Talend is an open source data integration platform that lets you connect to a huge variety of these systems, from Salesforce to Oracle to SugarCRM. Talend is an offshoot of the Eclipse open source community. We’ll be following the development of Talend with interest.
There’s also the related problem of translating file formats before indexing them. Luckily there are lots of open source converters (as used by Omega, part of Xapian), or if you run on a Microsoft platform there are IFilters – the latter aren’t open source, but you can easily connect to them from another program using COM. In our experience, the IFilters are better at extracting content from Microsoft-specific formats.
UPDATE: I’ve also recently discovered the Tika project, under the Apache umbrella. Not a lot of formats supported so far, but it’s a start.