One of the challenges we often come up against is indexing data held in other proprietary or open source systems, such as databases or content management systems. Talend is an open source data integration platform that lets you connect to a huge variety of these systems, from Salesforce to Oracle to SugarCRM. Talend's tooling is built on the Eclipse open source platform. We’ll be following the development of Talend with interest.
There’s also the related problem of translating file formats before indexing them. Luckily there are lots of open source converters (as used by Omega, part of Xapian), or if you run on a Microsoft platform there are IFilters. The latter aren’t open source, but you can easily connect to them from another program using COM. In our experience, IFilters are better at extracting content from Microsoft-specific formats.
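To give a flavour of the converter approach, here is a minimal sketch in Python of picking an external converter by file extension and capturing its plain-text output, ready to hand to an indexer. It is not how Omega itself is implemented. The names `CONVERTERS`, `converter_command` and `extract_text` are made up for illustration, and the choice of `pdftotext` and `antiword` is an assumption: both need to be installed separately, and you would swap in whichever converters suit your formats.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to an external converter command.
# "{path}" is a placeholder for the file being converted. Both tools write
# the extracted text to standard output when called this way.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["antiword", "{path}"],
}


def converter_command(path):
    """Return the command line for converting `path`, or None if no converter is known."""
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [str(path) if arg == "{path}" else arg for arg in template]


def extract_text(path):
    """Return the plain text of `path`, using an external converter where one is configured."""
    command = converter_command(path)
    if command is None:
        # No converter configured: assume the file is already plain text.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `extract_text("report.pdf")` would run `pdftotext report.pdf -` and return whatever text it prints. Keeping the mapping in a plain dictionary means adding a new format is a one-line change.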
UPDATE: I’ve also recently discovered the Tika project, under the Apache umbrella. Not a lot of formats supported so far, but it’s a start.