Open source data integration and file format translation

One of the challenges we often come up against is indexing data held in other proprietary or open source systems, such as databases or content management systems. Talend is an open source data integration platform that lets you connect to a huge variety of these systems, from Salesforce to Oracle to SugarCRM. Talend is an offshoot of the Eclipse open source community. We’ll be following the development of Talend with interest.

There’s also the related problem of translating file formats before indexing them. Luckily there are lots of open source converters (as used by Omega, part of Xapian), or if you run on a Microsoft platform there are IFilters. The latter aren’t open source, but you can easily connect to them from another program using COM. In our experience, the IFilters are better at extracting content from Microsoft-specific formats.
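As a rough illustration of the converter approach, here is a minimal Python sketch that picks an external command-line converter by file extension and captures its plain-text output for indexing. The `CONVERTERS` table and the `build_command` and `extract_text` helpers are hypothetical names invented for this example. `pdftotext` and `antiword` are real tools, but whether they are installed on your system, and which converter suits each format, is an assumption you would need to check.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to a converter command line.
# "{path}" is a placeholder that is replaced with the file being converted.
# Assumes pdftotext (Poppler) and antiword are installed and on the PATH.
CONVERTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" sends the text to stdout
    ".doc": ["antiword", "{path}"],        # antiword writes to stdout by default
}


def build_command(path, converters=CONVERTERS):
    """Return the converter argv for path, or None if none is registered."""
    template = converters.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [arg.replace("{path}", str(path)) for arg in template]


def extract_text(path, converters=CONVERTERS):
    """Return the plain text of a file, ready to hand to an indexer."""
    command = build_command(path, converters)
    if command is None:
        # No converter registered: treat the file as plain text already.
        return Path(path).read_text(encoding="utf-8", errors="replace")
    # check=True raises CalledProcessError if the converter exits with an error.
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Adding a format is then a matter of adding one entry to the table, which is roughly why a set of small single-purpose converters is easy to extend.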

UPDATE: I’ve also recently discovered the Tika project, under the Apache umbrella. Not a lot of formats supported so far, but it’s a start.


This entry was posted on Wednesday, February 18th, 2009 at 4:07 pm and is filed under Technical.

One Response to “Open source data integration and file format translation”

  1. If you are looking for open source data integration specifically designed to support semantic or enterprise search project challenges, then here are a few links that might be of interest:
    http://www.eclipse.org/smila ….this project will http://aperture.sourceforge.net/
