Posts Tagged ‘file format’

G-Cloud and open file formats, a cautionary tale

We’re lucky enough to have our services available on the G-Cloud, a new initiative by the UK Government’s Cabinet Office with the aim of breaking the sometimes monopolistic practices of ‘big IT’ when supplying government clients. We’ve recently had a couple of contracts procured via the G-Cloud iii framework and one of the requirements is to report whenever a client is invoiced. This is done via a website called Management Information Systems Online (MISO).

Part of the process is to input various mysterious Product Codes, and to find out what these were I downloaded a file from the MISO website. I use the Firefox browser and OpenOffice so I had assumed that opening this file would be a relatively simple process…perhaps unwisely.

Firstly, due to some quirk of the website and/or browser the file arrives with no file extension. I’m assuming it’s some kind of Microsoft Office document so I try renaming it to .xls as an Excel spreadsheet, and open it in OpenOffice Calc. This doesn’t work, as I end up with a load of XML in the spreadsheet cells. As it’s XML I wonder if it’s a newer, XML-powered Office format, so I rename it to .xlsx, but no, that doesn’t work either. Opening up the file in a text editor shows it’s some kind of XML with Microsoft schemas abounding. At this point I tried contacting the MISO technical support department but they weren’t able to help.

A quick Google and I’ve discovered that the file is probably SpreadsheetML, a file format used before 2007 when Microsoft finally went the whole hog and embraced (well, forced everyone else to embrace) their own XML-based standard for Office documents. OpenOffice should be able to read this older format, so I try renaming the file as .xml and importing it. OpenOffice now tells me "OpenOffice.org requires a Java runtime environment (JRE) to perform this task. The selected JRE is defective."
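With hindsight, the renaming guesswork could have been replaced by looking at the first few bytes of the file. Here is a minimal Python sketch of that idea. The function name and the returned labels are my own invention for illustration; the signatures it checks are the usual ones for binary Office files, ZIP containers and the Excel 2003 XML namespace.

```python
# Namespace URI that appears in Excel 2003 XML Spreadsheet (SpreadsheetML) files.
SPREADSHEETML_NS = b"urn:schemas-microsoft-com:office:spreadsheet"
# Signature at the start of legacy binary Office files (OLE2 compound documents).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# A UTF-8 byte order mark plus whitespace that may precede the first XML tag.
LEADING_JUNK = b"\xef\xbb\xbf \t\r\n"


def sniff_office_format(data: bytes) -> str:
    """Guess a spreadsheet's format from the start of its content.

    The returned labels are hypothetical names chosen for this sketch.
    """
    head = data[:4096]
    if head.startswith(OLE2_MAGIC):
        return "ole2-binary"          # e.g. an old-style .xls
    if head.startswith(b"PK"):
        return "zip-container"        # e.g. .xlsx or .ods, both ZIP archives
    if head.lstrip(LEADING_JUNK).startswith(b"<") and SPREADSHEETML_NS in head:
        return "spreadsheetml-2003"   # plain XML, not zipped
    return "unknown"
```

Reading the downloaded file in binary mode and passing its contents to `sniff_office_format` would have pointed at SpreadsheetML straight away, whatever the file happened to be called.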

This is now taking far too long. After some more research I discover that what this actually means is that OpenOffice needs a version of Java 6 (now discouraged by Oracle). I have to register for an Oracle account to even download it. Finally, OpenOffice is able to read the file and I can now fill in the original form.
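For anyone who would rather avoid the Java detour, a SpreadsheetML file is plain XML and can be read with a few lines of script. The following is a rough sketch rather than a complete reader: it assumes the usual Workbook / Worksheet / Table / Row / Cell / Data layout, only looks at the first worksheet, and ignores formatting, merged cells and data types. The function name is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by Excel 2003 XML Spreadsheet elements and attributes.
SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def read_spreadsheetml(xml_text):
    """Return the rows of the first worksheet as lists of cell strings."""
    root = ET.fromstring(xml_text)
    table = root.find("ss:Worksheet/ss:Table", NS)
    rows = []
    if table is None:
        return rows
    for row in table.findall("ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            # ss:Index is a 1-based column number that lets a file skip
            # empty cells; pad with blanks so the columns stay aligned.
            index = cell.get("{%s}Index" % SS)
            if index is not None:
                while len(cells) < int(index) - 1:
                    cells.append("")
            data = cell.find("ss:Data", NS)
            cells.append((data.text or "") if data is not None else "")
        rows.append(cells)
    return rows
```

The result is a list of rows, which is enough to look up a code by eye or write the data out as CSV.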

If anything this process proves that central government has a long way to go towards adopting open standards and using plain, widely adopted file formats. The G-Cloud framework is a great step forward – but some of the details still need some work.

The trouble with tabbing: editing rich text on the Web

Matt Pearce, who joined the Flax team earlier this year, writes:

A recent client wished to convert documents to and from Microsoft Office formats, using a web form as an intermediate step for editing the content. The documents were read in, imported to a Solr search engine, and could then be searched over, cloned, edited and transformed in batches, before being exported to Office once more.

The content itself was broken down into fields, some of which were simple text or date entry boxes, while others were more complex rich text fields. We opted to use TinyMCE as our rich text editor of choice – it’s small, open source, and easy to extend (we already knew we wanted to write at least one plugin).

The problem arose when the client explained to us that they wanted to use the tab key in rich text fields to create consistent spacing in the text. These needed to display as closely as possible to the original document format, and convert to actual tabs in the Office documents. This presented a number of problems:
By default, the tab key moves the user to the next field on a web page, and needs special handling to prevent this behaviour, especially when it only needs to be applied to certain fields on the page.

The spacing had to be consistent, like a word processor’s tab stop. This is tricky when working with proportional fonts, especially in a web form.

The client didn’t want to use an indent feature. The tab only came at the start of the paragraph – beyond that point the text could wrap around to the start of the line.

The tab needed to be recognisable in our processing code, so it could be converted to a real tab when it was exported to MS Office.

The preferred solution would have been a document editor like that used for Google Docs. Unfortunately, we didn’t have the time to write the whole input and presentation layer in JavaScript as Google have! We also wanted to keep the editing function inside the web application if possible, rather than forcing the user to edit the documents in Microsoft Office and then re-import them every time they needed to make changes.

I started with TinyMCE’s “nonbreaking” plugin, which captures the tab key and converts it to a number of non-breaking spaces. This wasn’t directly suitable for our needs – I discovered that the number of spaces is not always consistent, and they are sometimes converted to regular (rather than non-breaking) spaces. In addition, it doesn’t act like a tab stop – it inserts four spaces wherever you are on the line, which didn’t match the client’s requirement.

I adapted the plugin to insert a <span> into the text, using variable padding to ensure it was the right width. This worked reasonably well, after a not insignificant amount of head scratching trying to work around issues with spacing and space handling. Unfortunately, we struck usability problems when trying to backspace over the tab. The ideal situation would be that a single backspace would remove the entire tab, leaving the user at the start of the line (or the point before they hit the tab key). In fact, a single backspace would leave the user inside the span – two backspaces were required to visibly remove the tab from the editor, and the user could not tell that they were inside the span. Nor could you reliably select the “tab” with the mouse. In addition, Firefox started to behave oddly at this point, putting the cursor in unexpected positions.

My final solution was ugly but workable. We switched to using a monospace font in the rich text editor and, after discussion with the client, started using a variable number of arrow characters to represent the tabs (we actually used , or a closing single quote, if you are reading and writing in German). This made life immediately simpler – dropping the proportional font meant that we didn’t have to worry about getting the width right, just the number of characters to insert. It does mean that in order to remove the tab, the user has to backspace over up to four characters, but the characters are clearly visible: you don’t find yourself inside a span that can’t be seen without viewing the underlying HTML.
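To make the arithmetic concrete, here is a small Python sketch of the two halves of this approach: working out how many marker characters take you to the next tab stop, and turning runs of markers back into real tabs on export. It is an illustration only, not our production code. The tab width of four, the marker character and both function names are assumptions made for the example.

```python
import re

TAB_WIDTH = 4          # assumed distance between tab stops, in characters
TAB_MARKER = "\u2192"  # a stand-in arrow; not necessarily the character we used


def markers_for_tab(column, tab_width=TAB_WIDTH):
    """Return the markers needed to move from `column` to the next tab stop."""
    count = tab_width - (column % tab_width)
    return TAB_MARKER * count


def markers_to_tabs(line, tab_width=TAB_WIDTH):
    """Replace each run of markers in a single line with real tab characters.

    With a monospace font a character's offset in the line is its column,
    so a run that crosses several tab stops becomes several tabs.
    """
    def replace(match):
        start, end = match.span()
        stops_crossed = end // tab_width - start // tab_width
        # A partly deleted run may not reach a stop; still emit one tab.
        return "\t" * max(1, stops_crossed)

    return re.sub(re.escape(TAB_MARKER) + "+", replace, line)
```

Pressing tab at column 2 gives two markers, so `"ab" + markers_for_tab(2) + "cd"` lines the `c` up at column 4, and `markers_to_tabs` turns that line into `"ab\tcd"` for export.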

While I’m sure this isn’t a unique problem, I couldn’t find anyone else that had been trying to do something similar. I am also not sure whether our choice of rich text editor affected how tricky this problem turned out to be. If anybody reading has suggestions of better approaches to this, we’d be interested to hear from them.

Open source intranet search over millions of documents with full security

Last year my colleague Tom Mortimer talked about indexing security information within an open source enterprise search application, and we’re happy to announce more details of the project. Our client is an international radio supplier, who had considered both closed source products and search appliances, but chose open source for greater flexibility and the much lower cost of scaling to indexes of millions of documents.

Using the Flax platform, we built a high-performance multi-threaded filesystem crawler to gather documents, translated them to plain text using our own open source Flax Filters and captured Unix file permissions and access control lists (ACLs). User logins are authenticated against an LDAP server and we use this to show only the results a particular user is allowed to see. We also added the ability to tag documents directly within the search results page (for example, to mark ‘current’ versions, or even personal favourites) – the tags can then be used to filter future results. Faceted search is also available.
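The general idea of security-aware results can be sketched in a few lines of Python. This is a toy, in-memory illustration and not the indexing method used in the project: the class, the field names and the simple substring match are all assumptions made for the example, and in a real system the user and group names would come from the LDAP login.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """A document plus the permissions and tags captured at index time."""
    path: str
    text: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)
    tags: set = field(default_factory=set)


def can_read(doc, user, groups):
    """True if the user, or any group they belong to, is allowed to read the doc."""
    return user in doc.allowed_users or bool(doc.allowed_groups & groups)


def search(docs, term, user, groups, required_tag=None):
    """Return paths of matching documents that this user is allowed to see."""
    hits = []
    for doc in docs:
        if term.lower() not in doc.text.lower():
            continue
        if not can_read(doc, user, groups):
            continue  # never reveal a result the user cannot open
        if required_tag is not None and required_tag not in doc.tags:
            continue
        hits.append(doc.path)
    return hits
```

The important design point is that the permission check happens as part of the search itself, so a restricted document never appears in the result list or the result counts.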

You can read more about the project in a case study (PDF) and Tom’s presentation slides (PDF) explain more about the method we used to index the security information.

Some new open source file filters & previewers

We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and PowerPoint, the OpenOffice equivalent formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.
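To give a flavour of what a filter does, here is a minimal Python sketch that picks a converter by file extension and reduces HTML to plain text. It is not the Flax Filters code or API: the names are hypothetical, it uses only the Python standard library, and it covers just HTML and plain text.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Strip tags from HTML and join the remaining text with single spaces."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical registry mapping a file extension to its converter.
FILTERS = {".html": html_to_text, ".htm": html_to_text, ".txt": lambda text: text}


def to_plain_text(filename, content):
    """Convert content to plain text using the filter for the file's extension."""
    suffix = Path(filename).suffix.lower()
    convert = FILTERS.get(suffix)
    if convert is None:
        raise ValueError("no filter registered for %r" % suffix)
    return convert(content)
```

Supporting another format in a scheme like this means writing one more function and adding it to the registry, which is also what makes third-party contributions straightforward.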

We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.

Feedback would of course be very welcome.

Posted in Technical

March 12th, 2010

Search requirements and asking the right questions

When we’re contacted by potential clients, we have to gather as much information as possible about how and why they need search technology. This takes the form of either a face-to-face or telephone meeting with much scribbling in notebooks, or a long exchange of emails. In all cases there are some important questions that must be answered, and I thought it might be useful to list the most common ones here:

How many items do you need to search?

The number of items to search varies widely, from a few thousand to hundreds of millions. This number impacts both the eventual size of a searchable index and how fast it can be built, and will thus inform the eventual system design, both in hardware and software terms. It’s usually possible to search from 5 to 50 million items on a single server – but this also depends on the answers to the next questions:

How big/complex are the items to be searched?

This includes both the size of each item and what data it contains: for example does each item contain a price, or a characteristic like an author’s name, or colour. The item can be part of a group of items, have user tags applied, or be restricted to a certain group of users. The searchable index we build will have to take account of all this information in the correct way, so we can search it effectively.

What other systems must the search engine work with?

Sometimes search engines will have to fit into an existing infrastructure – say an intranet or web application framework – and sometimes they will have to extract information from another system, such as a relational database. The engine may also have to take account of existing security systems, which can impact how each search result is delivered. It may have to deliver search results as a web page, or as a report, or as an email. There’s obviously a huge variety of possible systems to interact with, not least the operating system or platform.

What’s your schedule for delivering a search solution?

This is another key point – it can be relatively quick to build a simple search application, but if the system is going to be very large or very complex, or if a staged delivery based on user feedback is required, then it’s important to know what the expectations are. We’ve installed systems in a couple of days, and built more complex ones over years.

In all cases it’s important to realise that every client will have differing requirements and expectations. The more information we can gather at the start of the process, the more likely it is that everyone ends up satisfied with the end result.

Posted in Business

March 19th, 2009

Open source data integration and file format translation

One of the challenges we often come up against is indexing data held in other proprietary or open source systems, such as databases or content management systems. Talend is an open source data integration platform that lets you connect to a huge variety of these systems, from Salesforce to Oracle to SugarCRM. Talend is an offshoot of the Eclipse open source community. We’ll be following the development of Talend with interest.

There’s also the related problem of translating file formats before indexing them. Luckily there are lots of open source converters (as used by Omega, part of Xapian), or if you run on a Microsoft platform there are IFilters – the latter aren’t open source, but you can easily connect to them from another program using COM. In our experience, the IFilters are better at extracting content from Microsoft-specific formats.

UPDATE: I’ve also recently discovered the Tika project, under the Apache umbrella. Not a lot of formats supported so far, but it’s a start.

Posted in Technical

February 18th, 2009
