We’ve just released an early version of Flax Filters, which allow basic conversion of various proprietary formats to plain text ready for indexing. Currently the filters support Microsoft Word, Excel and Powerpoint, the Open Office equivalent formats, Adobe PDF, plain text and HTML, but we’ll be adding more in the future (of course, we’d welcome contributions from third parties). We’re already using these filters in some customer installations.
We’ve also created a previewer, so users can see floating previews of the first page of a document in search results. We’ll be adding this feature to a future release of Flax Basic.
Feedback would of course be very welcome.
Posted in Technical.
Tagged with file format, flax, open source.
Last week we heard from various sources that Microsoft had announced it would continue to develop its recently acquired FAST Search technology only on Windows. This had long been feared by some in the sector, and it must be worrying for existing customers.
Platform choice can be a key issue for those looking to implement advanced search, as there may be significant existing in-house expertise and investment in a particular platform. Our Flax solution works just as well on Windows, Linux or Solaris. It’s sad to see such a powerful technology as FAST become so narrow in focus, but it’s not particularly surprising after the Microsoft acquisition.
UPDATE: more coverage on this from The Register
Posted in Business, News.
Tagged with FAST, microsoft, open source.
Here are two relatively new networking groups – these are informal gatherings of those who work with enterprise search. I’ve been to the first one and it was very interesting.
London Open Source Social – for those working with open-source enterprise search
Enterprise Search London – more generally for those working in enterprise search
Posted in events.
Tagged with events, networking, open source.
A new year, and a chance to think about what might happen in the world of enterprise search over the next twelve months. I’ll make a stab at some predictions:
- Price cuts – possibly driven by even harsher competition between Google and Microsoft FAST, I can see prices coming down for packaged enterprise search. Autonomy will probably raise theirs.
- Real time search matures – not just Twitter or Facebook, but real time data from many sources being part of enterprise search results
- More geolocation-aware search – in the U.K. at least, we’re seeing signs that the source data is finally being freed up, which should make it a lot simpler and cheaper to build location-aware solutions
- Fewer second-tier players in the market – it’s still difficult out there, and I’m afraid not every company will survive the next year.
You’re welcome to take any of these with a generous pinch of salt!
Posted in Business.
Tagged with autonomy, microsoft, real time.
Back at Online 2009 on Thursday, to take part in the closing panel: “Cloud Computing, Open Source and Semantics: Content and Search Predictions”, moderated by Stephen Arnold. We only touched on four of the ten controversial themes Stephen had prepared: we talked a lot about how ‘Google pressure’ will affect the market, how XML isn’t necessarily the universal panacea for representing data, on the growth of rich media and the challenges it presents and finally on security. Some great questions from the floor as well, thanks to all who came and the organisers and Stephen for inviting us. I wish we’d had more time!
I didn’t agree with Stephen’s main point that Google will crush us all – I think the battles between Google and Microsoft (and Google and everyone else) are a distraction. While they’re fighting it out the rest of us can get on with developing cutting-edge search technologies. Open source search technology gives us tremendous flexibility, allows us to develop solutions very fast, allows the customer to take ownership of the system that’s being developed and now has comparable performance, scalability and commercial support to the traditional closed source world.
The real question is how this will affect the profitability of existing companies in the search space. I wonder who won’t be around at next year’s Online Information show…
Posted in Business, News.
Tagged with events, open source, performance.
I’ve created a page with links to our Flax Newsletters – let us know if you would like to be added to the mailing list (or indeed, if you’d like to be removed from it).
Posted in News.
Tagged with flax.
I visited the Online Information exhibition yesterday at Olympia. My first impression was that the exhibition area was very quiet – and a few of the exhibitors agreed with me. The current financial situation would seem the obvious cause. At previous shows exhibitors have given away all kinds of freebies, from bags, to mini mice, to branded juggling balls… but this year you’d be lucky if you came away with a couple of free pens and a boiled sweet.
I dropped in on the associated conference later, and caught a presentation titled “The Real Time Web: Discovery vs. Search”. Antonio Gulli of Microsoft told us about their new European offices, including one in Soho, that were concentrating on bringing new features to Bing – but the results look very familiar. Is Bing doomed to play catch-up? The only ‘real time’ feature he discussed was indexing Twitter, although apparently they’ll soon be indexing Facebook as well. Surely real time encompasses more than these two platforms?
Stephen Arnold gave us his thoughts on what we should mean by ‘real time’, sensibly talking about how the financial services world has been using real time systems for many years. He also injected some notes of caution about how difficult it is to trust information spread amongst peers on social networking sites – here’s a recent case, read further down the page for a great quote from Graham Cluley.
Someone from Endeca (I didn’t catch the name, he was replacing the published speaker) showed us lots of slides of various applications of search, but his theme seemed more about how search can replace traditional databases than about ‘real time’, something I’ve blogged about recently.
We finished with Conrad Wolfram, demonstrating Wolfram Alpha, which isn’t really a search engine but rather a computation engine – it tries to give you a set of answers, rather than a list of possible resources where the answer might be found. Not a lot of ‘real time’ here either.
I’m back on Thursday as part of the closing keynote panel.
Posted in Uncategorized, events.
Tagged with events, real time.
We’ve recently been working with mySkreen, who like Hulu in the U.S. provide a service for finding and viewing television programs via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.
mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). Using Flax, you can search using programme title, actors, genres or time periods. We also added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into more exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. We also added faceted search and automatic spelling correction.
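To illustrate the kind of query parsing described above, here is a minimal sketch in Python of turning a fuzzy phrase into an exact time window. It is not the code delivered to mySkreen: the function name `parse_fuzzy_time`, the English vocabulary and the hour ranges chosen for each part of the day are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta

# Hypothetical vocabulary: these offsets and hour ranges are illustrative
# assumptions, not values taken from the mySkreen project.
DAY_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1}
PART_OF_DAY = {
    "morning": (6, 12),
    "afternoon": (12, 18),
    "evening": (18, 23),
    "tonight": (18, 23),
}


def parse_fuzzy_time(query, now):
    """Return a (start, end) pair of datetimes for a phrase such as
    'tomorrow evening', or None if no time word is recognised."""
    offset = None
    hours = None
    for word in query.lower().split():
        if word in DAY_OFFSETS:
            offset = DAY_OFFSETS[word]
        if word in PART_OF_DAY:
            hours = PART_OF_DAY[word]
    if offset is None and hours is None:
        return None
    # Default to today and to the whole day when only one part is given.
    day = now.date() + timedelta(days=offset or 0)
    start_hour, end_hour = hours or (0, 24)
    midnight = datetime.combine(day, time(0))
    return (midnight + timedelta(hours=start_hour),
            midnight + timedelta(hours=end_hour))
```

Called with `parse_fuzzy_time("tomorrow evening", datetime(2009, 9, 1, 10, 0))`, this gives a window from 18:00 to 23:00 on 2 September 2009, which could then be passed to the search engine as a date range filter.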
This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.
Posted in News.
Tagged with flax, indexing, media.
Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.
From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.
A lot of search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must be made across all these indexes of course, with care taken to maintain accurate statistics and thus relevancy ranking.
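The multi-index approach can be sketched as a toy example in Python. This is a simplified in-memory illustration, not how any particular engine implements it: the class names `Segment` and `TieredIndex` are hypothetical, document ids are assumed to be unique across all indexes, and a basic TF-IDF weight stands in for a real ranking scheme. The point to notice is that document counts and document frequencies are summed across every index before scoring, so results are ranked the same before and after a merge.

```python
import math
from collections import Counter, defaultdict


class Segment:
    """One inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def absorb(self, other):
        """Merge another segment's postings into this one."""
        for term, docs in other.postings.items():
            self.postings[term].update(docs)
        self.doc_count += other.doc_count


class TieredIndex:
    """A small fresh segment for new content plus older static segments."""

    def __init__(self):
        self.fresh = Segment()
        self.static = [Segment()]

    def add(self, doc_id, text):
        # New content only ever touches the small fresh segment.
        self.fresh.add(doc_id, text)

    def merge_fresh(self):
        # Run periodically: fold the fresh segment into the static set.
        self.static[-1].absorb(self.fresh)
        self.fresh = Segment()

    def search(self, query):
        segments = self.static + [self.fresh]
        # Statistics are combined across all segments so that weights
        # do not depend on which segment a document happens to live in.
        total_docs = sum(s.doc_count for s in segments)
        scores = Counter()
        for term in query.lower().split():
            df = sum(len(s.postings.get(term, {})) for s in segments)
            if df == 0:
                continue
            idf = math.log(1 + total_docs / df)
            for s in segments:
                for doc_id, tf in s.postings.get(term, {}).items():
                    scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common()]
```

New documents are searchable as soon as `add` returns, while the cost of rewriting the larger static segment is only paid when `merge_fresh` is called.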
The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
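One way to combine the two, sketched here in Python under some loud assumptions: relevance scores are taken to be already normalised to the range 0 to 1, freshness decays exponentially with a made-up half-life of 24 hours, and a single `recency_weight` parameter (a hypothetical name) is what the user-facing option would control. Setting it to 0 ranks purely by relevance and setting it to 1 ranks purely by recency.

```python
def blended_score(relevance, age_hours, recency_weight=0.5,
                  half_life_hours=24.0):
    """Mix a relevance score in [0, 1] with an exponentially decaying
    freshness score. The half-life is an illustrative assumption."""
    freshness = 0.5 ** (age_hours / half_life_hours)
    return (1 - recency_weight) * relevance + recency_weight * freshness


def rank(results, recency_weight=0.5):
    """results is a list of (doc_id, relevance, age_hours) tuples;
    returns doc ids ordered by blended score, best first."""
    ordered = sorted(
        results,
        key=lambda r: blended_score(r[1], r[2], recency_weight),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in ordered]
```

With an old but highly relevant document and a new but weakly matching one, `rank(results, recency_weight=0)` puts the old document first and `rank(results, recency_weight=1)` puts the new one first; values in between trade one off against the other.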
In any case there will always be some delay between content being published and being searchable – the trick is to keep this to the minimum, so it appears as ‘real-time’ as possible.
Posted in News, Technical.
Tagged with indexing, real time.
We sponsored Open Source Search Cambridge last week, which went very well, with attendees from as far away as Tokyo and New Zealand, a great variety of talks, presentations and networking and some excellent food!
Shane Evans from mydeco gave a detailed talk on Creating a product search engine, with some interesting details on how query-independent weights are calculated. He was followed by Olly Betts on How Gmane is implemented using Xapian – 72 million messages indexed on a single server! We also had talks from those involved with the Cheshire3 XML search engine and with PuppyIR, a project to develop search frameworks for children, and found out more about how Glasses Direct have implemented their search using Solr.
The afternoon consisted of a number of well-attended seminars on search topics, such as comparisons of the various open source search engines available. The day ended with informal networking in a nearby pub.
Based on the feedback we got, there’s definitely interest in a similar event next year – watch this space.
Update: sounds like Search Solutions 2009 was also a good day.
Posted in events.
Tagged with events, lucene, open source, xapian.