We’ve recently been working with mySkreen, who, like Hulu in the U.S., provide a service for finding and viewing television programmes via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.
mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). With Flax, you can search by programme title, actor, genre or time period. We added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. Faceted search and automatic spelling correction are also included.
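To illustrate the kind of query parsing described above, here is a minimal sketch, not the mySkreen implementation. The function name `parse_time_phrase`, the English phrase tables and the hour boundaries (for example ‘evening’ as 18:00 to 23:00) are all assumptions made for the example; the real service works with French-language queries.

```python
from datetime import datetime, timedelta

# Hypothetical phrase tables: the boundaries actually used are not stated in this post.
DAY_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1}
DAY_PARTS = {
    "morning": (6, 12),
    "afternoon": (12, 18),
    "evening": (18, 23),
    "tonight": (18, 23),
}


def parse_time_phrase(phrase, now):
    """Return (start, end) datetimes for a fuzzy phrase, or None if unrecognised."""
    offset = None
    hours = None
    for word in phrase.lower().split():
        if word in DAY_OFFSETS:
            offset = DAY_OFFSETS[word]
        if word in DAY_PARTS:
            hours = DAY_PARTS[word]
    if offset is None and hours is None:
        return None  # not a time phrase; treat it as ordinary query text

    # Midnight at the start of the target day.
    day = (now + timedelta(days=offset or 0)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    # With no part of the day given, cover the whole day.
    start_hour, end_hour = hours if hours else (0, 24)
    return day + timedelta(hours=start_hour), day + timedelta(hours=end_hour)
```

The resulting pair of datetimes could then be applied as a range restriction on a broadcast-time field, while any unrecognised words are searched as normal text.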
This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.
Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.
From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.
Many search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must of course be made across all of these indexes, with care taken to combine term statistics (such as document frequencies) across them so that relevancy ranking stays accurate.
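The fresh-plus-static arrangement can be sketched with a toy in-memory index. This is an illustration under simplifying assumptions, not how Flax or Xapian work internally: the `Segment` and `TieredIndex` classes are invented for the example, scoring is a plain tf-idf sum, and document ids are assumed to be unique with no updates or deletions.

```python
import math
from collections import Counter, defaultdict


class Segment:
    """A tiny inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def merge_from(self, other):
        for term, docs in other.postings.items():
            self.postings[term].update(docs)
        self.doc_count += other.doc_count


class TieredIndex:
    """New documents go to a small fresh segment, merged into the static one in bulk."""

    def __init__(self, merge_threshold=1000):
        self.static = Segment()
        self.fresh = Segment()
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        self.fresh.add(doc_id, text)  # cheap: only the small segment changes
        if self.fresh.doc_count >= self.merge_threshold:
            self.merge()

    def merge(self):
        self.static.merge_from(self.fresh)
        self.fresh = Segment()

    def search(self, query):
        segments = (self.static, self.fresh)
        # Statistics are summed over all segments, so a document scores the
        # same whether it currently lives in the fresh or the static segment.
        total_docs = sum(s.doc_count for s in segments)
        scores = defaultdict(float)
        for term in query.lower().split():
            df = sum(len(s.postings.get(term, {})) for s in segments)
            if df == 0:
                continue
            idf = math.log(1 + total_docs / df)
            for s in segments:
                for doc_id, tf in s.postings.get(term, {}).items():
                    scores[doc_id] += tf * idf
        # Highest score first; ties broken by doc id for a stable order.
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because document counts and document frequencies are computed across both segments at query time, merging changes where a document is stored but not how it is ranked.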
The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results? Does ‘more recent’ always trump ‘more relevant’? As usual, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
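One simple way to combine the two signals is a weighted blend of relevance and an exponentially decaying freshness score. The sketch below is only one possible formula; the function names, the one-hour half-life and the assumption that relevance is already normalised to the range 0 to 1 are all illustrative choices.

```python
def blended_score(relevance, age_seconds, recency_weight=0.5, half_life_seconds=3600.0):
    """Blend relevance (assumed to be in 0..1) with a freshness score in 0..1.

    recency_weight=0 ranks purely by relevance, recency_weight=1 purely by age.
    """
    if not 0.0 <= recency_weight <= 1.0:
        raise ValueError("recency_weight must be between 0 and 1")
    # Freshness halves every half_life_seconds; brand new content scores 1.0.
    freshness = 0.5 ** (max(age_seconds, 0.0) / half_life_seconds)
    return (1.0 - recency_weight) * relevance + recency_weight * freshness


def rank(results, recency_weight=0.5, half_life_seconds=3600.0):
    """Sort (doc_id, relevance, age_seconds) tuples, best first."""
    return sorted(
        results,
        key=lambda r: blended_score(r[1], r[2], recency_weight, half_life_seconds),
        reverse=True,
    )
```

Exposing `recency_weight` to the user gives the choice mentioned above: 0 for ‘most relevant’, 1 for ‘most recent’, and something in between as the default.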
In any case there will always be some delay between content being published and being searchable. The trick is to keep this delay to a minimum, so that search appears as ‘real-time’ as possible.
As September begins, there are various events coming up that may interest our readers. We have a list of conferences we’re attending and/or presenting at. Gartner are running their Portals, Content and Collaboration Summit in London in mid-September. Also in London, in late October, is E Commerce Expo 2009, which is relevant because most e-commerce solutions need some kind of search facility (although in our opinion many fall woefully short, failing to implement such features as spelling correction and synonyms).
For more Enterprise Search events, there’s a calendar provided by Information Today which is pretty exhaustive.
We’re sponsoring a one-day event on open source search. Details are here, and more will be announced soon. Hope some of you can make it!
You can now see a list of events and conferences we’ll be attending – hope to meet some of you there!
Our technical partners Cognidox have released a whitepaper detailing their view of the enterprise search market, titled “Why you can’t just ‘Google’ for Enterprise Knowledge” – it’s well worth a read. You can download the PDF from their archive.
We finally decided to move entirely to flax.co.uk. The one page remaining is the news archive.
Microsoft have been asking open source companies to compete on value rather than on cost, according to ZDNet. The response hasn’t exactly been positive, as CNET reports. I doubt many open source vendors will take much notice of what Microsoft would like them to do; I suspect they will happily continue to point out that, for customers buying software and services, taking the cost of the software out of the equation entirely is almost certain to save money.
We’ve updated the Flax website with a page showing the Flax software stack – hopefully this will go some way towards explaining how Xapian, Xappy and parts of Flax all fit together. There’s still lots in development so expect some more news later this month.
As part of this, we’ve created a new page bringing together all the Win32 files for Xapian that we maintain, including some pre-built binaries for those of you who don’t want to compile Xapian yourself. We’re working on one-click installable packages for the bindings for the various languages, although at present we’ve only finished the one for Python. Hopefully some users of the other languages will let us know how best to present the other bindings.