Archive for the ‘News’ Category

Building a new press cuttings service for the Financial Times

Those of you who read my slides from Search Solutions 2010 will have spotted a case study on our work for the Financial Times, one of the world’s leading business news organisations.

When the Financial Times decided to bring their digital press cuttings in-house in summer 2010, they asked us to build a powerful ‘search server’ that they could easily integrate into their existing product offerings.

We built an indexer for the XML source data and a RESTful Web Service API, offering search features including Boolean operators, phrase searches, area specifiers (search whole article, body, headline, byline or any combination), date range restrictions, similarity search (“articles like this one”) and faceted search. Spelling correction, synonyms and detailed logging of indexing and all searches are also available.
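To give a feel for how a client might combine these features in one request, here is a minimal Python sketch. The host, path and parameter names (`q`, `fields`, `from`, `to`, `facet`) are hypothetical and chosen only for illustration; they are not the actual API of the service described here.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter names, for illustration only;
# they are not taken from the real service.
BASE_URL = "http://search.example.com/api/search"


def build_search_url(query, fields=None, date_from=None, date_to=None, facets=None):
    """Build a GET URL for a search request against the hypothetical API."""
    params = [("q", query)]  # Boolean operators and quoted phrases live in the query string
    if fields:
        # Area specifier: restrict matching to e.g. headline and/or body
        params.append(("fields", ",".join(fields)))
    if date_from:
        params.append(("from", date_from))
    if date_to:
        params.append(("to", date_to))
    for facet in facets or []:
        # One repeated parameter per facet we want counts for
        params.append(("facet", facet))
    return BASE_URL + "?" + urlencode(params)


url = build_search_url(
    '"interest rates" AND bank',
    fields=["headline", "body"],
    date_from="2010-01-01",
    date_to="2010-06-30",
    facets=["section"],
)
```

Keeping every option in the query string means a search can be bookmarked, logged and replayed as a single URL, which suits a RESTful design.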

This might sound like a complex task, but using open source technology we created this system in less than a fortnight. Initially designed as a small-scale prototype, the system scaled easily to indexing hundreds of thousands of pages. You can use the service at http://presscuttings.ft.com.

Posted in News, Technical

October 25th, 2010

Search Solutions 2010 – a brief review

I spent yesterday at Search Solutions 2010, hosted by the British Computer Society. They’d been kind enough to ask me to speak (Update: my slides are available here, the rest are available at the event website above), but there were plenty of other people to listen to as well. There’s a great blow-by-blow account from Tyler Tate already, but here are some personal highlights:

Google’s Behshad Behzadi spoke about freshness for web content and how Google’s usual ranking strategy favours older results over new ones – as the new ones don’t have so many links. Vishwa Vinay from Microsoft talked on what to do with click data in enterprise search – he listed lots of papers on the subject, and hopefully his slides will be published so we can follow them up. He made the point that any ‘adaptive’ ranking based on click data must still work well out of the box, before any clicks have happened. This section of the event finished with Vivian Lin Dufour of Yahoo!, talking about some ways of guiding searchers from within the UI, with auto-suggest and similar techniques. Apparently the research the Yahoo team are doing on trending has let them spot news stories 12-24 hours before they hit the papers. I wondered afterwards: is this current fad for ‘trendspotting’ turning search engines into just a media channel? I don’t care much about the X-Factor TV show myself, so why should this current trend influence my search results?

Nick Patience started the next session talking about trends in the Enterprise Search market: he acknowledged the rapid rise of open source solutions and talked about how search-based applications will become increasingly important, with a huge market for ‘information governance’ solutions opening up. Chirag Ghandhi of Mphasis, a search integrator, talked about how customers are disillusioned with enterprise search, and how difficult it is to build solutions that cope with data from a range of different sources and in different languages. Dusan Rnic of Endeca stressed the importance of being able to handle the ‘long tail’ of search results – the ones that aren’t the most popular – and showed us his favourite website: strangely enough, an Endeca customer.

Greg Lindahl talked about how Blekko have built an innovative web crawling/indexing framework, which has enabled them to build up a 3 billion page index very efficiently – I’m looking forward to seeing more of this. As he said, what they’re doing isn’t necessarily better than Google, but it’s certainly different. My talk on open source search for news content followed, and then Roberto Cornacchia showed us Spinque’s approach to building search platforms – encapsulating search expert knowledge into logical ‘blocks’ that can be combined by domain experts into the solutions they actually need.

The last session began with Till Kinstler of GBV Common Library Network, a self-described ‘library hacker’, on building a search system using the open source engine Solr over 25 million library records – they’re now aiming for 120 million, taken from 400 different libraries, in source formats going all the way back to tape and paper library cards! We then heard about the Information Retrieval Facility, an open IR research institution – I liked their three principles of ‘open science, open source, open market’. The talks finished with Rob Stacey on True Knowledge’s ways of checking the veracity of facts gathered from the internet.

We then moved on to an open panel – some great themes here including the rise of search as a platform for new applications, what exciting (or scary) things Facebook might bring to the world of search, and how we should all work harder to bring good information retrieval mechanisms to those who cannot currently access them due to poverty, language barriers or disability.

Thanks to the BCS IRSG and in particular to Udo Kruschwitz for a very interesting and enlightening day.

Posted in News, events

October 22nd, 2010

A revolution indeed

I’m at the Lucene Revolution conference in Boston, USA for the next few days – and it’s aptly named. If there’s anyone out there who still doubts that open source search is a serious alternative to a commercial engine, the numbers and other information coming out of this event will be convincing. Twitter are now using Lucene to handle a billion queries a day; LinkedIn and SalesForce.com are already veterans with similarly huge installations. The conversations I’m having and overhearing are about billions of documents, tens of thousands of users, all easily handled by open source search.

The other big news here is that Lucid Imagination have released software to fill in most if not all of the gaps between Lucene/Solr and the closed-source competition – it’s called LucidWorks Enterprise and adds a detailed administration UI, a REST API, crawlers, scaling functionality and much more. I’m looking forward to getting my hands on a demo and showing it off when back in the UK.

There’s an optimistic, buzzing energy at this event – a real feeling that we’re here at the beginning of something big. More revolutionary news to come!

Posted in News, events

October 7th, 2010

Flax partners with Lucid Imagination

We’re very happy to announce that we’ve been selected as an Authorized Partner by Lucid Imagination, the commercial company for Lucene and Solr. You can read the press release as a PDF here.

Apache Lucene and Solr, available as open source software from the Apache Software Foundation, are powerful, scalable, reliable and fully-featured search technologies. Solr is the Lucene Search Server, making it easy to build search applications for the enterprise.

With our long experience of customising, installing and supporting open source search engines, this partnership is a natural fit for us, and we’re excited by the opportunities it presents. In addition to our current offerings, Flax will now offer installation, integration and commercial support packages for Lucene and Solr, backed by Lucid Imagination.

Posted in Business, News

October 4th, 2010

The Times they are a-changing….

News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.

The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.

Posted in Business, News

March 26th, 2010

FAST drops Linux & Unix support – no surprise?

Last week we heard from various sources that Microsoft had announced it would continue to develop its recently acquired FAST Search technology only on Windows. This had long been feared by some in the sector, and it must be worrying for existing customers.

Platform choice can be a key issue for those looking to implement advanced search, as there may be significant existing in-house expertise and investment in a particular platform. Our Flax solution works just as well on Windows, Linux or Solaris. It’s sad to see such a powerful technology as FAST become so narrow in focus, but it’s not particularly surprising after the Microsoft acquisition.

UPDATE: more coverage on this from The Register

Posted in Business, News

February 9th, 2010

Online Information 2009, day 3

Back at Online 2009 on Thursday, to take part in the closing panel: “Cloud Computing, Open Source and Semantics: Content and Search Predictions”, moderated by Stephen Arnold. We only touched on four of the ten controversial themes Stephen had prepared: how ‘Google pressure’ will affect the market (we talked a lot about this), how XML isn’t necessarily the universal panacea for representing data, the growth of rich media and the challenges it presents, and finally security. There were some great questions from the floor as well. Thanks to all who came, and to the organisers and Stephen for inviting us. I wish we’d had more time!

I didn’t agree with Stephen’s main point that Google will crush us all – I think the battles between Google and Microsoft (and Google and everyone else) are a distraction. While they’re fighting it out the rest of us can get on with developing cutting-edge search technologies. Open source search technology gives us tremendous flexibility, allows us to develop solutions very fast, allows the customer to take ownership of the system that’s being developed, and now offers performance, scalability and commercial support comparable to the traditional closed source world.

The real question is how this will affect the profitability of existing companies in the search space. I wonder who won’t be around at next year’s Online Information show…

Posted in Business, News

December 4th, 2009

Flax Newsletters

I’ve created a page with links to our Flax Newsletters – let us know if you would like to be added to the mailing list (or indeed, if you’d like to be removed from it).

Posted in News

December 2nd, 2009

Finding French TV with Flax

We’ve recently been working with mySkreen, who, like Hulu in the U.S., provide a service for finding and viewing television programmes via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.

mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). Using Flax, you can search using programme title, actors, genres or time periods. We also added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into more exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. We also added faceted search and automatic spelling correction.
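As a rough illustration of the fuzzy time parsing idea, here is a minimal Python sketch that turns a phrase such as ‘tomorrow evening’ into an exact start and end time. It is not the code we delivered: the English vocabulary, the `DAY_OFFSETS` and `DAY_PARTS` tables and the hour boundaries are all assumptions made for the example, and a French-language service would need its own word lists.

```python
from datetime import date, datetime, time, timedelta

# Assumed vocabulary and hour boundaries, for illustration only.
DAY_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1}
DAY_PARTS = {
    "morning": (time(6, 0), time(12, 0)),
    "afternoon": (time(12, 0), time(18, 0)),
    "evening": (time(18, 0), time(23, 0)),
}


def parse_time_phrase(phrase, today):
    """Return (start, end) datetimes for a fuzzy phrase, or None if no time words are found."""
    words = phrase.lower().split()
    offset = next((DAY_OFFSETS[w] for w in words if w in DAY_OFFSETS), None)
    part = next((w for w in words if w in DAY_PARTS), None)
    if "tonight" in words:
        part = "evening"  # 'tonight' implies both the day and the part of day
    if offset is None and part is None:
        return None  # nothing time-like: treat the phrase as ordinary search terms
    day = today + timedelta(days=offset or 0)
    if part is None:
        # A bare 'today' or 'tomorrow' covers the whole day
        start = datetime.combine(day, time(0, 0))
        return start, start + timedelta(days=1)
    start_t, end_t = DAY_PARTS[part]
    return datetime.combine(day, start_t), datetime.combine(day, end_t)


period = parse_time_phrase("tomorrow evening", date(2009, 11, 26))
```

The resulting pair can then be used as an ordinary date range restriction on the broadcast time, so the rest of the search pipeline does not need to know the query was fuzzy.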

This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.

Posted in News

November 26th, 2009

When real-time search isn’t

Avi Rappoport writes about ‘real-time’ search, a popular subject at the moment. Twitter search is one example of this kind of application, where a stream of new content is arriving very quickly.

From a search engine developer’s point of view there are various things to consider: how quickly new content must become searchable, how to balance this against performance demands and how to rank the results.

A lot of search engine architectures are built on the assumption that indexes won’t need to be updated very often, sacrificing index freshness for search speed, so constantly adding new content is expensive in terms of performance. One approach is to maintain several indexes: a small, fresh one and some older, static ones, with the fresh index periodically being merged into the older static set. Searches must be made across all these indexes of course, with care taken to maintain accurate statistics and thus relevancy ranking.
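The fresh-plus-static arrangement can be sketched in a few lines of Python. This is a toy in-memory model, not how any particular engine implements it: the `Segment` class, the whitespace tokeniser and the simple TF-IDF weighting are all assumptions for the example. The point to notice is that document frequencies are summed across every index before scoring, so ranking does not change when the fresh index is merged into the static one.

```python
import math
from collections import Counter, defaultdict


class Segment:
    """A tiny in-memory inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def merge_from(self, other):
        """Fold another segment's postings into this one (doc ids assumed unique)."""
        for term, docs in other.postings.items():
            self.postings[term].update(docs)
        self.doc_count += other.doc_count


def search(segments, query):
    """Score documents across all segments using statistics combined over all of them."""
    total_docs = sum(s.doc_count for s in segments)
    scores = defaultdict(float)
    for term in query.lower().split():
        # Document frequency is summed over every segment, so IDF is global
        df = sum(len(s.postings.get(term, {})) for s in segments)
        if df == 0:
            continue
        idf = math.log(1 + total_docs / df)
        for s in segments:
            for doc_id, tf in s.postings.get(term, {}).items():
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


static, fresh = Segment(), Segment()
static.add("a", "budget news")
static.add("b", "sport news")
fresh.add("c", "budget budget update")  # newly arrived content goes to the small index

before_merge = search([static, fresh], "budget")
static.merge_from(fresh)  # the periodic merge into the static set
after_merge = search([static], "budget")
```

If each index scored with only its own statistics, a term that is rare in the small fresh index but common overall would be over-weighted there, and results would shift every time a merge happened.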

The question of ranking is also an interesting one: in a ‘real-time’ situation, how should we present the results – does ‘more recent’ always trump ‘more relevant’? As always, a combination of both is probably the best default approach, with an option available to the user to choose one or the other.
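One simple way to combine the two, sketched here in Python, is a weighted blend of a relevance score and a freshness score that decays with age. The function names, the 24-hour half-life and the assumption that relevance is already normalised to the range 0 to 1 are all illustrative choices, not a description of any particular engine.

```python
def blended_score(relevance, age_hours, recency_weight=0.5, half_life_hours=24.0):
    """Blend a relevance score in [0, 1] with a freshness score that halves every half-life."""
    freshness = 0.5 ** (age_hours / half_life_hours)
    return (1 - recency_weight) * relevance + recency_weight * freshness


def rank(results, recency_weight=0.5):
    """Order (doc_id, relevance, age_hours) tuples by blended score, best first."""
    ordered = sorted(
        results,
        key=lambda r: blended_score(r[1], r[2], recency_weight),
        reverse=True,
    )
    return [doc_id for doc_id, _, _ in ordered]


results = [
    ("old_relevant", 0.9, 72),  # strong match, three days old
    ("new_weak", 0.4, 1),       # weak match, one hour old
    ("mid", 0.7, 24),           # decent match, one day old
]

by_relevance = rank(results, recency_weight=0.0)
by_recency = rank(results, recency_weight=1.0)
by_default = rank(results)
```

Setting `recency_weight` to 0 or 1 gives the user the ‘one or the other’ option, while values in between provide the combined default.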

In any case there will always be some delay between content being published and being searchable – the trick is to keep this to the minimum, so it appears as ‘real-time’ as possible.

Posted in News, Technical

November 5th, 2009
