Posts Tagged ‘media’

Search Solutions 2011 review

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search when different units are involved: for example, a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need, as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently). Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV program recommendation systems, and we finished with Kristian Norling’s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.

Bicycles, beer and bands – the first Cambridge Enterprise Search Meetup

Last night we held the first of what we hope will be a series of Meetups in our home town of Cambridge, U.K. Attending were researchers, developers and entrepreneurs in the field of search – as is the norm in Cambridge, many had cycled to the venue, and there was a friendly and informal feel to the group.

We started with my presentation on “Searching news media with open source software”, where I talked about our work for the NLA, Financial Times and others. We followed with John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”. John showed a Grapeshot project for Virgin where different media assets can be automatically grouped together even if they have different metadata – for example an episode of the TV show “Heroes” is basically the same object whether it is broadcast, video-on-demand or a repeat, but differs from the Bowie album of the same name.

We then broke up for discussion (and beer) – great to catch up with some ex-colleagues and meet others for the first time. Downstairs there was live music and one of our colleagues even joined the band for a spell on drums! From the feedback we received there’s definitely interest in repeating the event. If you’d like to attend next time, please join the Meetup group.

Posted in events

February 17th, 2011

Next-generation media monitoring with open source search

Media monitoring is not a traditional search application: for a start, instead of searching a large number of documents with a single query, a media monitoring application must search every incoming news story with potentially thousands of queries, searching for words and terms relevant to client requirements. This can be difficult to scale, especially when accuracy must be maintained – a client won’t be happy if their media monitors miss relevant stories or send them news that isn’t relevant.
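To make the ‘many queries, one document’ idea concrete, here is a minimal sketch in Python. It is not how the system described in this post works internally: the `StoredQueryMatcher` class, its method names and the rule that every term of a stored query must appear in the story are all assumptions made for illustration. The point it shows is that indexing the stored queries by term lets each incoming story be checked against only the queries that share a word with it, rather than against all of them.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class StoredQueryMatcher:
    """Hypothetical matcher: a stored query matches when all its terms occur in a story."""

    def __init__(self):
        self.queries = {}                # query id -> set of required terms
        self.by_term = defaultdict(set)  # term -> ids of queries that contain it

    def add_query(self, query_id, text):
        """Store a client query and index it under each of its terms."""
        terms = tokenize(text)
        self.queries[query_id] = terms
        for term in terms:
            self.by_term[term].add(query_id)

    def match(self, story):
        """Return the sorted ids of stored queries whose terms all occur in the story."""
        story_terms = tokenize(story)
        # Only queries sharing at least one term with the story are candidates.
        candidates = set()
        for term in story_terms:
            candidates |= self.by_term.get(term, set())
        return sorted(q for q in candidates if self.queries[q] <= story_terms)


matcher = StoredQueryMatcher()
matcher.add_query("acme", "Acme Widgets")
matcher.add_query("rates", "interest rates")
print(matcher.match("Acme Widgets announces record profits"))
```

A real monitoring system also needs phrases, boolean operators, proximity and tolerance of scanning errors, none of which this sketch attempts.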

We’ve been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant ‘false positives’, the new system is now 95% correct).

The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. The new system scales easily and cost effectively.

As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.

Posted in News

December 13th, 2010

Autumn events

Autumn seems to be conference season: first is the Lucene Revolution event in Boston, USA from October 7th-8th, where I’ll be on the closing panel whose subject is “Data Crossroads – At The Intersection Of Search And Open Source”.

Next is the British Computer Society’s Search Solutions 2010 in London on October 21st, where I’m giving a presentation titled “What’s the story with open source? – Searching and monitoring news media with open-source technology”.

Both events feature a wide range of other speakers from organisations such as Cisco, LinkedIn, Twitter, Google and Microsoft.

Posted in events

September 10th, 2010

The Times they are a-changing….

News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.

The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage, you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.

Posted in Business, News

March 26th, 2010

Finding French TV with Flax

We’ve recently been working with mySkreen, who, like Hulu in the U.S., provide a service for finding and viewing television programmes via your web browser. mySkreen is the brainchild of Frédéric Sitterlé, previously Head of New Media at the Le Figaro media group.

mySkreen works with French-language content, and is currently indexing over 1.6 million programmes (and counting). Using Flax, you can search using programme title, actors, genres or time periods. We also added some innovative query parsing to translate fuzzy queries such as ‘tomorrow evening’ into more exact time periods, and some clever ranking so that ‘more easily available’ programmes appear higher in the search results. We also added faceted search and automatic spelling correction.
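As a rough illustration of what translating a fuzzy query like ‘tomorrow evening’ into an exact time period can involve, here is a small Python sketch. It is not the code delivered to mySkreen: the function name `parse_time_phrase`, the English vocabulary and the hour boundaries chosen for each part of the day are all assumptions.

```python
from datetime import date, datetime, time, timedelta

# Assumed vocabulary and hour boundaries, chosen only for this illustration.
DAY_OFFSETS = {"today": 0, "tonight": 0, "tomorrow": 1}
DAY_PARTS = {
    "morning": (time(6, 0), time(12, 0)),
    "afternoon": (time(12, 0), time(18, 0)),
    "evening": (time(18, 0), time(23, 0)),
    "tonight": (time(18, 0), time(23, 0)),
}


def parse_time_phrase(phrase, today):
    """Turn a phrase such as 'tomorrow evening' into a (start, end) datetime pair.

    Returns None when the phrase contains no recognised day or part-of-day word.
    """
    offset = None
    part = None
    for word in phrase.lower().split():
        if word in DAY_OFFSETS:
            offset = DAY_OFFSETS[word]
        if word in DAY_PARTS:
            part = DAY_PARTS[word]
    if offset is None and part is None:
        return None
    day = today + timedelta(days=offset or 0)
    # With no part of the day given, cover the whole day.
    start, end = part if part else (time(0, 0), time(23, 59))
    return datetime.combine(day, start), datetime.combine(day, end)


start, end = parse_time_phrase("tomorrow evening", date(2009, 11, 26))
print(start, end)
```

The resulting range can then be applied as a filter on broadcast times alongside the rest of the query.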

This was a fast-moving project with a very quick turnaround: we first visited mySkreen in Paris in August and delivered customised code to them less than four weeks later; the flexibility of Flax and the open source model helped to make this possible.

Posted in News

November 26th, 2009
