media – Flax

FIBEP WMIC 2015 – Open source search for media monitoring with Solr

Charlie Hull — Thu, 19 Nov 2015 16:23:46 +0000

FIBEP WMIC 2015 – How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform from Charlie Hull

The post FIBEP WMIC 2015 – Open source search for media monitoring with Solr appeared first on Flax.

Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning

Charlie Hull — Wed, 04 Nov 2015 11:48:33 +0000

I’ll be speaking at several events over the next few weeks, in the UK and abroad. On the 19th of November I’ll be at the FIBEP World Media Intelligence Congress in Vienna, to talk about how we helped our client Infomedia migrate from a closed-source search engine (Autonomy IDOL and Verity) to a new platform based on Apache Lucene/Solr and our own Luwak stored search library. Infomedia are Denmark’s leading provider of media monitoring and analysis and wanted to future-proof their search platform: we’ll talk about open source makes this possible and how we implemented stored search, handled highly complex queries and how the new platform is scalable and flexible.

On the 25th I’ll be presenting at the London Elasticsearch Usergroup with our client Westcoast, who we have been helping with an Elasticsearch implementation. Westcoast are a B2B supplier of electronics and white goods with yearly revenues of over £1billion, and we’ve helped them implement a powerful new search engine for their website. E-commerce is one sector where good search is an essential part of driving revenue.

Next, on the 26th I’ll be talking one of my favourite events of the year, the British Computer Society Information Retrieval Specialist Group’s Search Solutions, on how we might improve how search engine relevance is tested. I’ll suggest a more formal process of test-based relevance tuning and show some useful tools. Our client NLA media access are also talking about the new Clipshare platform we built on Apache Lucene/Solr.

Do let me know if you’re attending and would like to chat – I’ll also be publishing slides and more information about the projects above soon.

The post Talks: Replacing Autonomy IDOL with Solr, Elasticsearch for e-commerce & relevancy tuning appeared first on Flax.

Eleven years of open source search

Charlie Hull — Mon, 30 Jul 2012 13:43:33 +0000

It’s now eleven years since we started Flax (initially as Lemur Consulting Ltd) in late July 2001, deciding to specialise in search application development with a focus on open source software. At the time the fallout from the dotcom crash was still evident and like today the economic picture was far from rosy. Since few people even knew what a search engine was (Google was relatively new and had only started selling advertising a year before) it wasn’t always easy for us to find a market for our services.

When we visited clients they would list their requirements and we would then tell them how we believed open source search could help (often having to explain the open source movement first). Things are different these days: most of our enquiries come from those who have already chosen open source search software such as Apache Lucene/Solr but need our help in installing, integrating or supporting it. There’s also a rise in those clients considering applications and techniques outside the traditional site search or intranet search – web scraping and crawling for data aggregation, taxonomies and automatic classification, automatic media monitoring and of course massive scalability, distributed processing and Big Data. Even the UK government are using open source search.

So after all this time I’m tending to agree with Roger Magoulas of O’Reilly: open source won, and we made the right choice all those years ago.

The post Eleven years of open source search appeared first on Flax.

Media monitoring with open source search – 20 times faster than before!

Charlie Hull — Wed, 25 Jul 2012 09:39:38 +0000

We’re happy to announce we’ve just finished a successful project for a division of the Australian Associated Press to replace a closed source search engine with a considerably more powerful open source solution. You can read the press release here.

As our client had a large investment in stored searches (which represent a client’s interests) which were defined in the query language of their previous search engine, we first had to build a modified version of Apache Lucene that replicated exactly this syntax. I’ve previously blogged about how we did this. However this wasn’t the only challenge: search engines are designed to be good at applying a few queries to a very large document collection, not necessarily at applying tens of thousands of stored queries to every single new document. For media monitoring applications this kind of performance is essential as there may be hundreds of thousands of news articles to monitor every day. The system we’ve built is capable of applying tens of thousands of stored queries every second.

With the rapid increase in the volume of content that media monitoring companies have to check for their clients – today’s news isn’t just in print, but online, in social media and indeed multimedia – it may be that open source software is the only way to build monitoring systems that are economically scalable, while remaining accurate and flexible enough to deliver the right results to clients.

The post Media monitoring with open source search – 20 times faster than before! appeared first on Flax.

Search backwards – media monitoring with open source search

Charlie Hull — Thu, 08 Mar 2012 12:00:45 +0000

We’re working with a number of clients on media monitoring solutions, which are a special case of search application (we’ve worked on this previously for Durrants). In standard search, you apply a single query to a large amount of documents, expecting to get a ranked list of documents that match your query as a result. However in media monitoring you need to search each incoming document (for example, a news article or blog post) with many queries representing what the end user wants to monitor – and you need to do this quickly as you may have tens or hundreds of thousands of articles to monitor in close to real time (Durrants have over 60,000 client queries to apply to half a million articles a day). This ‘backwards’ search isn’t really what search engines were designed to do, so performance could potentially be very poor.

There are several ways around this problem: for example in most cases you don’t need to monitor every article for every client, as they will have told you they’re only interested in certain sources (for example, a car manufacturer might want to keep an eye on car magazines and the reviews in the back page of the Guardian Saturday magazine, but doesn’t care about the rest of the paper or fashion magazines). However, pre-filtering queries in this way can be complex especially when there are so many potential sources of data.

We’ve recently managed to develop a method for searching incoming articles using a brute-force approach based on Apache Lucene which in early tests is performing very well – around 70,000 queries applied to a single article in around a second on a standard MacBook. On suitable server hardware this would be even faster – and of course you have all the other features of Lucene potentially available, such as phrase queries, wildcards and highlighting. We’re looking forward to being able to develop some powerful – and economically scalable – media monitoring solutions based on this core.

The post Search backwards – media monitoring with open source search appeared first on Flax.

Search Solutions 2011 review

Charlie Hull — Thu, 17 Nov 2011 16:29:38 +0000

I spent yesterday at the British Computer Society Information Retrieval Specialist Group’s annual Search Solutions conference, which brings together theoreticians and practitioners to discuss the latest advances in search.

The day started with a talk by John Tait on the challenges of patent search where different units are concerned – where for example a search for a plastic with a melting point of 200°C wouldn’t find a patent that uses °F or Kelvin. John presented a solution from max.recall, a plugin for Apache Solr that promises to solve this issue. We then heard from Lewis Crawford of the UK Web Archive on their very large index of 240m archived webpages – some great features were shown including a postcode-based browser. The system is based on Apache Solr and they are also using ‘big data’ projects such as Apache Hadoop – which by the sound of it they’re going to need as they’re expecting to be indexing a lot more websites in the future, up to 4 or 5 million. The third talk in this segment came from Toby Mostyn of Polecat on their MeaningMine social media monitoring system, again built on Solr (a theme was beginning to emerge!). MeaningMine implements an iterative query method, using a form of relevance feedback to help users contribute more useful query information.

Before lunch we heard from Ricardo Baeza-Yates of Yahoo! on moving beyond the ‘ten blue links’ model of web search, with some fascinating ideas around how we should consider a Web of objects rather than web pages. Gabriella Kazai of Microsoft Research followed, talking about how best to gather high-quality relevance judgements for testing search algorithms, using crowdsourcing systems such as Amazon’s Mechanical Turk. Some good insights here as to how a high-quality task description can attract high-quality workers.

After lunch we heard from Marianne Sweeney with a refreshingly candid treatment of how best to tune enterprise search products that very rarely live up to expectations – I liked one of her main points that “the product is never what was used in the demo”. Matt Taylor from Funnelback followed with a brief overview of his company’s technology and some case studies.

The last section of the day featured Iain Fletcher of Search Technologies on the value of metadata and on their interesting new pipeline framework, Aspire. (As an aside, Iain has also joined the Pipelines meetup group I set up recently). Next up was Jared McGinnis of the Press Association on their work on Semantic News – it was good to see an openly available news ontology as a result. Ian Kegel of British Telecom came next with a talk about TV program recommendation systems, and we finished with Kristian Norling‘s talk on a healthcare information system that he worked on before joining Findwise. We ended with a brief Fishbowl discussion which asked amongst other things what the main themes of the day had been – my own contribution being “everyone’s using Solr!”.

It’s rare to find quite so many search experts in one room, and the quality of discussions outside the talks was as high as the quality of the talks themselves – congratulations are due to the organisers for putting together such an interesting programme.

The post Search Solutions 2011 review appeared first on Flax.

Bicycles, beer and bands – the first Cambridge Enterprise Search Meetup

Charlie Hull — Thu, 17 Feb 2011 10:31:14 +0000

Last night we held the first of what we hope will be a series of Meetups in our home town of Cambridge, U.K. Attending were researchers, developers and entrepreneurs in the field of search – as is the norm in Cambridge many had cycled to the venue, and there was a friendly and informal feel to the group.

We started with my presentation on “Searching news media with open source software”, where I talked about our work for the NLA, Financial Times and others. We followed with John Snyder of Grapeshot on “Using Search to Connect Multiple Variants of An Object to One Central Object”. John showed a Grapeshot project for Virgin where different media assets can be automatically grouped together even if they have different metadata – for example an episode of the TV show “Heroes” is basically the same object whether it is broadcast, video-on-demand or a repeat, but differs from the Bowie album of the same name.

We then broke up for discussion (and beer) – great to catch up with some ex-colleagues and meet others for the first time. Downstairs there was live music and one of our colleagues even joined the band for a spell on drums! From the feedback we recieved there’s definitely interest in repeating the event, if you’d like to attend next time please join the Meetup group.

The post Bicycles, beer and bands – the first Cambridge Enterprise Search Meetup appeared first on Flax.

Next-generation media monitoring with open source search

Charlie Hull — Mon, 13 Dec 2010 14:05:40 +0000

Media monitoring is not a traditional search application: for a start, instead of searching a large number of documents with a single query, a media monitoring application must search every incoming news story with potentially thousands of queries, searching for words and terms relevant to client requirements. This can be difficult to scale, especially when accuracy must be maintained – a client won’t be happy if their media monitors miss relevant stories or send them news that isn’t relevant.

We’ve been working with Durrants Ltd. of London for a while now on replacing their existing (closed source) search engine with a system built on open source. This project, which you can read more about in a detailed case study (PDF), has reduced the hardware requirements significantly and led to huge accuracy improvements (in some cases where 95% of the results passed through to human operators were irrelevant ‘false positives’, the new system is now 95% correct).

The new system is built on Xapian and Python and supports all the features of the previous engine, to ease migration – it even copes with errors introduced during automated scanning of printed news. The new system scales easily and cost effectively.

As far as we know this is one of the first large-scale media monitoring systems built on open source, and a great example of search as a platform, which we’ve discussed before.

The post Next-generation media monitoring with open source search appeared first on Flax.

Autumn events

Charlie Hull — Fri, 10 Sep 2010 15:41:03 +0000

Autumn seems to be conference season: first is the Lucene Revolution event in Boston, USA from October 7th-8th, where I’ll be on the closing panel whose subject is “Data Crossroads – At The Intersection Of Search And Open Source”.

Next is the British Computer Society’s Search Solutions 2010 in London on October 21st, where I’m giving a presentation titled “What’s the story with open source? – Searching and monitoring news media with open-source technology”.

Both events feature a wide range of other speakers from organisations such as Cisco, LinkedIn, Twitter, Google and Microsoft.

The post Autumn events appeared first on Flax.

The Times they are a-changing….

Charlie Hull — Fri, 26 Mar 2010 11:44:38 +0000

News International have announced they will be charging for access to their Times and Sunday Times newspaper websites within a few months. At the same time we have the announcement that the Independent newspaper is to be bought by a Russian oligarch, and may end up as a free publication. This divergence of business models is interesting, but what concerns us at Flax is how technology will help newspaper websites differentiate themselves.

The NLA’s ClipShare and ClipSearch services, which are powered by Flax, are good models for monetizing newspaper content, and are already in use at some of the U.K.’s largest publishers. If you need to quickly find a particular story, see related articles and grasp an overview of coverage you need scalable, highly accurate search technology. Users have been conditioned to expect search to ‘just work’, and they simply won’t pay for anything that doesn’t come up to scratch.

The post The Times they are a-changing…. appeared first on Flax.