Media monitoring with open source search – 20 times faster than before!

We’re happy to announce we’ve just finished a successful project for a division of the Australian Associated Press to replace a closed source search engine with a considerably more powerful open source solution. You can read the press release here.

Our client had a large investment in stored searches (each representing a client's interests) defined in the query language of their previous search engine, so we first had to build a modified version of Apache Lucene that replicated that syntax exactly. I’ve previously blogged about how we did this. However, this wasn’t the only challenge: search engines are designed to be good at applying a few queries to a very large document collection, not necessarily at applying tens of thousands of stored queries to every single new document. For media monitoring applications this kind of performance is essential, as there may be hundreds of thousands of news articles to monitor every day. The system we’ve built is capable of applying tens of thousands of stored queries every second.
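
We haven’t published the matching code itself, but the general shape of the approach (index each incoming article into a small in-memory index, then run the cached stored queries against it in turn) can be sketched with stock Lucene. The sketch below is only that, a sketch: it uses Lucene’s MemoryIndex and the standard QueryParser as a stand-in for our modified query syntax, and the class and field names are illustrative rather than a description of the production code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

/** Illustrative sketch: match incoming articles against cached stored queries. */
public class StoredQueryMatcher {

    private final Analyzer analyzer = new StandardAnalyzer();

    // Stored searches, parsed once at startup and cached in memory, keyed by monitor id.
    private final Map<String, Query> storedQueries = new LinkedHashMap<>();

    public void addStoredQuery(String id, String queryString) throws Exception {
        // In the real system this would go through the modified (legacy-syntax) parser.
        storedQueries.put(id, new QueryParser("body", analyzer).parse(queryString));
    }

    // Index a single incoming article in memory and run every cached query against it.
    public Map<String, Float> match(String title, String body) {
        MemoryIndex article = new MemoryIndex();
        article.addField("title", title, analyzer);
        article.addField("body", body, analyzer);

        Map<String, Float> hits = new LinkedHashMap<>();
        for (Map.Entry<String, Query> entry : storedQueries.entrySet()) {
            float score = article.search(entry.getValue()); // 0.0f means no match
            if (score > 0.0f) {
                hits.put(entry.getKey(), score);
            }
        }
        return hits;
    }
}
```

Because each in-memory index holds a single document, the cost per article is dominated by how quickly the cached queries can be executed, which is why parsing the stored searches once and keeping them in memory matters.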

With the rapid increase in the volume of content that media monitoring companies have to check for their clients – today’s news isn’t just in print, but online, in social media and indeed multimedia – it may be that open source software is the only way to build monitoring systems that are economically scalable, while remaining accurate and flexible enough to deliver the right results to clients.

2 thoughts on “Media monitoring with open source search – 20 times faster than before!”

  1. Congratulations on the successful project!

    I’m wondering how you did efficient reverse searches with Lucene. Did you scale up near-real-time searches?

    FYI, I did this kind of reverse search in Python (https://github.com/shane42/psearch) using an index. Something based on Lucene would be great if you did use that approach and were prepared to open source it 🙂

  2. Thanks, Shane. Nothing so clever, I’m afraid. It just indexes the incoming document into an in-memory DB and throws queries (cached in memory) at it sequentially. Initial tests on a MacBook demonstrated 50k qps, which was more than fast enough for the current requirements. However, we’re certainly interested in refinements for the next version, so thanks for the link – we’ll look into it.
