Media Monitoring – Flax
http://www.flax.co.uk – The Open Source Search Specialists

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch
Thu, 01 Feb 2018 – http://www.flax.co.uk/blog/2018/02/01/finding-bad-actor-custom-scoring-forensic-name-matching-elasticsearch/

Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch from Charlie Hull

The post Finding the Bad Actor: Custom scoring & forensic name matching with Elasticsearch appeared first on Flax.

Worth the wait – Apache Kafka hits 1.0 release
Thu, 02 Nov 2017 – http://www.flax.co.uk/blog/2017/11/02/worth-wait-apache-kafka-hits-1-0-release/

We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple but powerful concepts – everything it deals with is a stream of data (like a messaging system), streams can be combined for processing and stored reliably in a highly fault-tolerant way. It’s also massively scalable.

For search applications, Kafka is a great choice for the ‘wiring’ between source data (databases, crawlers, flat files, feeds) and the search index and other parts of the system. We’ve used other message passing systems (like RabbitMQ) in projects before, but none have the simplicity and power of Kafka. Combine the search index with analysis and visualisation tools such as Kibana and you can build scalable, real-time systems for ingesting, storing, searching and analysing huge volumes of data – for example, we’ve already done this for clients in the financial sector wanting to monitor log data using open-source technology, rather than commercial tools such as Splunk.
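
As a rough illustration of this 'wiring' role, the Python sketch below (with hypothetical index and field names – not code from any of the projects mentioned) turns a batch of records consumed from a Kafka topic into an Elasticsearch bulk indexing payload; the Kafka consumer loop and the HTTP call are left out to keep it self-contained.

```python
import json

def to_bulk_body(docs, index="news"):
    """Turn a batch of documents (as consumed from a Kafka topic) into
    an Elasticsearch _bulk request body: one NDJSON action line per
    document, followed by its source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc["id"]}}))
        lines.append(json.dumps({"title": doc["title"], "body": doc["body"]}))
    return "\n".join(lines) + "\n"
```

In a real pipeline this function would sit inside the consumer loop, with the returned body POSTed to the cluster's `_bulk` endpoint.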

The development of Kafka has been masterminded by our partners Confluent, and it’s a testament to this careful management that the milestone 1.0 version has only just appeared. This doesn’t mean that previous versions weren’t production ready – far from it – but it’s a sign that Kafka has now matured to be a truly enterprise-scale project. Congratulations to all the Kafka team for this great achievement.

We look forward to working more with this great software – and if you need help with your Kafka project do get in touch!

The post Worth the wait – Apache Kafka hits 1.0 release appeared first on Flax.

Elastic London Meetup: Rightmove & Signal Media and a new free security plugin for Elasticsearch
Thu, 28 Sep 2017 – http://www.flax.co.uk/blog/2017/09/28/elastic-london-meetup-rightmove-signal-media-new-free-security-plugin-elasticsearch/

I finally made it to a London Elastic Meetup again after missing a few of the recent events: this time Rightmove were the hosts and the first speakers. They described how they had used Elasticsearch Percolator to run 3.5 million stored searches on new property listings as part of an overall migration from the Exalead search engine and Oracle database to a new stack based on Elasticsearch, Apache Kafka and CouchDB. After creating a proof-of-concept system on Amazon’s cloud they discovered that simply running all 3.5m Percolator queries every time a new property appeared would be too slow, and thus implemented a series of filters to cut down the number of queries applied, including filtering out rental properties and those in the wrong location. They are now running around 40m saved searches per day and also plan to upgrade from their current Elasticsearch 2.4 system to the newer version 5, as well as carry out further performance improvements. After the talk I chatted to the presenter George Theofanous about our work for Bloomberg using our own library Luwak, which could be a way for Rightmove to run stored searches much more efficiently.
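
A crude sketch of the kind of pre-filtering Rightmove described (hypothetical field names, nothing like their production code): each stored search carries cheap metadata that can rule it out before the expensive Percolator step.

```python
def candidate_searches(searches, listing):
    """Cheap metadata filter applied before expensive query matching:
    discard stored searches that cannot possibly match this listing."""
    return [
        s for s in searches
        if s["rental"] == listing["rental"]
        and listing["location"] in s["locations"]
    ]
```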

Next up was Signal Media, describing how they built an automated system for upgrading Elasticsearch after their cluster grew to over 60 nodes (they ingest a million articles a day, and up to May 2016 were running on Elasticsearch 1.5, which had a number of issues with stability and performance). To avoid having to completely shut down and upgrade their cluster, Joachim Draeger described how they carried out major version upgrades by creating a new, parallel cluster (he named this the ‘blue/green’ method), with their indexing pipeline supplying both clusters and their UI code being gradually switched over to the new cluster once stability and performance were verified. This process has cut their cluster to only 23 nodes with a 50% cost saving and many performance and stability benefits. For ongoing minor version changes they have built an automated rolling upgrade system using two Amazon EBS volumes for each node (one holds the system, and is simply switched off as a node is disabled; the other holds the data, and is re-attached to a new node once it is created with the upgraded Elasticsearch machine image). With careful monitoring of cluster stability and (of course) testing, this system enables them to upgrade their entire production cluster in a safe and reliable way without affecting their customers.
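
One simple way to implement the gradual switch-over Joachim described is a weighted router in front of the two clusters – a sketch only, not Signal's actual mechanism:

```python
import random

def pick_cluster(green_fraction, rng=random.random):
    """Route a search request to the old ('blue') or upgraded ('green')
    cluster; raising green_fraction from 0.0 to 1.0 completes the cutover."""
    return "green" if rng() < green_fraction else "blue"
```

At `green_fraction=0.0` all traffic stays on the old cluster; at `1.0` the cutover is complete, and anything in between lets you verify stability under a partial load.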

After the talks I announced the Search Industry Awards I’ll be helping to judge in November (please apply if you have a suitable search project or innovation!) and then spoke to Simone Scarduzio about his free Elasticsearch and Kibana security plugin, a great alternative to the Elastic X-Pack (only available to Elastic subscription customers). We’ll certainly be taking a deeper look at this plugin for our own clients.

Thanks again to Yann Cluchey for organising the event and all the speakers and hosts.

The post Elastic London Meetup: Rightmove & Signal Media and a new free security plugin for Elasticsearch appeared first on Flax.

A fabulous FactHack for Full Fact
Fri, 27 Jan 2017 – http://www.flax.co.uk/blog/2017/01/27/fabulous-facthack-full-fact/

Last week we ran a hackday for Full Fact, hosted by Facebook in their London office. We had planned to gather a room full of search experts from our London Lucene/Solr Meetup, and around twenty people attended from a range of companies including Bloomberg, Alfresco and the European Bioinformatics Institute, among them a number of Lucene/Solr committers.

Mevan Babakar of Full Fact has already written a detailed review of the day, but to summarise we worked on three areas:

  • Building a web service around our Luwak stored query engine, to give it an easy-to-use API. We now have an early version of this which allows Full Fact to check claims they have previously fact checked against a stream of incoming data (e.g. subtitles or transcripts of political events).
  • Creating a way to extract numbers from text and turn them into a consistent form (e.g. ‘eleven percent’, ‘11%’, ‘0.11’) so that we can use range queries more easily – Derek Jones’ team researched existing solutions and he has blogged about what they achieved.
  • Investigating how to use natural language processing to identify parts of speech and tag them in a Lucene index using synonyms and token stacking, to allow for queries such as ‘<noun> is rising’ to match text like ‘crime is rising’ – the team forked Lucene/Solr to experiment with this.
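
For the second of these, a minimal sketch of number normalisation (handling only small numbers and percentages – the real problem, as Derek's team found, is far bigger than this):

```python
# Toy word-to-value table; a real solution needs full number grammar.
WORD_VALUES = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12,
}

def normalise_number(text):
    """Normalise 'eleven percent', '11%' and '0.11' to a single float
    so that range queries can compare them."""
    t = text.lower().strip()
    percent = t.endswith("%") or t.endswith("percent") or t.endswith("per cent")
    for suffix in ("%", "percent", "per cent"):
        if t.endswith(suffix):
            t = t[: -len(suffix)].strip()
    value = WORD_VALUES.get(t)
    if value is None:
        value = float(t)
    return value / 100 if percent else float(value)
```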

We’re hoping to build on these achievements to continue to support Full Fact as they develop open source automated fact checking tools for both their own operations and for other fact checking organisations across the world (there were fact checkers from Argentina and Africa attending to give us an international perspective). Our thanks to all of those who contributed.

I’ve also introduced Full Fact to many others within the search and text analytics community and we would welcome further contributions from anyone who can lend their expertise and time – get in touch if you can help. This is only the beginning!

The post A fabulous FactHack for Full Fact appeared first on Flax.

Out with the old – and in with the new Lucene query parser?
Fri, 13 May 2016 – http://www.flax.co.uk/blog/2016/05/13/old-new-query-parser/

Over the years we’ve dealt with quite a few migration projects where the query syntax of the client’s existing search engine must be preserved. This might be because other systems (or users) depend on it, or a large number of stored expressions exist and it is difficult or uneconomic to translate them all by hand. Our usual approach is to write a query parser, which understands the current syntax but creates a query suitable for a modern open source search engine based on Apache Lucene. We’ve done this for legacy engines including dtSearch and Verity and also for in-house query languages developed by clients themselves. This allows you to keep the existing syntax but improve performance, scalability and accuracy of your search engine.

There are a few points to note during this process:

  • What appears to be a simple query in your current language may not translate to a simple Lucene query, which may lead to performance issues if you are not careful. Wildcards for example can be very expensive to process.
  • You cannot guarantee that the new search system will return exactly the same results, in the same order, as the old one, no matter how carefully the query parser is designed. After all, the underlying search engine algorithms are different.
  • Some element of manual translation may be necessary for particularly large, complex or unusual queries, especially if the original intention of the person who wrote the query is unclear.
  • You may want to create a vendor-neutral query language as an intermediate step – so you can migrate more easily next time. We’ve done this for Danish media monitors Infomedia.
  • If you have particularly large and/or complex queries that may have been added to incrementally over time, they may contain errors or logical inconsistencies – which your current engine may not be telling you about! If you find these you have two choices: fix the query expression (which may then give you slightly different results) or make the new system give the same (incorrect) results as before.
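
To make the first point concrete, here is a toy fragment of such a translation (a regex sketch, not one of our production parsers): it rewrites a dtSearch-style proximity operator into Lucene's phrase-slop syntax. Even this tiny example hides a semantic mismatch – `w/N` and Lucene slop may not count word distance identically – which is exactly why results can differ.

```python
import re

# dtSearch-style proximity: 'apple w/5 pear' -> Lucene '"apple pear"~5'
PROXIMITY = re.compile(r"(\w+)\s+w/(\d+)\s+(\w+)", re.IGNORECASE)

def translate_proximity(query):
    """Rewrite legacy proximity expressions into Lucene phrase-slop syntax;
    anything unrecognised passes through unchanged."""
    return PROXIMITY.sub(lambda m: f'"{m.group(1)} {m.group(3)}"~{m.group(2)}', query)
```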

To mitigate these issues it is important to decide on a test set of queries and expected results, and what level of ‘correctness’ is required – bearing in mind 100% is going to be difficult if not impossible. If you are dealing with languages outside the experience of the team you should also make sure you have access to a native speaker – so you can be sure that results really are relevant!

Do let us know if you’re planning this kind of migration and how we can help – building Lucene query parsers is not a simple task and some past experience can be invaluable.

The post Out with the old – and in with the new Lucene query parser? appeared first on Flax.

Helping Bloomberg build a real-time news search engine with Luwak
Tue, 08 Mar 2016 – http://www.flax.co.uk/blog/2016/03/08/helping-bloomberg-build-real-time-news-search-engine/

Bloomberg is one of the world’s leading providers of financial news via the Bloomberg Terminal, an almost ubiquitous presence on the desks of finance professionals. As you might expect, their systems depend heavily on effective search, and over the last few years they have become increasingly involved in the open source community, sponsoring events such as Lucene Revolution and also helping me to run (and often hosting) the London Lucene/Solr Meetup. They also now employ no fewer than three Apache Solr committers and have contributed features including an XML query parser and an analytics component.

The scale of Bloomberg’s systems is significant: 320,000 subscribers who carry out 8 million searches every day against an archive of 400 million stories. A million new stories are published every day, and in the financial sector response time is paramount, so they want new stories available within 100 milliseconds.

One component of their platform is a large scale news alerting framework, handling around 1.5 million stored searches created both internally and by their subscribers. Some of these stored searches are highly complex Boolean expressions. As part of a migration away from a commercial solution, they have recently built a new alerting system based on the open source Luwak library we developed for media monitoring applications.
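
The 'inverted' search problem an alerting system solves can be illustrated naively (a Python toy that bears no resemblance to Luwak's actual Java internals – every stored query is checked against every document, which is precisely the brute force Luwak avoids):

```python
class NaiveMonitor:
    """Stored-query matching, the slow way: register queries up front,
    then evaluate every one of them against each incoming document."""

    def __init__(self):
        self.queries = {}  # query id -> set of required terms

    def register(self, query_id, required_terms):
        self.queries[query_id] = {t.lower() for t in required_terms}

    def match(self, document_text):
        tokens = set(document_text.lower().split())
        return sorted(qid for qid, terms in self.queries.items()
                      if terms <= tokens)
```

With 1.5 million stored searches and a million stories a day, this per-document loop is exactly what becomes infeasible, hence the need for presearching.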

Initially, Luwak depended on a (rather large) patch for Lucene to add positional information to the index, but Bloomberg kindly funded the integration of this into trunk Lucene and as of version 5.3 it is part of the main release. We’ve also been working with them to develop and tune Luwak’s capabilities to address their performance and accuracy requirements.

Daniel Collins, who has led the alerting system development, recently talked in New York on the use of Luwak in their alerting system; you can watch the video of his talk and read a short article covering their journey. He writes:

Corporate technology has become highly complex. At the lower levels of the stack, innovators know that proprietary software can cause more problems than it solves. A lot of companies are deciding they can’t sit behind closed doors any more, and they need to get more involved in open source.

We’re very grateful for Bloomberg’s support of the Luwak project and we are continuing to develop it – do let us know if you would like to know more about how to use it in your application.

The post Helping Bloomberg build a real-time news search engine with Luwak appeared first on Flax.

London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg
Wed, 16 Dec 2015 – http://www.flax.co.uk/blog/2015/12/16/london-text-analytics-meetup-making-sense-text-lumi-signal-bloomberg/

This month’s London Text Analytics Meetup, hosted by Bloomberg in their spectacular Finsbury Square offices, was only the second such event this year, but crammed in three great talks and attracted a wide range of people from both academia and business. We started with Gabriella Kazai of Lumi, talking about how they have built a crowd-curated content platform for around 80,000 users whose interests and recommendations are mined so as to recommend content to others. Using Elasticsearch as a base, the system ingests around 100 million tweets a day and follows links to any quoted content, which is then filtered and analyzed using a variety of techniques including NLP and NER to produce a content pool of around 60,000 articles. I’ve been aware of Lumi since our ex-colleague Richard Boulton worked there but it was good to understand more about their software stack.

Next was Miguel Martinez-Alvarez of Signal, who are also dealing with a huge amount of data on a daily basis – over a million documents a day from over 100,000 sources, plus millions of blogs. Their ambition is to analyse “all the world’s news” and allow their users to create complex queries over this – “all startups in London working on Machine Learning” being one example. Their challenges include dealing with around two-thirds of their ingested news articles being duplicates (due to syndicated content, for example), and they have built a highly scalable platform, again with Elasticsearch as a major part. Miguel talked in particular about how Signal work closely with academic researchers (including Professor Udo Kruschwitz of the University of Essex, with whom I will be collaborating next year) to develop cutting-edge analytics, with an Agile Data Science approach that includes some key evaluation questions, e.g. Will it scale? Will the accuracy gain be worth the extra computing power?
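
Syndicated-content deduplication usually comes down to a document-similarity measure; here is a minimal word-shingle Jaccard sketch (production systems use scalable approximations such as MinHash or SimHash, and this is not Signal's implementation):

```python
def shingles(text, k=3):
    """The set of overlapping k-word sequences in a text."""
    words = text.lower().split()
    if len(words) <= k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Shingle-set overlap: 1.0 for identical texts, 0.0 for disjoint ones."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

Pairs scoring above a chosen threshold (say 0.8) would be treated as near-duplicates and collapsed before indexing.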

Our last talk was from Miles Osborne of our hosts Bloomberg, who have recently signed a deal with Twitter to be able to ingest all past and forthcoming tweets – now that’s Big Data! The object of Miles’ research is to identify tweets that might affect a market and can thus be traded on, as early as possible after an event happens. His team have noticed that these tweets are often well-written (as opposed to the noise and abbreviations in most tweets) and seldom re-tweeted (no point letting your competitors know what you’ve spotted). Dealing with 500m tweets a day, they have developed systems to filter and route tweets into topic streams (which might represent a subject, location or bespoke category) using machine learning. One approach has been to build models using ‘found’ data (i.e. data that Bloomberg already has available) and to pursue a ‘simple is best’ methodology – although one model has 258 million features! Encouragingly, the systems they have built are now ‘good enough’ to react quickly enough to a crisis event that might significantly affect world markets.

We finished with networking, drinks and snacks (amply provided by our generous hosts) and I had a chance to catch up with a few old contacts and friends. Thanks to the organisers for a very interesting evening and the last event of this year for me – see you in 2016!

The post London Text Analytics Meetup – Making sense of text with Lumi, Signal & Bloomberg appeared first on Flax.

Out and about in search & monitoring – Autumn 2015
Wed, 16 Dec 2015 – http://www.flax.co.uk/blog/2015/12/16/search-monitoring-autumn-2015/

It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard to scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked-in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.

The week after I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.
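
At its core, a custom price facet is a set of band counters; a simplified sketch follows (illustrative only – the engine we built also had to handle stock information, which this ignores):

```python
def price_facets(products, bands):
    """Count products per price band; an upper bound of None is open-ended,
    e.g. bands = [(0, 100), (100, 500), (500, None)]."""
    counts = {band: 0 for band in bands}
    for product in products:
        price = product["price"]
        for lo, hi in bands:
            if price >= lo and (hi is None or price < hi):
                counts[(lo, hi)] += 1
                break
    return counts
```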

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab‘s research into complex search strategies and Digirati and Synaptica‘s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

The post Out and about in search & monitoring – Autumn 2015 appeared first on Flax.

FIBEP WMIC 2015 – Open source search for media monitoring with Solr
Thu, 19 Nov 2015 – http://www.flax.co.uk/blog/2015/11/19/fibep-wmic-2015-infomedia-upgraded-closed-source-search-engine-fast-scalable-flexible-open-source-platform/

FIBEP WMIC 2015 – How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform from Charlie Hull

The post FIBEP WMIC 2015 – Open source search for media monitoring with Solr appeared first on Flax.

Luwak 1.3.0 released
Tue, 17 Nov 2015 – http://www.flax.co.uk/blog/2015/11/17/luwak-1-3-0-released/

The latest version of Luwak, our open-source streaming query engine, has been released on the Sonatype Nexus repository and will be making its way to Maven Central in the next few hours.  Here’s a summary of the new features and improvements we’ve made:

Batch processing

Inspired by a question raised during our talk at FOSDEM last February, you can now stream documents through the Luwak Monitor in batches, as well as one-at-a-time. This will generally improve your throughput, at the cost of a drop in latency. For example, local benchmarking against a set of 10,000 queries showed an improvement from 10 documents/second to 30 documents/second when the batch size was increased from 1 document to 30 documents; however, processing latency went from ~100ms for the single document to 10 seconds for the larger batch. You’ll need to experiment with batch sizes to find the right balance for your own use.
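
The batching trade-off is easy to picture with a simple grouping helper (a generic Python sketch, not Luwak's API, which is Java):

```python
def batched(stream, size):
    """Group an iterable of documents into batches of at most 'size':
    larger batches raise overall throughput, but each document waits
    longer before its batch is processed."""
    batch = []
    for doc in stream:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```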

Presearcher performance improvements

Luwak speeds up document matching by filtering out queries that we can detect won’t match a given document or batch, a process we call presearching. Profiling revealed that creating the presearcher query was a serious performance bottleneck, particularly for presearchers using the WildcardNGramPresearcherComponent, so this has been largely rewritten in 1.3.0. We’ve seen improvements of up to 400% in query build times after this rewrite.
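
The idea behind presearching can be sketched in a few lines (a term-disjunction toy in Python; Luwak's real presearchers handle full Lucene queries, ngrams for wildcards, and much more):

```python
from collections import defaultdict

class NaivePresearcher:
    """Index stored queries by their terms, then use an incoming
    document's own tokens to select only the queries sharing at least
    one term with it; the full queries are run on just that subset."""

    def __init__(self):
        self.term_index = defaultdict(set)

    def register(self, query_id, terms):
        for term in terms:
            self.term_index[term.lower()].add(query_id)

    def candidates(self, document_text):
        hits = set()
        for token in set(document_text.lower().split()):
            hits |= self.term_index.get(token, set())
        return hits
```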

Concurrent query loading

Luwak now ships with a ConcurrentQueryLoader helper class to help speed up Monitor startup. The loader uses multiple threads to add queries to the index, allowing you to make use of all your CPUs when parsing and analyzing queries. Note that this requires your MonitorQueryParser implementations to be thread-safe!
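
In spirit, the loader does something like the following (a generic Python sketch of parallel parse-then-register, not the Java class's actual interface):

```python
from concurrent.futures import ThreadPoolExecutor

def load_queries(parse, raw_queries, workers=4):
    """Parse raw stored queries on a thread pool before registering them;
    as with ConcurrentQueryLoader, 'parse' must be thread-safe."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse, raw_queries))
```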

Easier configuration and state monitoring

In 1.2.0 and earlier, clients had to extend the Monitor itself in order to configure the internal query caches or get state update information. Configuration has now been extracted into a QueryIndexConfiguration class, passed to the Monitor at construction, and you can get notified about updates to the query index by registering QueryIndexUpdateListeners.

For more information, see the CHANGES for 1.3.0. We’ll also be re-running the comparison with Elasticsearch Percolator soon, as this has also been improved as part of Elasticsearch’s recent 2.0 release.

The post Luwak 1.3.0 released appeared first on Flax.
