luwak – Flax

Search Solutions 2017 review

Charlie Hull — Thu, 14 Dec 2017 15:33:19 +0000

Search Solutions is one of my favourite search events of the year – small, focused and varied, with presentations from both the largest and smallest players in the world of search, drawn from both industry and academia.

This year’s event started with Edgar Meij of Bloomberg, who Flax have helped in the past with their large-scale search and alerting systems. I’d seen most of the details in this talk before so I won’t dwell on them but will thank Bloomberg again for their commitment and contributions to the open source community, particularly to Solr and our Luwak stored search library. Mark Fea of LexisNexis was up next with a talk about taxonomies and how they have built a semi-automated classification system combining supervised machine learning and Boolean rules-based systems: a pragmatic approach to combine the strengths of both approaches as machine learning isn’t always as clever as one might want, and Boolean rules can be hard to build and maintain. Like Bloomberg they are working at large scale: Mark mentioned taxonomies of 21,000 terms and 9 levels, applied to over 1 billion documents.

Mark Harwood of Elastic was up next with one of his always fascinating talks on discovering unknown patterns in data with Elasticsearch. He showed how he had explored ‘toxic’ content (far-right music and those who like it) and fake reviews on Amazon with some great visual demonstrations. An interesting conclusion was how ‘bad actors’ make strange, recognisable shapes in visualised data. [Mark later won the Best Presentation award, richly deserved!]. Anna Kolliakou of King’s College London spoke next on ‘veracity intelligence’ tools to help monitor terms connected to mental health across news media and social networks: an interesting example was ‘mephedrone’ around the time of reclassification of this particular recreational drug. Next up was independent consultant Phil Bradley with a detailed, well-researched and passionate talk on fake news and how one cannot trust any web search engine to present the full picture. Phil is obviously extremely concerned about this issue and his talk spurred discussion amongst the audience about how user education is essential to counter the usual viewpoint of ‘it’s on Google, it must be true’.

Coincidentally, Filip Radlinski of Google started the next session, describing a model for conversational information retrieval. He spoke about how the user and IR system reveal information about themselves as the conversation progresses, how the system may need a memory of past interactions and how it may present a set of potential answers. This is a useful model for the future, although most current ‘conversational’ systems are simplistic. Fabrizio Silvestri then spoke on the various types of search Facebook provides, mostly related to finding people but also images, video and news. He explained how every search operation needs to consider privacy and how Facebook use query rewriting to expand and enhance the terms provided by the user. Nicola Cancedda of Microsoft was next with a talk on automated query extraction from emails, to help the user find and attach relevant documents in response (for example, after a colleague asks ‘can you send me the cost projections for 2017’). The work involves training machine learning models after extracting candidate terms with high TF/IDF values from the email. [Interestingly this reminded me of work I carried out nearly 20 years ago on an email signature that when clicked would search for content relevant to the email – although this relied on Javascript working in an email client which is rather a security problem!].
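To make the candidate-extraction step a little more concrete, here is a minimal Python sketch of ranking an email's terms by TF/IDF against a small background collection. It is only an illustration and not the system described in the talk: the tokeniser, the smoothed IDF formula, the sample documents and the function names are all my own assumptions, and the machine learning stage that would follow is not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def candidate_terms(email, background_docs, top_n=3):
    """Rank the email's terms by TF-IDF against a background collection."""
    doc_tokens = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_tokens)
    term_counts = Counter(tokenize(email))
    scores = {}
    for term, count in term_counts.items():
        doc_freq = sum(1 for tokens in doc_tokens if term in tokens)
        # Smoothed IDF (an assumption) so unseen terms never divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[term] = count * idf
    # Highest score first; ties are broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_n]


# Hypothetical background collection standing in for a user's mailbox.
background = [
    "can you send me the agenda for the meeting",
    "please send me the report for the quarter",
    "can you review the report before the meeting",
]
email = "can you send me the cost projections for 2017"
print(candidate_terms(email, background))
```

Terms that are common across the background collection (‘can’, ‘send’, ‘the’) score low, so the rarer terms from the request float to the top as candidate query terms.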

Last of our scheduled talks was from Mark Stanger of Search Technologies (recently acquired by Accenture) about their work on Elsevier’s DataSearch platform. He described how they developed a Phrase Service that identifies phrases in the user’s query using various methods including acronym detection, dictionary lookup and natural language processing, then expands these phrases as necessary to provide enhanced search. After identifying these key terms they can be boosted appropriately for search (DataSearch itself is based on Solr).

The DataSearch project is impressive, and later on it won the Best Search Project award (I am proud to say I served as part of the judging panel for these awards this year). The other winner, for Most Promising Search Startup, was Search|hub by CXP Commerce Experts GmbH.

We finished with some lightning talks and a brief Fishbowl session, dominated this time by discussions on Fake News and how it affects the world of search technology. Thanks to the BCS IRSG again for a fascinating and enlightening day.

The post Search Solutions 2017 review appeared first on Flax.

Worth the wait – Apache Kafka hits 1.0 release

Charlie Hull — Thu, 02 Nov 2017 09:50:20 +0000

We’ve known about Apache Kafka for several years now – we first encountered it when we developed a prototype streaming Boolean search engine for media monitoring with our own library Luwak. Kafka is a distributed streaming platform with some simple but powerful concepts – everything it deals with is a stream of data (like a messaging system), and streams can be combined for processing and stored reliably in a highly fault-tolerant way. It’s also massively scalable.

For search applications, Kafka is a great choice for the ‘wiring’ between source data (databases, crawlers, flat files, feeds) and the search index and other parts of the system. We’ve used other message passing systems (like RabbitMQ) in projects before, but none have the simplicity and power of Kafka. Combine the search index with analysis and visualisation tools such as Kibana and you can build scalable, real-time systems for ingesting, storing, searching and analysing huge volumes of data – for example, we’ve already done this for clients in the financial sector wanting to monitor log data using open-source technology, rather than commercial tools such as Splunk.

The development of Kafka has been masterminded by our partners Confluent, and it’s a testament to this careful management that the milestone 1.0 version has only just appeared. This doesn’t mean that previous versions weren’t production ready – far from it – but it’s a sign that Kafka has now matured to be a truly enterprise-scale project. Congratulations to all the Kafka team for this great achievement.

We look forward to working more with this great software – and if you need help with your Kafka project do get in touch!

Elastic London Meetup: Rightmove & Signal Media and a new free security plugin for Elasticsearch

Charlie Hull — Thu, 28 Sep 2017 08:44:26 +0000

I finally made it to a London Elastic Meetup again after missing a few of the recent events: this time Rightmove were the hosts and the first speakers. They described how they had used Elasticsearch Percolator to run 3.5 million stored searches on new property listings as part of an overall migration from the Exalead search engine and Oracle database to a new stack based on Elasticsearch, Apache Kafka and CouchDB. After creating a proof-of-concept system on Amazon’s cloud they discovered that simply running all 3.5m Percolator queries every time a new property appeared would be too slow and thus implemented a series of filters to cut down the number of queries applied, including filtering out rental properties and those in the wrong location. They are now running around 40m saved searches per day and also plan to upgrade from their current Elasticsearch 2.4 system to the newer version 5, as well as carry out further performance improvements. After the talk I chatted to the presenter George Theofanous about our work for Bloomberg using our own library Luwak, which could be a way for Rightmove to run stored searches much more efficiently.
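The filtering idea is easy to sketch. The Python below is a toy illustration, not Rightmove's code and not the Percolator API: the `SavedSearch` and `Listing` fields, the bucket key and the price check are all assumptions chosen to show how cheap filters can shrink the set of stored searches before the expensive matching step runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavedSearch:
    search_id: int
    listing_type: str  # "sale" or "rent" (hypothetical field)
    region: str
    max_price: int


@dataclass(frozen=True)
class Listing:
    listing_type: str
    region: str
    price: int


def index_searches(searches):
    """Group saved searches by (listing_type, region) so most can be skipped."""
    buckets = {}
    for search in searches:
        key = (search.listing_type, search.region)
        buckets.setdefault(key, []).append(search)
    return buckets


def matching_searches(listing, buckets):
    """Apply the detailed check only to searches that pass the cheap filters."""
    candidates = buckets.get((listing.listing_type, listing.region), [])
    # In a real system this final step would be the stored-query match itself.
    return [s.search_id for s in candidates if listing.price <= s.max_price]


searches = [
    SavedSearch(1, "sale", "london", 500_000),
    SavedSearch(2, "rent", "london", 2_000),
    SavedSearch(3, "sale", "leeds", 300_000),
    SavedSearch(4, "sale", "london", 250_000),
]
buckets = index_searches(searches)
print(matching_searches(Listing("sale", "london", 400_000), buckets))
```

Only the two London sale searches are examined for the new listing; the rental and Leeds searches are never touched.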

Next up was Signal Media, describing how they built an automated system for upgrading Elasticsearch after their cluster grew to over 60 nodes (they ingest a million articles a day and up to May 2016 were running on Elasticsearch 1.5 which had a number of issues with stability and performance). To avoid having to completely shut down and upgrade their cluster, Joachim Draeger described how they carried out major version upgrades by creating a new, parallel cluster (he named this the ‘blue/green’ method), with their indexing pipeline supplying both clusters and their UI code being gradually switched over to the new cluster once stability and performance were verified. This process has cut their cluster to only 23 nodes with a 50% cost saving and many performance and stability benefits. For ongoing minor version changes they have built an automated rolling upgrade system using two Amazon EBS volumes for each node (one is for the system, and is simply switched off as a node is disabled, the other is data and is re-attached to a new node once it is created with the upgraded Elasticsearch machine image). With careful monitoring of cluster stability and (of course) testing, this system enables them to upgrade their entire production cluster in a safe and reliable way without affecting their customers.

After the talks I announced the Search Industry Awards I’ll be helping to judge in November (please apply if you have a suitable search project or innovation!) and then spoke to Simone Scarduzio about his free Elasticsearch and Kibana security plugin, a great alternative to the Elastic X-Pack (only available to Elastic subscription customers). We’ll certainly be taking a deeper look at this plugin for our own clients.

Thanks again to Yann Cluchey for organising the event and all the speakers and hosts.

A fabulous FactHack for Full Fact

Charlie Hull — Fri, 27 Jan 2017 10:49:20 +0000

Last week we ran a hackday for Full Fact, hosted by Facebook in their London office. We had planned to gather a room full of search experts from our London Lucene/Solr Meetup and around twenty people attended from a range of companies including Bloomberg, Alfresco and the European Bioinformatics Institute, including a number of Lucene/Solr committers.

Mevan Babakar of Full Fact has already written a detailed review of the day, but to summarise we worked on three areas:

Building a web service around our Luwak stored query engine, to give it an easy-to-use API. We now have an early version of this which allows Full Fact to check claims they have previously fact checked against a stream of incoming data (e.g. subtitles or transcripts of political events).
Creating a way to extract numbers from text and turn them into a consistent form (e.g. ‘eleven percent’, ‘11%’, ‘0.11’) so that we can use range queries more easily – Derek Jones’ team researched existing solutions and he has blogged about what they achieved.
Investigating how to use natural language processing to identify parts of speech and tag them in a Lucene index using synonyms and token stacking, to allow for queries such as ‘[noun] is rising’ to match text like ‘crime is rising’ – the team forked Lucene/Solr to experiment with this.
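As an illustration of the second area, here is a small Python sketch that turns a few ways of writing a percentage into one consistent fraction. It is not the code written at the hackday: the function names, the supported phrasings and the decision to treat a bare decimal below one as an already-normalised fraction are all assumptions, and it only understands number words below one hundred.

```python
import re

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(words):
    """Convert number words below one hundred (e.g. 'twenty-one') to an int."""
    total = 0
    for word in re.split(r"[\s-]+", words.strip().lower()):
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        else:
            raise ValueError(f"unrecognised number word: {word!r}")
    return total


def normalise_percentage(text):
    """Return the fraction expressed by text (0.11 for '11%'), or None."""
    text = text.strip().lower()
    # Digits followed by a percent marker, e.g. '11%' or '11 per cent'.
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(%|percent|per cent)", text)
    if match:
        return float(match.group(1)) / 100
    # Number words followed by a percent marker, e.g. 'eleven percent'.
    match = re.fullmatch(r"([a-z\s-]+?)\s+(percent|per cent)", text)
    if match:
        try:
            return words_to_number(match.group(1)) / 100
        except ValueError:
            return None
    # Assumption: a bare decimal below one is already a fraction.
    if re.fullmatch(r"0?\.\d+", text):
        return float(text)
    return None


for sample in ["eleven percent", "11%", "0.11"]:
    print(sample, "->", normalise_percentage(sample))
```

Once every mention is stored as the same numeric value, a range query such as ‘between 0.10 and 0.15’ can match all three spellings.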

We’re hoping to build on these achievements to continue to support Full Fact as they develop open source automated fact checking tools for both their own operations and for other fact checking organisations across the world (there were fact checkers from Argentina and Africa attending to give us an international perspective). Our thanks to all of those who contributed.

I’ve also introduced Full Fact to many others within the search and text analytics community and we would welcome further contributions from anyone who can lend their expertise and time – get in touch if you can help. This is only the beginning!

Meetup at Big Data London – One-click Solr & Factchecking with Solr

Charlie Hull — Thu, 10 Nov 2016 11:22:26 +0000

Last week I spoke at the Big Data London conference, a very busy event with several thousand people attending. My session was on using open source search to make sense of Big Data – you can get slides here.

In the evening we ran another Lucene/Solr London Usergroup event with speakers Upayavira and Full Fact. After a brief but friendly fight with the Datastax team over pizza we settled down to see Upayavira show us his method for creating a fully functional SolrCloud stack and search application with a single command line using tools such as Docker, Rancher and Exhibitor. Upayavira’s system only needs to be given details of an Amazon Web Services cloud hosting account and it will create host instances, install and start Zookeeper, wait for a quorum to be established, install and start Solr and create a SolrCloud cluster and finally install and start a search application. The whole thing is managed by his own script Uberstack and is undeniably impressive.

Our second talk (and I think my favourite talk from all our Solr Meetups) was from Will Moy and Mevan Babakar of Full Fact, a charity who monitor the news for accuracy (something we increasingly require in these ‘post-truth’ days). Will told us how false and misleading claims can be amplified by the media and may end up directly influencing government policy, even though the underlying facts are wrong. Full Fact are attempting to build open source, freely available systems for automating the factchecking process using Apache Lucene/Solr and our own stored query library Luwak, and Flax have been donating some time to help them with this process. Their Hawk system currently indexes over 70 million sentences. This project is a wonderful example of how free, open source software can be used to create tools that benefit us all and at the end of this inspiring talk many of the audience offered ideas and even direct assistance with the project. I urge you to read Full Fact’s recent report on automated factchecking and get involved if you can. One idea was to run a Hackday for Full Fact – more details when we have them.

Thanks to Big Data London for inviting me to speak and hosting the Meetup and to Elsevier for sponsoring pizza and drinks. We’ll be back with another Meetup soon!

Apache Kafka London Meetup – Real time search and insights

Charlie Hull — Thu, 14 Apr 2016 09:50:05 +0000

The rise of Apache Kafka as a streaming data solution is something we’ve been watching for a while – as part of a collection of Big Data tools, it provides a ‘TiVo for data’ feature. We’ve begun to use it in client projects covering both search and log analysis and we’ve recently partnered with Confluent, founded by the creators of Kafka.

Last night we spoke at the Apache Kafka London Meetup – hosted by British Gas Connected Homes, it was well supplied with drinks, pizza and snacks and also very well attended – there was a great buzz of conversation before the talks had even started! Alan Woodward of Flax started with an updated talk about our proof-of-concept integration of Kafka, Apache Samza and our own Luwak streaming search library (slides are available here). This allows full-text search within a Kafka stream, with the search queries supplied as another stream, for a truly real-time solution – as opposed to the more usual (and much higher latency) approach of indexing the endpoint of a stream. Alan has also tried the very new Kafka Streams feature which can be used as an alternative to Apache Samza – there is some very early code available, although note that this still needs some work! (We’ll update this blog when it’s finished).
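The idea of treating queries as just another stream can be shown with a toy Python sketch. This is not the Kafka, Samza or Luwak code from the talk: a plain list stands in for the interleaved topics, the event tuple layout is an assumption, and a ‘query’ here is simply a set of terms that must all appear in the document.

```python
def run_stream(events):
    """Process an interleaved stream of query registrations and documents.

    Each event is ("query", query_id, terms) or ("doc", doc_id, text).
    Yields (doc_id, query_id) for every query registered so far whose
    terms all appear in the document.
    """
    queries = {}
    for kind, key, payload in events:
        if kind == "query":
            # A new or updated query takes effect for all later documents.
            queries[key] = set(payload.lower().split())
        elif kind == "doc":
            tokens = set(payload.lower().split())
            for query_id, terms in queries.items():
                if terms <= tokens:
                    yield key, query_id


# Hypothetical interleaved stream of queries and documents.
events = [
    ("query", "q1", "crime rising"),
    ("doc", "d1", "Crime is rising again"),
    ("query", "q2", "kafka"),
    ("doc", "d2", "kafka makes crime reporting easier"),
    ("doc", "d3", "crime is rising and kafka is streaming"),
]
for doc_id, query_id in run_stream(events):
    print(doc_id, "matched", query_id)
```

Each document is matched as it arrives, against whatever queries have been registered by that point, which is the property that makes the approach real-time rather than index-then-search.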

The second talk was by one of our hosts, Josep Casals, on how British Gas have used Kafka, Spark Streaming and Apache Cassandra to build a platform for analyzing data from smart meters, boilers and thermostats. Over 2 million smart meters are installed across the UK and there are also over 300,000 connected thermostats, plus many other data sources, and these devices can report every 30 minutes and 2 minutes respectively, so their system has to cope with around 30,000 messages/second. One interesting feature for me was how machine learning is used to disaggregate power consumption data, so the consumption for, say, a fridge can be split out from the overall figure. Apache Samza is also used in this system to provide estimates of consumption and interpolate between readings, allowing data to be fed back to an app on the customer’s mobile device. Further use cases include spotting outlier events, which might indicate failing heating devices or even unusual patterns in an elderly person’s home to alert relatives or carers.

Both talks were live streamed and you can watch them here.

We concluded with some informal discussion and a chance to meet some of Confluent’s UK-based team. Thanks to the organisers and hosts and we look forward to returning! If you have a Kafka project and you’d like any help or advice, do let us know.

Helping Bloomberg build a real-time news search engine with Luwak

Charlie Hull — Tue, 08 Mar 2016 11:13:36 +0000

Bloomberg is one of the world’s leading providers of financial news via the Bloomberg Terminal, an almost ubiquitous presence on the desks of finance professionals. As you might expect, their systems depend heavily on effective search and over the last few years they have become increasingly involved in the open source community, sponsoring events such as Lucene Revolution and also helping me to run (and often hosting) the London Lucene/Solr Meetup. They also now employ no fewer than three Apache Solr committers and have contributed features including an XML query parser and analytics component.

The scale of Bloomberg’s systems is significant: 320,000 subscribers who carry out 8 million searches every day of an archive of 400 million stories. A million new stories are published every day and in the financial sector response time is paramount, so they want new stories available within 100 milliseconds.

One component of their platform is a large scale news alerting framework, handling around 1.5 million stored searches created both internally and by their subscribers. Some of these stored searches are highly complex Boolean expressions. As part of a migration away from a commercial solution, they have recently built a new alerting system based on the open source Luwak library we developed for media monitoring applications.

Initially, Luwak depended on a (rather large) patch for Lucene to add positional information to the index, but Bloomberg kindly funded the integration of this into trunk Lucene and as of version 5.3 it is part of the main release. We’ve also been working with them to develop and tune Luwak’s capabilities to address their performance and accuracy requirements.

Daniel Collins, who has led the alerting system development, recently talked in New York on the use of Luwak in their alerting system and you can watch the video of his talk and read a short article covering their journey. He writes:

“Corporate technology has become highly complex. At the lower levels of the stack, innovators know that proprietary software can cause more problems than it solves. A lot of companies are deciding they can’t sit behind closed doors any more, and they need to get more involved in open source.”

We’re very grateful for Bloomberg’s support of the Luwak project and we are continuing to develop it – do let us know if you would like to know more about how to use it in your application.

FIBEP WMIC 2015 – Open source search for media monitoring with Solr

Charlie Hull — Thu, 19 Nov 2015 16:23:46 +0000

FIBEP WMIC 2015 – How Infomedia upgraded their closed-source search engine to a fast, scalable and flexible open-source platform from Charlie Hull

Luwak 1.3.0 released

Alan Woodward — Tue, 17 Nov 2015 14:13:42 +0000

The latest version of Luwak, our open-source streaming query engine, has been released on the Sonatype Nexus repository and will be making its way to Maven Central in the next few hours. Here’s a summary of the new features and improvements we’ve made:

Batch processing

Inspired by a question raised during our talk at FOSDEM last February, you can now stream documents through the Luwak Monitor in batches, as well as one-at-a-time. This will generally improve your throughput, at the cost of higher latency. For example, local benchmarking against a set of 10,000 queries showed an improvement from 10 documents/second to 30 documents/second when the batch size was increased from 1 document to 30 documents; however, processing latency went from ~100ms for the single document to 10 seconds for the larger batch. You’ll need to experiment with batch sizes to find the right balance for your own use.
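If your documents arrive as a stream, one simple way to experiment with batch sizes is to group them before handing them to the matcher. The Python helper below is a generic sketch and not part of Luwak (which is a Java library); the name `batched` and the idea of a downstream matcher that accepts a list are assumptions for illustration.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items from `iterable`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Hypothetical document stream: seven documents grouped into batches of three.
documents = [f"doc-{n}" for n in range(7)]
for batch in batched(documents, 3):
    # A real application would pass each batch to the matcher here.
    print(len(batch), batch)
```

Note that the first document in a batch has to wait until the batch is full (or the stream ends) before it is processed, which is one reason larger batches cost you latency.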

Presearcher performance improvements

Luwak speeds up document matching by filtering out queries that we can detect won’t match a given document or batch, a process we call presearching. Profiling revealed that creating the presearcher query was a serious performance bottleneck, particularly for presearchers using the WildcardNGramPresearcherComponent, so this has been largely rewritten in 1.3.0. We’ve seen improvements of up to 400% in query build times after this rewrite.
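For readers new to the idea, here is a deliberately simplified Python sketch of presearching. It is not Luwak's implementation or API: the class and method names are made up, and a query is reduced to a set of terms that must all be present, so that indexing each query under a single required term is enough to rule out queries that cannot match.

```python
class TermPresearcher:
    """Index conjunctive queries by one required term to skip non-matches."""

    def __init__(self):
        self.queries = {}  # query_id -> set of required terms
        self.by_term = {}  # indexed term -> set of query_ids

    def add_query(self, query_id, terms):
        terms = {term.lower() for term in terms}
        if not terms:
            raise ValueError("a query needs at least one term")
        self.queries[query_id] = terms
        # Every term is required, so if the chosen one is missing from a
        # document the query cannot match and need not be run at all.
        self.by_term.setdefault(min(terms), set()).add(query_id)

    def candidates(self, doc_tokens):
        """Return ids of queries whose indexed term occurs in the document."""
        selected = set()
        for token in doc_tokens:
            selected |= self.by_term.get(token, set())
        return selected

    def match(self, text):
        """Return (matching query ids, number of queries actually run)."""
        tokens = set(text.lower().split())
        selected = self.candidates(tokens)
        matches = sorted(q for q in selected if self.queries[q] <= tokens)
        return matches, len(selected)


presearcher = TermPresearcher()
presearcher.add_query("q1", ["crime", "rising"])
presearcher.add_query("q2", ["kafka", "streams"])
presearcher.add_query("q3", ["crime", "falling"])
print(presearcher.match("Crime is rising in the city"))
```

Here only two of the three stored queries are run against the document, and one of them matches; with many thousands of stored queries the saving is what makes streaming matching practical.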

Concurrent query loading

Luwak now ships with a ConcurrentQueryLoader helper class to help speed up Monitor startup. The loader uses multiple threads to add queries to the index, allowing you to make use of all your CPUs when parsing and analyzing queries. Note that this requires your MonitorQueryParser implementations to be thread-safe!

Easier configuration and state monitoring

In 1.2.0 and earlier, clients had to extend the Monitor itself in order to configure the internal query caches or get state update information. Configuration has now been extracted into a QueryIndexConfiguration class, passed to the Monitor at construction, and you can get notified about updates to the query index by registering QueryIndexUpdateListeners.

For more information, see the CHANGES for 1.3.0. We’ll also be re-running the comparison with Elasticsearch Percolator soon, as this has also been improved as part of Elasticsearch’s recent 2.0 release.

Enterprise Search Europe 2015: Fishing the big data streams – the future of search

Charlie Hull — Wed, 28 Oct 2015 12:09:52 +0000

Enterprise Search Europe 2015: Fishing the big data streams – the future of search from Charlie Hull
