analytics – Flax

Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018

Charlie Hull — Mon, 18 Jun 2018 13:53:27 +0000

I spent last week in a sunny Berlin for the Berlin Buzzwords event (and subsequently MICES 2018, of which more later). This was my first visit to Buzzwords which was held in an arts & culture complex in an old brewery north of the city centre. The event was larger than I was expecting at around 550 people with three main tracks of talks. Although due to some external meetings I didn’t attend as many talks as I would have liked, here are a few highlights. Many of the talks have slides provided and some are now also available on the Buzzwords Youtube channel.

Giovanni Fernandez-Kincade talked about query understanding to improve both recall and precision for searches. He made the point that users and documents often speak very different languages which can lead to a lack of confidence in the search engine. Various techniques are available to attempt to translate the user’s intention into a suitable query and these can be placed on a spectrum from human-powered (e.g. creating an exception list to prevent stemming of proper nouns) to some degree of automation (e.g. harvesting data to build lists of synonyms) to fully automation (machine learning of how queries map to documents). Obviously these also fit on other scales from labour-intensive to hands-off and easy to hard in terms of the technology skills required. This talk gave a solid base understanding of the techniques available.

I dropped in on Suneel Marthi’s talk on detecting tulip fields from satellite images, which was fascinating although outside my usual area of search engine technology. I then heard Nick Burch describe the many ways that text extraction powered by Apache Tika can crash your JVM or even your entire cluster (potentially expensive in an elastically-scaling situation as more resources are automatically allocated!). As he recommended one should expect failure and plan accordingly, ship logs somewhere central for analysis and never run Tika inside your Solr instance itself in a production system (a recommendation that has finally made it to the Solr Wiki). Doug Turnbull and Tommaso Teofili then spoke on The Neural Search Frontier, a wide-ranging and in some places somewhat speculative discussion of techniques to improve ranking using word embeddings described by multidimensional vectors. This approach combined traditional IR techniques with neural models to learn whether a document is relevant to a query. One fascinating idea was the use of recurrent neural networks, much used in translation applications, to ‘translate’ a document to a predicted query. As with most of Doug’s talks this gave us a lot to think about but he finished with a plea for better native vector support in Lucene-based search engines.

The next talk I heard was from Varun Thacker on Solr autoscaling which I know is a particular concern of some of our clients as their data volumes grow. These new features in Solr version 7 allow policies and preferences to be set up to govern autoscaling behaviour, where shards may be moved and new cores created automatically based on metrics such as disk space or queries-per-second. One interesting line of questioning from the audience was how to avoid replicas from ‘ping ponging’ between hosts – e.g moving from a node with low disk space to one with more disk space, but then causing a reduction in disk space on the target node, leading to another move. Usefully the autoscaling system can be set to compute a list of operations but leave execution to a human operator, which may help prevent this problem.

The next day I attended Tomás Fernández Löbbe’s talk on new replica types in Solr 7, which talked about the advantages of the ‘Master/Slave’ model for search cluster design as opposed to the standard SolrCloud ‘every node does everything’ model. The new replica types PULL and TLOG allow one to build a master/slave setup in SolrCloud, separating responsibility for indexing and searching and even choosing which type of replica to use in queries. I also heard Houston Putman talk about data analytics with Solr, describing how built-in Solr functions can carry out the type of analytics previously only possible with Apache Spark or Hadoop and avoiding the extra cost of shipping data out of Solr. Unfortunately that was the end of my conference due to some other commitments but it was great to catch up with various search people from Europe and further abroad and to enjoy what was a well-organised and interesting event.

The post Highlights of Search, Store, Scale & Stream – Berlin Buzzwords 2018 appeared first on Flax.

Choosing between Elasticsearch and Solr

Charlie Hull — Tue, 15 Mar 2016 15:42:32 +0000

One of the questions we’re asked all the time is which of the two most popular open source search engines is best for a particular use case – and the answer is always ‘it depends’. Broadly speaking, Apache Lucene/Solr and Elasticsearch are very similar in terms of features and performance. If you’ve already chosen one of them, there’s very few reasons to incur the inevitable extra work of switching to the other. However if you’re still not sure which to choose, read on.

Solr, being the older and more mature project, is often chosen by organisations who are comfortable with building and managing enterprise Java applications, already using other Apache open source projects and whose data is generally complex and of many different sizes and types. The queries used may also be quite complex. They like the fact that there are many places to obtain support and development services from and the wide range of documentation, books and articles on Solr. They also like its proven stability at large scale. Some examples from our own client list are Bloomberg, News UK, Reed Specialist Recruitment and Infomedia.

Elasticsearch is often chosen by organisations who are following the latest trends in terms of programming languages and frameworks, aren’t necessarily thinking they need a ‘search engine’ (they may for example be building analytics tools) and are likely to be ingesting a large number of reasonably small items of data. They want a tool that will scale easily and automatically without them having to think about how to do it. Although there isn’t so much documentation on Elasticsearch, and the support and training options are pretty much centred around one company (who also control the development roadmap), they like the fact that it’s quick to get started with. Some examples from our own client list are Arachnys, the Office for National Statistics, Westcoast and Eagle Genomics.

There’s exceptions of course and several of our clients use both Solr and Elasticsearch in different situations and for different purposes. You can build site search, enterprise search, Big Data and analytics systems with either – and we’re happy to offer consulting, training and support for both. If you need advice or help with your choice do get in touch or come along to a Meetup and chat to us in person.

The post Choosing between Elasticsearch and Solr appeared first on Flax.

Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara

Charlie Hull — Thu, 18 Feb 2016 11:42:07 +0000

Last night I dropped in on the Unified Log Meetup at JustEat’s offices (of course, they provided lots of pizza for us all!). I’ve written about this Meetup before – as a rule the events cover logging and analytics at massive scale, with search being only part of the picture.

Joseph Francis from Skyscanner began with a talk about how they’ve developed a streaming data system to replace a monolithic SQL database for reporting and monitoring. Use cases include creating user timelines, data enrichment, JOINs and windowed aggregations and his team aim to provide a system that in-house developers can easily use for all kinds of analytics tasks. The system uses Apache Kafka as a highly scalable pipeline and Apache Samza for stream-based processing, as you can see (hopefully) in this photo of their architecture:
Elasticsearch provides querying capabilities and visualisations using Kibana. Joseph’s team have focused on making the system (and tasks that run on it) easy to deploy and use, with this currently managed using Ansible and TeamCity although they are now moving to a combination of Docker and Drone. As an aside, Skyscanner are also building autosuggest capabilities using Solr.

Next was Bruno Bonacci showing off his analytics system Samsara, inspired by a project to build analytics for Tesco’s HUDL tablet in only six weeks. With this short a timescale, Bruno took a pragmatic approach combining Kafka, Elasticsearch, Kibana and a number of custom components to allow relatively simple – but extremely fast – stream processing. He described how aggregation can either be done at ingestion time (which as you must store all the data you might need in separated chunks can end up taking up huge amounts of storage) or query time (which is far more flexible especially when you don’t yet know what questions you’ll need to answer). His custom processing module, Samsara Core, doesn’t use a built-in database for storing state (as Samza does) but rather uses an in-memory key-value store. For resiliency, this creates a log which is emitted as a Kafka stream. His approach seems to have huge performance implications – he has demonstrated Samsara running on a single core to be 72 times faster than a 4-core Spark Streaming system. Bruno and his team have released Samsara as open source and are working on new processing modules including sentiment analysis and classification. This is a fascinating project and a sign of the increasing need for high-performance streaming analytics. It would be interesting to see if our own work combining our stored query library Luwak with Samza could be combined with Samsara.

Thanks to Alex Dean of Snowplow for organising a very interesting evening and of course, to both the speakers.

The post Unified Log Meetup – Scaling up with Skyscanner, Samza and Samsara appeared first on Flax.

Enterprise Search Europe 2015: Fishing the big data streams – the future of search

Charlie Hull — Wed, 28 Oct 2015 12:09:52 +0000

Enterprise Search Europe 2015: Fishing the big data streams – the future of search from Charlie Hull

The post Enterprise Search Europe 2015: Fishing the big data streams – the future of search appeared first on Flax.

Searching & monitoring the Unified Log

Charlie Hull — Fri, 05 Dec 2014 11:10:25 +0000

This week I dropped into the Unified Log Meetup held at the rather hard to find offices of Just Eat (luckily there was some pizza left). The Unified Log movement is interesting and there’s a forthcoming book on the subject from Snowplow’s Alex Dean – the short version is this is all about massive scale logging of everything a business does in a resilient fashion and the eventual insights one might gain from this data. We’re considering streams of data rather than silos or repositories we usually index here, and I was interested to see how search technology might fit into the mix.

The first talk by Ian Meyers from AWS was about Amazon Kinesis, a hosted platform for durable storage of stream data. Kinesis focuses on durability and massive volume – 1 MB/sec was mentioned as a common input rate, and data is stored across multiple availability zones. The price of this durability is latency (from a HTTP PUT to the associated GET might be as much as three seconds) but you can be pretty sure that your data isn’t going anywhere unexpectedly. Kinesis also allows processing on the data stream and output to more permanent storage such as Amazon S3, or Elasticsearch for indexing. The analytics options allow for counting, bucketing and some filtering using regular expressions, for real-time stream analysis and dashboarding, but nothing particularly advanced from a search point of view.

Next up was Martin Kleppman (taking a sabbatical from LinkedIn and also writing a book) to talk about some open source options for stream handling and processing, Apache Kafka and Apache Samza. Martin’s slides described how LinkedIn handles 7-8 million messages a second using Kafka, which can be thought of an append-only file – to get data out again, you simply start reading from a particular place in the file, with all the reliable storage done for you under the hood. It’s a much simpler system than RabbitMQ which we’ve used on client projects at Flax in the past.

Martin explored how Samza can be used as a stream processing layer on top of Kafka, and even how oft-used databases can be moved into local storage within a Samza process. Interestingly, he described how a database can be expressed simply as a change log, with Kafka’s clever log compaction algorithms making this an efficient way to represent it. He then moved on to describe a prototype integration with our Luwak stored query library, allowing for full-text search within a stream, with the stored queries and matches themselves being of course just more Kafka streams.

It’s going to be interesting to see how this concept develops: the Unified Log movement and stream processing world in general seems to lack this kind of advanced text matching capability, and we’ve already developed Luwak as a highly scalable solution for some of our clients who may need to apply a million stored queries to a million new stories a day. The volumes discussed at the Meetup are a magnitude beyond that of course but we’re pretty confident Luwak and Samza can scale. Watch this space!

The post Searching & monitoring the Unified Log appeared first on Flax.

Convergence and collisions in Enterprise Search

Charlie Hull — Mon, 03 Mar 2014 14:45:25 +0000

At the end of next month I’ll be at Enterprise Search Europe (I’m on the programme committee and help with the open source track) and the opening keynote this year is from Dale Roberts, author of the book Decision Sourcing. Dale will be talking about how Social, Big Data, Analytics and Enterprise Search are on a collision course and business leaders ignore these four themes at their peril.

So I wondered if we could see how in practical terms one might build systems based on these four themes. There are technical and logistical challenges of course (not least convincing someone to pay for the effort) but it’s worth exploring nonetheless.

Social in a business context can mean many things: social media is inherently noisy (and as far as I can see mostly cats) but when social tools are used within a business they can be a great way to encourage collaboration. We ourselves have added social features to search applications – user tagging of search results for example, to improve relevance for future searches and to help with de-duplication. Much has been made of the idea of finding not just relevant documents, but the subject matter experts that may have written them, or just other people in your organisation who are interested in the same subject. From a technical point of view none of this is particularly hard – you just have to add these social signals to your index and surface them in some intuitive way – but getting a high enough percentage of users to contribute to shared discussions and participate in tagging can be difficult.

Big Data is an overused term – but in a business context people usually apply it to very large collections of log files or other data showing how your customers are interacting with your business. A lot of search engine experts will tell you that Big Data isn’t always that ‘big’ – we’ve been dealing with collections of hundreds of millions or even billions of indexed items for many years now, the trick is scaling your solution appropriately (not just in technical terms, but in an economic way, as linearly as possible). If you’ve got a few million items, I’m sorry but you haven’t got Big Data, you’ve just got some data.

I’ve always been unsure of the benefits of search Analytics but I’m beginning to change my mind, having seen a some very impressive demos recently. Search engines have always counted things; the clever bit is allowing for queries that can surface unusual or interesting information, and using modern visualisation techniques to show this. Knowing the most popular search term may not be as important as spotting an unexpected one.

So we’ve indexed our data including tags, personnel records, internal chatrooms; put them all onto a elastically scalable platform and built some intuitive and useful interfaces to search and analyze our data. I’m pretty sure you could do all this with the open source technologies we have today (including Scrapy, Apache Lucene/Solr, Elasticsearch, Apache Hadoop, Redis, Logstash, Kibana, JQuery, Python and Java). This isn’t the whole story though: you’d need a cross-disciplinary team within your organisation with the ability to gather user requirements and drive adoption, a suitable budget for prototyping, development and ongoing support and refinements to the system and a vision encompassing the benefits that it would bring your business. Not an inconsiderable challenge!

What questions should we be able to ask the system? I’ll leave that as an exercise for the reader.

See you in April! If you’d like a 20% discount on registration use the code HULL20. We’ll also be running an evening Meetup on Tuesday 29th April open to both conference attendees and others.

The post Convergence and collisions in Enterprise Search appeared first on Flax.