Flax – The Open Source Search Specialists – http://www.flax.co.uk

Haystack, the search relevance conference – day 2
Posted 23 April 2018 – http://www.flax.co.uk/blog/2018/04/23/haystack-the-search-relevance-conference-day-2/

The post Haystack, the search relevance conference – day 2 appeared first on Flax.

Two weeks ago I attended the Haystack relevance conference – I’ve already written about my overall impressions and about the first day’s talks, but the following are some more notes on the conference sessions. Note that some of the presentations I attended have already been covered in detail by Sujit Pal’s excellent blog. Some of the presentations I haven’t linked to directly have now appeared on the conference website.

The second day of the event started for me with the enjoyable job of hosting a ‘fishbowl’ style panel session titled “No, You Don’t Want to Do It Like That! Stories from the search trenches”. The idea was that a rotating panel of speakers would tell us tales of their worst and hopefully most instructive search tuning experiences, and we heard some great stories – this was by its nature an informal session and I don’t think anyone kept any notes (probably a good idea given the commercial sensitivity of some of the tales!).

The next talk was my favourite of the conference, given by René Kriegler on relevance scoring using product data and image recognition. René is an expert on e-commerce search (he also runs the MICES event in Berlin, which I’m looking forward to) and described how this domain is unlike many others: the interests of the consumer (e.g. price or availability) become part of the relevance criteria. One of the interesting questions for e-commerce applications is how ranking can affect profit. Standard TF/IDF models don’t always work well for e-commerce data with short fields, leading to a score that can be almost binary: as he said, ‘a laptop can’t be more laptop-ish than another’. Image recognition is a potentially useful technique, and he demonstrated a way to take the output of Google’s Inception machine learning model and use it to enrich documents within a search index. However, the vectors output by this model can have over 1,000 dimensions, and he described how a technique called random projection trees can be used to partition the vector space and thus produce simpler data for adding to the index (I think this is basically like slicing up a fruitcake and recording whether a currant was one side of the knife or the other, but that may not be quite how it works!). René has built a Solr plugin to implement this technique.
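As a rough illustration of the idea (this is not René’s actual Solr plugin – the function and parameter names here are invented), a random-projection partition can be sketched by recording which side of a set of random hyperplanes a vector falls on, yielding a compact bit-string token that can be indexed in place of the full vector:

```python
import random

def random_hyperplanes(dim, n_planes, seed=42):
    """Generate random hyperplane normals for partitioning the vector space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def projection_signature(vector, planes):
    """Record which side of each random hyperplane the vector falls on.

    Vectors with similar signatures are likely to be close in the
    original space, so the bit-string makes a cheap index-friendly key.
    """
    bits = []
    for plane in planes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

# Example: reduce a (pretend) 2048-dimensional image embedding to 16 bits
planes = random_hyperplanes(dim=2048, n_planes=16)
embedding = [random.Random(1).gauss(0.0, 1.0) for _ in range(2048)]
token = projection_signature(embedding, planes)
print(token)  # a 16-character bit-string of '0's and '1's
```

A real random projection tree recursively splits the space rather than hashing it flat, but the ‘which side of the knife’ intuition is the same.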

Next I went to Matt Overstreet’s talk on Vespa, a recently open-sourced search and Big Data engine from Oath (a part of Yahoo! Inc.). Matt described how Vespa could be used to build highly scalable personalised recommendation, search or real-time data display applications, and took us through how Vespa is configured through a series of APIs and XML files. Interestingly (and perhaps unsurprisingly) Vespa has very little support for languages other than English at present. Queries are carried out through its own SQL-like language, YQL, and grouping and data aggregation functions are available. He also described how Vespa can use multidimensional arrays of values – tensors, for example from a neural network. Matt recommended we all try out Vespa – but on a cloud service, not a low-powered laptop!
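As a sketch of what querying Vespa might look like (the endpoint, field names and grouping expression below are my own assumptions for illustration, not taken from the talk), a YQL query with aggregation is passed to Vespa’s HTTP search API as a query-string parameter:

```python
import json
import urllib.parse
import urllib.request

def build_vespa_query(yql, hits=10):
    """Build the query-string parameters for Vespa's /search/ HTTP API."""
    return urllib.parse.urlencode({'yql': yql, 'hits': hits})

# A full-text match combined with a grouping/aggregation pipeline –
# field and document names here are hypothetical.
yql = ('select * from sources * where default contains "laptop" '
       '| all(group(category) each(output(count())))')
params = build_vespa_query(yql, hits=5)
print(params)

def run_query(endpoint='http://localhost:8080/search/'):
    """Send the query to a running Vespa container (endpoint assumed)."""
    with urllib.request.urlopen(endpoint + '?' + params) as resp:
        return json.loads(resp.read())
```

Running `run_query()` requires a deployed Vespa application; the point here is only the shape of the request.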

Ryan Pedala was up next to talk about named entity recognition (NER) and how it can be used to annotate or label data. He showed his experiments with tools including Prodigy and a custom GUI he had built, compared various NER libraries such as Stanford NLP and OpenNLP, and referenced an interesting paper on NER for travel-related queries. I didn’t learn a whole lot of new information from this talk, but it may have been useful to those who haven’t considered using NER before.
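To show the kind of labelling NER produces – this is nothing more than a toy, dictionary-based sketch, not the trained statistical models that Stanford NLP or OpenNLP use:

```python
# A toy gazetteer-based entity tagger. Real NER systems learn from
# annotated training data rather than looking words up in a dictionary.
GAZETTEER = {
    'london': 'LOCATION',
    'paris': 'LOCATION',
    'hilton': 'ORGANIZATION',
    'june': 'DATE',
}

def tag_entities(text):
    """Label each token with an entity type, or 'O' (outside) if unknown."""
    tokens = text.lower().replace(',', '').split()
    return [(tok, GAZETTEER.get(tok, 'O')) for tok in tokens]

labels = tag_entities('Flights to Paris in June')
print(labels)
# [('flights', 'O'), ('to', 'O'), ('paris', 'LOCATION'), ('in', 'O'), ('june', 'DATE')]
```

For travel-related queries like the one above, recognising the LOCATION and DATE entities is what lets a search application route them to structured filters rather than plain keyword matching.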

Scott Stultz talked next on how to integrate business rules into a search application. He started with examples of key performance indicators (KPIs) that can be used for search – e.g. conversion ratios or average purchase values – and how these should be tied to search metrics. They can then be measured both before and after changes are made to the search application: automated unit tests and more complex integration tests should also be used to check that search performance is actually improving. Interestingly for me, he included under the umbrella of integration tests techniques such as replaying recent queries extracted from logs against the search engine. He made some good practical points, such as ‘think twice before adding complexity’, and noted that good autocomplete will often ‘cannibalize’ existing search, as users simply choose the suggested completion rather than typing the entire query. There were some great tips here for practical business-focused search improvements.
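The idea of replaying logged queries as an integration test can be sketched as follows; the `search` function here is a stand-in for a real engine, and all names and documents are invented:

```python
def search(query):
    """Stand-in for the real search engine – returns ranked document ids."""
    index = {
        'red shoes': ['doc-shoes-1', 'doc-shoes-2', 'doc-boots-9'],
        'wireless mouse': ['doc-mouse-3', 'doc-kbd-7'],
    }
    return index.get(query, [])

def replay_logged_queries(expectations, top_n=3):
    """For each logged query, check an expected document still appears
    in the top-n results after a change to the search configuration."""
    failures = []
    for query, expected_doc in expectations:
        if expected_doc not in search(query)[:top_n]:
            failures.append((query, expected_doc))
    return failures

# Logged queries paired with the document users actually clicked on
logged = [('red shoes', 'doc-shoes-1'), ('wireless mouse', 'doc-mouse-3')]
print(replay_logged_queries(logged))  # [] – no regressions
```

Run before and after a relevance change, the list of failures gives a concrete signal to tie back to KPIs like conversion rate.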

I then went to hear John Kane’s talk about interleaving for relevancy tuning, which covered a method for updating a machine learning model in real time using feedback from the current ranking powered by that model – simply by interleaving the results from two versions of the model. This isn’t a particularly new technique, and the talk was something of a product pitch for 904Labs, but the technique does apparently work and some customers have seen a 30% increase in conversion rate.
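A simplified version of interleaving (loosely in the style of team-draft interleaving – this is not necessarily 904Labs’ method) might look like this: merge the two rankings, remember which model contributed each result, and credit clicks to whichever model supplied the clicked document.

```python
import random

def interleave(results_a, results_b, seed=0):
    """Merge two rankings, recording which model contributed each result.

    Clicks on results credited to A or B give live feedback on which
    model is performing better, without a full A/B traffic split.
    """
    rng = random.Random(seed)
    interleaved, credit = [], []
    a, b = list(results_a), list(results_b)
    while a or b:
        pick_a = bool(a) and (not b or rng.random() < 0.5)
        source, label = (a, 'A') if pick_a else (b, 'B')
        doc = source.pop(0)
        interleaved.append(doc)
        credit.append(label)
        # drop the same document from the other list to avoid duplicates
        if doc in a:
            a.remove(doc)
        if doc in b:
            b.remove(doc)
    return interleaved, credit

docs, credit = interleave(['d1', 'd2', 'd3'], ['d2', 'd4', 'd3'])
print(list(zip(docs, credit)))
```

Proper team-draft interleaving alternates picks in rounds with a coin-flip per round; the sketch above just shows the credit-assignment idea.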

The last talk of the day came from Tim Allison on an evaluation platform for Apache Tika, a well-known library for text extraction from a variety of file formats. Interspersed with tales of ‘amusing’ and sometimes catastrophic ways for text extraction to fail, Tim described how tika-eval can be used to test how good Tika is at extracting data and to output a set of metrics, e.g. how many different MIME file types were found. The tool is now used to run regular regression tests for Tika on a dataset of 3 million files from the CommonCrawl project. We’re regular users of Tika at Flax and it was great to hear how the project is moving forward.
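In the same spirit as tika-eval’s reports (a toy sketch, not the actual tool’s API or output format), extraction metrics over a batch of results might be computed like this:

```python
from collections import Counter

def extraction_metrics(results):
    """Summarise a batch of text-extraction results: MIME type counts,
    how many extractions failed, and the average extracted text length."""
    mime_counts = Counter(r['mime'] for r in results)
    failures = sum(1 for r in results if r['text'] is None)
    texts = [r['text'] for r in results if r['text'] is not None]
    avg_len = sum(len(t) for t in texts) / len(texts) if texts else 0.0
    return {'mime_counts': dict(mime_counts),
            'failures': failures,
            'avg_text_length': avg_len}

batch = [
    {'mime': 'application/pdf', 'text': 'Hello world'},
    {'mime': 'application/pdf', 'text': None},   # extraction failed
    {'mime': 'text/html', 'text': 'Some web page text'},
]
metrics = extraction_metrics(batch)
print(metrics)
```

Tracked over time against a large fixed corpus, sudden shifts in metrics like these are what flag a regression in the extraction code.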

Doug Turnbull finished the conference with a brief summing up and thanks. There was a general feeling in the room that this conference was the start of something big, and people were already asking when the next event would be! One of my takeaways from the event was that even though many of the talks used open source tools (perhaps unsurprisingly, as it is so much easier to talk about these publicly), the relevance tuning techniques and methods described can be applied to any search engine. The attendees were from a huge variety of companies, large and small, open and closed source based. This was an event about relevance engineering, not technology choices.

Thanks to all at OSC who made the event possible and for inviting us all to your home town – I think most if not all of us would happily visit again.

A lack of cognition and some fresh FUD from Forrester
Posted 14 June 2017 – http://www.flax.co.uk/blog/2017/06/14/lack-cognition-fresh-fud-forrester/

The post A lack of cognition and some fresh FUD from Forrester appeared first on Flax.

Last night the estimable Martin White, intranet and enterprise search expert and author of many books on the subject, flagged up two surprising articles from Forrester who have declared that Cognitive Search (we’ll define this using their own terms in a little while) is ‘overshadowing’ the ‘outmoded’ Enterprise Search, with a final dig at how much better commercial options are compared to open source.

Let’s start with the definition, helpfully provided in another post from Forrester. Apparently ‘Cognitive search solutions are different because they: Scale to handle a multitude of data sources and types’. Every enterprise search engine promises to index a multiplicity of content, both structured and unstructured, so I can’t see why this is anything new. Next we have ‘Employ artificial intelligence technologies….natural language processing (NLP) and machine learning’. Again, NLP has been a feature of closed and open source enterprise search systems for years, be it for entity extraction, sentiment analysis or sentence parsing. Machine learning is a rising star, but not always easy to apply to search problems. However, I’m not convinced either of these is really ‘artificial intelligence’. Astonishingly, the last point is that Cognitive solutions ‘Enable developers to build search applications…provide SDKs, APIs, and/or visual design tools’. Every search engine needs user applications on top and has APIs of some kind, so this makes little sense to me.

Returning to the first article, we hear that indexing is ‘old fashioned’ (try building a search application without indexing – I’d love to know how you’d manage that!) but luckily a group of closed-source search vendors have managed to ‘out-innovate’ the open source folks. We have the usual hackneyed ‘XX% of knowledge workers can’t find what they need’ phrases, plus a sprinkling of ‘wouldn’t it be nice if everything worked like Siri or Amazon or Google’ (yes, it would, but comparing systems built by Internet giants on multi-billion-page Web indexes to enterprise search over, at most, a few million non-curated, non-hyperlinked business documents is just silly – these are entirely different sets of problems). Again, we have mentions of basic NLP techniques as if they’re something new and amazing.

The article mentions a group of closed source vendors who appear in Forrester’s Wave report, which, like Gartner’s Magic Quadrant, attempts to boil down what is in reality a very complex field into some overly simplistic graphics. Finishing with a quick dig at two open source companies (Elastic, who don’t really sell an enterprise search engine anyway, and Lucidworks, whose Fusion 3 product really is a serious contender in this field, integrating Apache Spark for machine learning), it ignores the fact that open source search is developing at a furious rate – and there are machine learning features that actually work in practice being built and used by companies such as Bloomberg – and because they’re open source, these are available for anyone else to use.

To be honest, it’s very difficult, if not impossible, to out-innovate thousands of developers across the world working in a collaborative manner. What we see in articles like the above is not analysis but marketing – a promise that shiny magic AI robots will solve your search problems, even if you don’t have a clear specification, an effective search team, clean and up-to-date content and all the many other things that are necessary to make search work well (to research this further, read Martin’s books or the one I’m co-authoring at present – out later this year!). One should also bear in mind that marketing has to be paid for – and I’m pretty sure that the various closed-source vendors now providing downloads of Forrester’s report (because of course, they’re mentioned positively in it) don’t get to do so for free.

UPDATE: Martin has written three blog posts in response to both Gartner and Forrester’s recent reports which I urge you (and them) to read if you really want to know how new (or not) Cognitive Search is.

The closed-source topping on the open-source Elasticsearch
Posted 28 January 2014 – http://www.flax.co.uk/blog/2014/01/28/the-closed-source-topping-on-the-open-source-elasticsearch/

The post The closed-source topping on the open-source Elasticsearch appeared first on Flax.

Today Elasticsearch (the company, not the software) announced their first commercial, closed-source product, a monitoring plugin for Elasticsearch (the software, not the company – yes I know this is confusing, one might suspect deliberately so). Amongst the raft of press releases there are a few small liberties with the truth, for example describing Elasticsearch (the company) as ‘founded in 2012 by the people behind the Elasticsearch and Apache Lucene open source projects’ – surely the latter project was started by Doug Cutting, who isn’t part of the aforementioned company.

Adding some closed-source dusting to a popular open-source distribution is nothing new, of course – many companies do it, especially those that are venture funded – as it’s a way of building intellectual property while also taking full advantage of the open-source model in terms of user adoption. Other strategies include curated distributions, such as that offered by Heliosearch (founded by Solr creator Yonik Seeley), and our partner LucidWorks‘ complete packaged search applications. It can help lock potential clients into your version of the software and your vision of the future, although of course they are still free to download the core and go it alone (or engage people like us to help do so), which helps them retain some control.

It’s going to be interesting to see how this strategy develops for Elasticsearch (for the last time, the company). At Flax we’ve also built various additional software components for search applications – but as we have no external investors to please, these are freely available as open-source software, including Luwak, our fast stored query engine; Clade, a taxonomy/classification prototype; and even some file format extractors.
