machine learning – Flax http://www.flax.co.uk – The Open Source Search Specialists

Activate 2018 day 2 – AI and Search in Montreal (7 November 2018)

I’ve already written about Day 1 of Lucidworks’ Activate conference; the second day started with a keynote on ‘moral code’, ethics & AI which unfortunately I missed, but a colleague reported that it was very encouraging to see topics such as diversity and inclusion raised in a keynote talk. Note that videos of some of the talks are starting to appear on Lucidworks’ YouTube channel.

Steve Rowe of Lucidworks gave a talk on what’s coming in Lucene/Solr 8 – a long list of improvements and new features from 7.x releases including autoscaling of SolrCloud clusters, better cross-datacentre replication (CDCR), time routed index aliases for time-series data, new replica types, streaming expressions, a JSON query DSL and better segment merge policies… it’s clear that a huge amount of work continues to go into Solr. In 8.x releases we’ll hopefully see HTTP/2 capability for faster throughput and perhaps Luke, the Lucene Index Toolbox, becoming part of the main project.
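To give a flavour of the JSON query DSL mentioned above, here is a sketch of what a request body might look like. The field names (`title`, `content`, `content_type`) are placeholders for whatever your own schema defines, and in practice the payload would be POSTed to a collection’s `/select` endpoint:

```python
import json

# A minimal Solr JSON Request API body (illustrative only; the field
# names here are invented and would need to match your own schema).
request_body = {
    "query": {"edismax": {"qf": "title content", "query": "solr autoscaling"}},
    "filter": ["content_type:blog"],  # cached filter query, like fq=
    "limit": 10,
    "sort": "score desc",
}

# Serialise to show the shape of the payload that would be POSTed
# to /solr/<collection>/select with Content-Type: application/json.
payload = json.dumps(request_body, indent=2)
print(payload)
```

The appeal over the traditional parameter-style API is that queries become structured data, easy to build and validate programmatically.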

Cassandra Targett, also of Lucidworks, spoke about the Lucene/Solr Reference Guide which is now actually part of Solr’s source code in Asciidoc format. She had attempted to build this into a searchable, fully-hyperlinked documentation source using Solr itself but this quickly ran into issues with HTML tags and maintaining correct links. Lucidworks’ own Site Search did a lot better but the result still wasn’t perfect. Work remains to be done here but encouragingly in the last few weeks there’s also been some thinking about how to better document Solr’s huge and complex test suite on SOLR-12930. As Cassandra mentioned, effective documentation isn’t always the focus of Solr committers, but it’s essential for Solr users.

The next talk I caught came from Andrzej Bialecki on Solr’s autoscaling functionality and some impressive testing he’s done. Autoscaling analyzes your Solr cluster and makes suggestions about how to restructure it – which you can then do manually or automatically using other Solr features. These features are generally tested on collections of 1 billion documents – but Andrzej has manually tested them on 1 trillion simulated documents (yes, you read that right). Now that’s some scale!

The final talk I caught before the closing keynote was Chris ‘Hossman’ Hostetter on How to be a Solr Contributor, amusingly peppered with profanity as is his usual style. There were a number of us in the room with some small concerns about Solr patches that have not been committed, and in general about how Solr might need more committers and how this might happen, but the talk mainly focused on how to generate new patches. He also mentioned how new features can have an unexpected cost, as they must then be maintained and might have totally unexpected consequences for other parts of the platform. Some of the audience raised questions about Solr tests (some of which regularly fail) – however since the conference Mark Miller has taken the lead on this under SOLR-12801, which is encouraging.

The closing keynote by Trey Grainger brought together the threads of search and AI – and also mentioned that if anyone had some spare server capacity, it would be fun to properly test Solr at trillion-document scale…

So in conclusion, how did Activate compare to its previous incarnation as Lucene/Solr Revolution? Is search really the foundation of AI? Well, the talks I attended mainly focused on Solr features, but various colleagues heard about machine learning, learning-to-rank and self-aware machines, all of which are becoming easier to implement using Lucene/Solr. However, as Doug Turnbull writes, if you’re thinking of AI for search, you should be wary of the potential cost and complexity. There are no magic robots (Kevin Watters’ robot, however, is rather wonderful!).

Huge thanks must go to all at Lucidworks for putting on such a well-organised and thought-provoking event and bringing together so many Lucene/Solr enthusiasts.

Lifting the hood of AI – to find a search engine? (14 September 2018)

A few years ago much marketing noise was made about Big Data. Every software vendor suddenly had a Big Data suite; you could suddenly buy Big Data capable hardware; consultants and experts would release thought pieces, blogs and books all about Big Data and how it would change the world. The reality of course was slightly different: Big Data meant… well, it meant whatever you wanted it to mean for your commercial purpose. For some people, whatever didn’t fit in an Excel spreadsheet was Big Data; for others, with genuinely large collections of data to process, it was often hard to sort the wheat from the PR chaff and find a solution that worked.

Those of us in the search engine sector would occasionally mention that we’d been dealing with not inconsequential amounts of data for many years (for example, the founders of Flax met while building a half-billion-page web search engine back in 1999). We already knew something about distributed computing, clusters of servers and how to scale for performance and reliability. There’s even some shared history: Hadoop, the foundation of so many Big Data architectures, was created by Doug Cutting, the same person who created the search library Lucene and the web crawler Nutch – so he could build a big search engine. As a result we ended up with suites of Big Data-capable software where the clever bit was… search technology.

We’re at a similar point now with AI. No matter how many pictures of humanoid robots they use, what people are calling AI is not the Terminator or a robot companion built by a reclusive billionaire. It’s generally a combination of techniques such as machine learning (ML) and natural language processing (NLP), some of which have been around for decades, which can (if you get them right) spot patterns in data, recognise graphical shapes, analyze human speech and so on. Getting them right is the hard bit – you need good, reliable signals, models that work and, most importantly, clever people to put it all together (and few such people are available).

Again, some of the most interesting (and more likely to be real, rather than just a dodgy prototype thrown together in the hope that Google will buy your startup) work is happening in the world of search, where the underlying and necessary fundamentals of large-scale data processing, text processing, user interaction and matching are well understood through decades of experience. Here, AI techniques can be applied with practical results – for example, Learning to Rank which cleverly re-orders search results based on signals important to the business or user. So again, underneath the current trend we find a dependence on search technology. It’s unfortunate that some commentators have assumed that this means that everything in search is powered by magic AI – rather the reverse in some cases.

Activate, a conference previously known as Lucene/Solr Revolution and run by our partners Lucidworks, has brought together AI and search deliberately to explore these connections. We’re looking forward to attending next month – come and find us if you want to discuss your project!

Catching MICES – a focus on e-commerce search (19 June 2018)

The second event I attended in Berlin last week was the Mix Camp on e-commerce search (MICES), a small and focused event now in its second year and kindly hosted by Mytoys at their offices. Slides for the talks are available here and I hope videos will appear soon.

The first talk was given by Karen Renshaw of Grainger, who Flax worked with at RS Components (she also wrote a great series of blog posts for us on improving relevancy). Karen’s talk drew on her long experience of managing search teams from a business standpoint – this wasn’t about technology but about combining processes, targets and objectives to improve search quality. She showed how to get started by examining customer feedback, known issues, competitors and benchmarks; how to understand and categorise query types; how to create a test plan within a cross-functional team; and how to plan for incremental change. Testing was covered, including how to score search quality and how to examine the impact of search changes, with the message that “all aspects of search should work together to help customers through their journey”. She concluded with the clear point that there are no silver bullets, and that expectations must be managed during an ongoing, iterative process of improvement. This was a talk to set the scene for the day, containing lessons for every search manager (and a good few search technologists who often ignore the business factors!).

Next up were Christine Bellstedt & Jens Kürsten from Otto, Germany’s second biggest online retailer with over 850,000 search queries a day. Their talk focused on bringing together the users and business perspective to create a search quality testing cycle. They quoted Peter Freis’ graphic from his excellent talk at Haystack to illustrate how they created an offline system for experimentation with new ranking methods based on linear combinations of relevance scores from Solr, business performance indicators and product availability. They described how they learnt how hard it can be to select ranking features, create test query sets with suitable coverage and select appropriate metrics to measure. They also talked about how the experimentation cycle can be used to select ‘challengers’ to the current ‘champion’ ranking method, which can then be A/B tested online.
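The linear score combination Otto described might look something like the following sketch, where a weighted blend of the normalised relevance score, a business indicator and product availability decides the final ordering. The weights and field names here are invented for illustration; in a real experimentation cycle they would be the things under test:

```python
# Illustrative re-ranking by a linear combination of a (normalised)
# relevance score, a business KPI and availability. All weights and
# field names are made up for this example.
def combined_score(doc, w_relevance=0.6, w_margin=0.3, w_availability=0.1):
    return (w_relevance * doc["relevance"]        # normalised Solr score
            + w_margin * doc["margin"]            # business performance indicator
            + w_availability * doc["in_stock"])   # 1.0 if available, else 0.0

products = [
    {"id": "a", "relevance": 0.9, "margin": 0.2, "in_stock": 0.0},
    {"id": "b", "relevance": 0.7, "margin": 0.8, "in_stock": 1.0},
]
ranked = sorted(products, key=combined_score, reverse=True)
print([p["id"] for p in ranked])  # → ['b', 'a']: availability and margin win
```

Each candidate weighting becomes a ‘challenger’ to the current ‘champion’, evaluated offline first and then A/B tested online as the talk described.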

Pavel Penchev of SearchHub was next and presented their new search event collector library – a Javascript SDK which can be used to collect all kinds of metrics around user behaviour and submit them directly to a storage or analytics system (which could even be a search engine itself – e.g. Elasticsearch/Kibana). This is a very welcome development – only a couple of months ago at Haystack I heard several people bemoaning the lack of open source tools for collecting search analytics. We’ll certainly be trying out this open source library.

Andreas Brückner of e-commerce search vendor Fredhopper talked about the best way to optimise search quality in a business context. His ten headings included “build a dedicated search team” (although 14% of Fredhopper’s own customers have no dedicated search staff), “build a measurement framework” (how else can you see how revenue might be improved?) and “start with user needs, not features”. There was much to agree with in this talk from someone with long experience of the sector from a vendor viewpoint.

Johannes Peter of MediaMarktSaturn described an implementation of a ‘semantic’ search platform which attempts to understand queries such as ‘MyMobile 7 without contract’, recognising this is a combination of a product name, a Boolean operator and an attribute. He described how an ontology (perhaps showing a family of available products and their variants) can be used in combination with various rules to create a more focused query e.g. title:("MyMobile7") AND NOT (flag:contract). He also mentioned machine learning and term co-occurrence as useful methods but stressed that these experimental techniques should be treated with caution and one should ‘fail early’ if they are not producing useful results.
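A toy version of this kind of rule-based query understanding is easy to sketch: recognise a known product name plus a ‘without <attribute>’ pattern and rewrite it as a filtered query. The product list and the `flag` field are invented for the illustration; a real system would consult an ontology rather than a hard-coded dictionary:

```python
import re

# Naive rule-based query rewriting: product name + "without <attribute>"
# becomes a Solr-style filtered query. KNOWN_PRODUCTS stands in for the
# ontology lookup a real system would use.
KNOWN_PRODUCTS = {"mymobile 7": "MyMobile 7"}

def parse(query):
    m = re.match(r"(?i)(.+?)\s+without\s+(\w+)$", query.strip())
    if m and m.group(1).lower() in KNOWN_PRODUCTS:
        product = KNOWN_PRODUCTS[m.group(1).lower()]
        return 'title:("%s") AND NOT (flag:%s)' % (product, m.group(2).lower())
    return query  # fall back to the raw query if no rule matches

print(parse("MyMobile 7 without contract"))
```

The ‘fail early’ advice applies here too: if no rule matches confidently, falling back to the plain query is safer than a wrong rewrite.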

Ashraf Aaref & Felipe Besson described their journey using Learning to Rank to improve search at GetYourGuide, a marketplace for activities (e.g. tours and holidays). Using Elasticsearch and the LtR plugin recently released by our partners OpenSourceConnections they tried to improve the results for their ‘location pages’ (e.g. for Paris) but their first iteration actually gave worse results than the current system and was thus rejected by their QA process. They hope to repeat the process using what they have learned about how difficult it is to create good judgement data. This isn’t the first talk I’ve seen that honestly admits that ML approaches to improving search aren’t a magic silver bullet and the work itself is difficult and requires significant investment.

Duncan Blythe of Zalando gave what was the most forward-looking talk of the event, showing a pure Deep Learning approach to matching search queries to results – no query parsing, language analysis, ranking or anything, just a system that tries to learn what queries match which results for a product search. This reminded me of Doug & Tommaso’s talk at Buzzwords a couple of days before, using neural networks to learn the journey between query and document. Duncan did admit that this technique is computationally expensive and in no way ready for production, but it was exciting to hear about such cutting-edge (and well funded) research.
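The core idea behind such a learned matcher can be illustrated in a drastically simplified form: queries and products are both represented as dense vectors and matched by similarity, with no query parsing or language analysis at all. In the real system those vectors are learned by deep networks; the hand-made toy vectors below only show the matching step:

```python
import math

# Toy vector-space matching: in a real deep learning system the vectors
# would be learned representations of queries and products; here they
# are hand-made three-dimensional stand-ins.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.9, 0.1, 0.0]  # stands in for an encoded user query
products = {
    "red dress": [0.8, 0.2, 0.1],
    "blue shoes": [0.1, 0.9, 0.4],
}
best = max(products, key=lambda name: cosine(query_vec, products[name]))
print(best)  # → red dress
```

The expensive part Duncan alluded to is not this lookup but learning good vectors in the first place, and serving them at e-commerce scale.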

Doug Turnbull was the last speaker with a call to arms for more open source tooling, datasets and relevance judgements to be made available so we can all build better search technology. He gave a similar talk to keynote the Haystack event two months ago and you won’t be surprised to hear that I completely agree with his viewpoint – we all benefit from sharing information.

Unfortunately I had to leave MICES at this point and missed the more informal ‘bar camp’ event to follow, but I would like to thank all the hosts and organisers especially René Kriegler for such an interesting day. There seems to be a great community forming around e-commerce search which is highly encouraging – after all, this is one of the few sectors where one can draw a clear line between improving relevance and delivering more revenue.

A lack of cognition and some fresh FUD from Forrester (14 June 2017)

Last night the estimable Martin White, intranet and enterprise search expert and author of many books on the subject, flagged up two surprising articles from Forrester, who have declared that Cognitive Search (we’ll define this using their own terms in a little while) is ‘overshadowing’ the ‘outmoded’ Enterprise Search – with a final dig at how much better commercial options are compared to open source.

Let’s start with the definition, helpfully provided in another post from Forrester. Apparently ‘Cognitive search solutions are different because they: Scale to handle a multitude of data sources and types’. Every enterprise search engine promises to index a multiplicity of content both structured and unstructured, so I can’t see why this is anything new. Next we have ‘Employ artificial intelligence technologies….natural language processing (NLP) and machine learning’.  Again, NLP has been a feature of closed and open source enterprise search systems for years, be it for entity extraction, sentiment analysis or sentence parsing. Machine learning is a rising star but not always easy to apply to search problems. However I’m not convinced either of these are really ‘artificial intelligence’. Astonishingly, the last point is that Cognitive solutions ‘Enable developers to build search applications…provide SDKs, APIs, and/or visual design tools’. Every search engine needs user applications on top and has APIs of some kind, so this makes little sense to me.

Returning to the first article, we hear that indexing is ‘old fashioned’ (try building a search application without indexing – I’d love to know how you’d manage that!) but luckily a group of closed-source search vendors have managed to ‘out-innovate’ the open source folks. We have the usual hackneyed ‘XX% of knowledge workers can’t find what they need’ phrases plus a sprinkling of ‘wouldn’t it be nice if everything worked like Siri or Amazon or Google’ (yes, it would, but comparing systems built on multi-billion-page Web indexes by Internet giants to enterprise search over at most a few million non-curated, non-hyperlinked business documents is just silly – these are entirely different sets of problems). Again, we have mentions of basic NLP techniques like they’re something new and amazing.

The article mentions a group of closed source vendors who appear in Forrester’s Wave report, which, like Gartner’s Magic Quadrant, attempts to boil down what is in reality a very complex field into some overly simplistic graphics. Finishing with a quick dig at two open source companies (Elastic, who don’t really sell an enterprise search engine anyway, and Lucidworks, whose Fusion 3 product really is a serious contender in this field, integrating Apache Spark for machine learning) it ignores the fact that open source search is developing at a furious rate – there are machine learning features that actually work in practice being built and used by companies such as Bloomberg, and because they’re open source, these are available for anyone else to use.

To be honest, it’s very difficult, if not impossible, to out-innovate thousands of developers across the world working in a collaborative manner. What we see in articles like the above is not analysis but marketing – a promise that shiny magic AI robots will solve your search problems, even if you don’t have a clear specification, an effective search team, clean and up-to-date content and all the many other things that are necessary to make search work well (to research this further read Martin’s books or the one I’m co-authoring at present – out later this year!). One should also bear in mind that marketing has to be paid for – and I’m pretty sure that the various closed-source vendors now providing downloads of Forrester’s report (because of course, they’re mentioned positively in it) don’t get to do so for free.

UPDATE: Martin has written three blog posts in response to both Gartner and Forrester’s recent reports which I urge you (and them) to read if you really want to know how new (or not) Cognitive Search is.

London Lucene/Solr Meetup – Learning to Rank and Hibernate Search (24 February 2016)

Back to the very impressive Bloomberg lecture theatre for this month’s Lucene/Solr Meetup, with a good turnout (I’m guessing 60-70 people). Our first talk came from Diego Ceccarelli of Bloomberg on how his team have created a Solr implementation of Learning to Rank, an improved way to rank search results using machine learning. Diego first took us through the basics of Lucene’s ranking methods, based on the venerable TF/IDF algorithm (although note that BM25 will be the default very soon). Bloomberg’s implementation first retrieves 1000 search results using standard TF/IDF (which is fast) and then extracts ‘features’ (a simple example might be ‘does the title match the search query?’) which are then fed to a machine learning model. This model is then used to re-rank the 1000 initial results and the top 10 supplied to the user. Interestingly, they have chosen to implement the features as Lucene queries, allowing for easy re-use. Initial tests have shown some metrics such as ‘clicks on the first result’ up by 10%, which is encouraging. There is now a Solr patch (SOLR-8542) which they hope to commit to Solr soon, and you can find slides and a video of a previous presentation on this topic online. I first heard about Learning to Rank from Microsoft Research some years ago and it’s great to see an open source implementation.
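The two-phase ranking Diego described can be sketched as follows: a fast first pass produces candidates with a TF/IDF-style score, then a learned model re-ranks them using extracted features. The feature names and weights below are invented, and the ‘model’ is trivially linear; the real implementation expresses features as Lucene queries and supports properly trained models:

```python
# Sketch of two-phase Learning to Rank: candidates from a fast first
# pass are re-ranked by a learned model over extracted features.
# Feature names, weights and documents are invented for illustration.
def extract_features(doc, query):
    return {
        "title_match": 1.0 if query in doc["title"].lower() else 0.0,
        "recency": doc["recency"],          # e.g. normalised publish date
        "first_pass_score": doc["score"],   # the original TF/IDF-style score
    }

MODEL_WEIGHTS = {"title_match": 2.0, "recency": 0.5, "first_pass_score": 1.0}

def rerank(candidates, query):
    def model_score(doc):
        feats = extract_features(doc, query)
        return sum(MODEL_WEIGHTS[name] * value for name, value in feats.items())
    return sorted(candidates, key=model_score, reverse=True)

candidates = [  # in practice the top ~1000 documents from the first pass
    {"id": 1, "title": "Solr news", "recency": 0.9, "score": 1.2},
    {"id": 2, "title": "Learning to rank in Solr", "recency": 0.2, "score": 1.0},
]
print([d["id"] for d in rerank(candidates, "learning to rank")])  # → [2, 1]
```

Only re-ranking a bounded candidate set keeps the expensive model off the critical path for the full index, which is why the first pass matters.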

Next Sanne Grinovero of Red Hat talked about Hibernate Search, an implementation of full-text search for users of this Java ORM. He gave us some great examples of how relational databases can be bad at full text search and thus the need for a full-text engine like Lucene. His implementation hides some of the finer details of Lucene but allows use of advanced Lucene API calls where necessary, and automatically keeps the Lucene index in sync with a relational database. A simple query DSL is available which he demonstrated in use for indexing and querying Twitter data. He then told us about Infinispan, a highly scalable key-value store which can also be used for storing Lucene indexes, and mentioned ongoing work to add Elasticsearch and Solr integration.

We finished with a brief informal Q&A session outside; thanks to both presenters and to my co-hosts at Bloomberg for helping to organise the event. We hope to run another Meetup in a couple of months – as ever, offers of talks, a venue and sponsorship of snacks & drinks are very welcome!

Outside the search box – when you need more than just a search engine (6 December 2011)

Core search features are increasingly a commodity – you can knock up some indexing scripts in whatever scripting language you like in a short time, build a searchable inverted index with freely available open source software, and hook up your search UI quickly via HTTP – this all used to be a lot harder than it is now (unfortunately some vendors would have you believe this is still the case, which is reflected in their hefty price tags).
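To take the point above literally, the core data structure behind all of this is small enough to sketch in a few lines: an inverted index mapping terms to the documents containing them, plus a naive AND query. Real engines layer tokenisation, stemming, ranking and compression on top of exactly this structure:

```python
from collections import defaultdict

# A minimal inverted index: term -> set of ids of documents containing
# that term, with a naive conjunctive (AND) search over it.
docs = {
    1: "open source search software",
    2: "search engines build inverted indexes",
    3: "open standards matter",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    results = set(docs)
    for term in terms:
        results &= index.get(term, set())  # intersect posting lists
    return sorted(results)

print(search("open", "search"))  # → [1]: only doc 1 contains both terms
```

The hard parts of a production engine are elsewhere: relevance ranking, scale and keeping the index fresh, which is rather the point of the paragraph above.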

However we’re increasingly asked to develop features outside the traditional search stack, to make this standard search a lot more accurate/relevant or to apply ‘search’ to non-traditional areas. For example, Named Entity Recognition (NER) is a powerful technique to extract entities such as proper names from text – these can then be fed back into the indexing process as metadata for each document. Part of Speech (POS) tagging tells you which words are nouns, verbs etc. Sentiment Analysis promises to give you some idea of the ‘tone’ of a comment or news piece – positive, negative or neutral for example, very useful in e-commerce applications (did customers like your product?). Word Sense Disambiguation (WSD) attempts to tell you the context a word is being used in (did you mean pen for writing or pen for livestock?).
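To make the NER idea concrete, here is a deliberately naive illustration: pull out runs of capitalised words as candidate proper-name entities, which could then be attached to a document as metadata at indexing time. Real NER systems use trained statistical models, not a regex; this only shows where the extracted entities would fit in the pipeline:

```python
import re

# Deliberately naive "NER": treat runs of two or more capitalised words
# as candidate entities. A real system (e.g. a trained statistical
# tagger) is far more robust; this only illustrates the output shape.
def naive_entities(text):
    return re.findall(r"\b(?:[A-Z][a-z]+\s)+[A-Z][a-z]+\b", text)

text = "Flax met Thomson Reuters in New Zealand to discuss entity extraction."
print(naive_entities(text))  # → ['Thomson Reuters', 'New Zealand']
```

Each extracted entity would then be fed back into the indexing process as a metadata field, exactly as the paragraph above describes.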

There are commercial offerings from companies such as Nstein and Lexalytics that offer some of these features. An increasing number of companies provide their services as APIs, where you pay-per-use – for example Thomson Reuters’ OpenCalais service, Pingar from New Zealand and WSD specialists SpringSense. We’ve also worked with open source tools such as Stanford NLP which perform very well when compared to commercial offerings (and can certainly compete on cost grounds). Gensim is a powerful package that allows for semantic modelling of topics. The Apache Mahout machine learning library allows for these techniques to be scaled to very large data sets.

These techniques can be used to build systems that don’t just provide powerful and enhanced search, but automatic categorisation and classification into taxonomies, document clustering, recommendation engines and automatic identification of similar documents. It’s great to be thinking outside the box – the search box that is!
