conference – Flax

Lifting the hood of AI – to find a search engine?

Charlie Hull — Fri, 14 Sep 2018 09:56:49 +0000

A few years ago much marketing noise was made about Big Data. Every software vendor suddenly had a Big Data suite; you could suddenly buy Big Data-capable hardware; consultants and experts would release thought pieces, blogs and books all about Big Data and how it would change the world. The reality of course was slightly different: Big Data meant…well, it meant whatever you wanted it to mean for your commercial purpose. For some people, whatever didn’t fit in an Excel spreadsheet was Big Data; for others with genuinely large collections of data to process, it was often hard to sort the wheat from the PR chaff and find a solution that worked.

Those of us in the search engine sector would occasionally mention that we’d been dealing with not inconsequential amounts of data for many years (for example, the founders of Flax met while building a half-billion-page web search engine back in 1999). We already knew something about distributed computing, clusters of servers and how to scale for performance and reliability. There’s even some shared history: Hadoop, the foundation of so many Big Data architectures, was created by the same person who created the search library Lucene and the web crawler Nutch – so he could build a big search engine. As a result we ended up with suites of Big Data-capable software where the clever bit was… search technology.

We’re at a similar point now with AI. No matter how many pictures of humanoid robots they use, what people are calling AI is not the Terminator or a robot companion built by a reclusive billionaire. It’s generally a combination of techniques such as machine learning (ML) and natural language processing (NLP), some of which have been around for decades, which can (if you get them right) spot patterns in data, recognise graphical shapes, analyse human speech etc. Getting them right is the hard bit – you need good, reliable signals; models that work; and, most importantly, clever people to put it all together (and few of these people are available).

Again, some of the most interesting (and more likely to be real, rather than just a dodgy prototype thrown together in the hope that Google will buy your startup) work is happening in the world of search, where the underlying and necessary fundamentals of large-scale data processing, text processing, user interaction and matching are well understood through decades of experience. Here, AI techniques can be applied with practical results – for example, Learning to Rank which cleverly re-orders search results based on signals important to the business or user. So again, underneath the current trend we find a dependence on search technology. It’s unfortunate that some commentators have assumed that this means that everything in search is powered by magic AI – rather the reverse in some cases.

Activate, a conference previously known as Lucene Revolution and run by our partners Lucidworks, has brought together AI and search deliberately to explore these connections. We’re looking forward to attending next month – come and find us if you want to discuss your project!

The post Lifting the hood of AI – to find a search engine? appeared first on Flax.

Three weeks of search events this October from Flax

Charlie Hull — Tue, 04 Sep 2018 10:11:56 +0000

Flax has always been very active at conferences and events – we enjoy meeting people to talk about search! With much of our consultancy work being carried out remotely these days, attending events is a great way to catch up in person with our clients, colleagues and peers and to learn from others about what works (and what doesn’t) when building cutting-edge search solutions. I’m thus very glad to announce that we’re running three search events this coming October.

Earlier in the year I attended Haystack in Charlottesville, one of my favourite search conferences ever – and almost immediately began to think about whether we could run a similar event here in Europe. Although we’ve only had a few months I’m very happy to say we’ve managed to pull together a high-quality programme of talks for our first Haystack Europe event, to be held in London on October 2nd. The event is focused on search relevance from both a business and a technical perspective and we have speakers from global retailers as well as specialist consultants and authors. Tickets are already selling well and we have limited space, so I would encourage you to register as soon as you can (Haystack USA sold out even after the capacity was increased). We’re running the event in partnership with Open Source Connections.

The next week we’re running a Lucene Hackday on October 9th as part of our London Lucene/Solr Meetup programme. Building on previous successful events, this is a day of hacking on the Apache Lucene search engine and associated software such as Apache Solr and Elasticsearch. You can read up on what we achieved at our last event a couple of years ago – again, space is limited, so sign up soon to this free event (huge thanks to Mimecast for providing the venue and to Elastic for sponsoring drinks and food for an evening get-together afterwards). Bring a laptop and your ideas (and do comment on the event page if you have any suggestions for what we should work on).

We’ll be flying to Montreal soon afterwards to attend the Activate conference (run by our partners Lucidworks) and while we’re there we’ll host another free Lucene Hackday on October 15th. Again, this would not be possible without sponsorship and so thanks must go to Netgovern, SearchStax and One More Cloud. Remember to tell us your ideas in the comments.

So that’s three weeks of excellent search events – see you there!


Haystack, the search relevance conference – day 2

Charlie Hull — Mon, 23 Apr 2018 15:23:56 +0000

Two weeks ago I attended the Haystack relevance conference – I’ve already written about my overall impressions and about the first day’s talks, but the following are some more notes on the conference sessions. Note that some of the presentations I attended have already been covered in detail by Sujit Pal’s excellent blog. Some of the presentations I haven’t linked to directly have now appeared on the conference website.

The second day of the event started for me with the enjoyable job of hosting a ‘fishbowl’ style panel session titled “No, You Don’t Want to Do It Like That! Stories from the search trenches”. The idea was that a rotating panel of speakers would tell us tales of their worst and hopefully most instructive search tuning experiences and we heard some great stories – this was by its nature an informal session and I don’t think anyone kept any notes (probably a good idea in the case of commercial sensitivity!).

The next talk was my favourite of the conference, given by René Kriegler on relevance scoring using product data and image recognition. René is an expert on e-commerce search (he also runs the MICES event in Berlin which I’m looking forward to) and described how this domain is unlike many others: the interests of the consumer (e.g. price or availability) become part of the relevance criteria. One of the interesting questions for e-commerce applications is how ranking can affect profit. Standard TF/IDF models don’t always work well for e-commerce data with short fields, leading to a score that can be almost binary: as he said ‘a laptop can’t be more laptop-ish than another’. Image recognition is a potentially useful technique and he demonstrated a way to take the output of Google’s Inception machine learning model and use it to enrich documents within a search index. However, there can be over 1000 vectors output from this model and he described how a technique called random projection trees can be used to partition the vector space and thus produce simpler data for adding to the index (I think this is basically like slicing up a fruitcake and recording whether a currant was one side of the knife or the other, but that may not be quite how it works!). René has built a Solr plugin to implement this technique.
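To make the fruitcake picture a little more concrete, here is a minimal Python sketch of the general random projection tree idea. It is not René's Solr plugin, and everything in it (the function names, the median split, the bit-string token) is an assumption made for illustration. Each level of the tree picks a random direction, projects the vectors onto it and splits them at the median; a vector is then described by which side of each "knife cut" it fell on.

```python
import random
from statistics import median


def dot(a, b):
    """Plain dot product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def build_tree(vectors, depth, rng):
    """Recursively split `vectors` with random hyperplanes, at most `depth` levels deep."""
    if depth == 0 or len(vectors) < 2:
        return None  # leaf: nothing left to split
    # A random direction plays the role of the knife.
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(vectors[0]))]
    projections = [dot(v, direction) for v in vectors]
    threshold = median(projections)
    left = [v for v, p in zip(vectors, projections) if p < threshold]
    right = [v for v, p in zip(vectors, projections) if p >= threshold]
    return {
        "direction": direction,
        "threshold": threshold,
        "left": build_tree(left, depth - 1, rng),
        "right": build_tree(right, depth - 1, rng),
    }


def encode(tree, vector):
    """Return a short bit string recording which side of each split the vector falls on."""
    bits = []
    node = tree
    while node is not None:
        side = "1" if dot(vector, node["direction"]) >= node["threshold"] else "0"
        bits.append(side)
        node = node["right"] if side == "1" else node["left"]
    return "".join(bits)


if __name__ == "__main__":
    rng = random.Random(7)
    # Hypothetical stand-ins for image feature vectors.
    images = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(16)]
    tree = build_tree(images, depth=4, rng=rng)
    print(encode(tree, images[0]))
```

The resulting bit string is far smaller than the original vector and could be stored as an ordinary token in an index field, so that images landing in the same partition share a token. A real system would most likely build several such trees to reduce the chance of near neighbours being separated by an unlucky cut.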

Next I went to Matt Overstreet’s talk on Vespa, a recently open sourced search and Big Data library from Oath (the Verizon subsidiary that includes Yahoo!). Matt described how Vespa could be used to build highly scalable personalised recommendation, search or realtime data display applications and took us through how Vespa is configured through a series of APIs and XML files. Interestingly (and perhaps unsurprisingly) Vespa has very little support for languages other than English at present. Queries are carried out through its own SQL-like language, YQL, and grouping and data aggregation functions are available. He also described how Vespa can use multidimensional arrays of values – tensors, for example from a neural network. Matt recommended we all try out Vespa – but on a cloud service not a low-powered laptop!

Ryan Pedala was up next to talk about named entity recognition (NER) and how it can be used to annotate or label data. He showed his experiments with tools including Prodigy and a custom GUI he had built, compared various NER libraries such as Stanford NLP and OpenNLP, and referenced an interesting paper on NER for travel-related queries. I didn’t learn a whole lot of new information from this talk but it may have been useful to those who haven’t considered using NER before.

Scott Stultz talked next on how to integrate business rules into a search application. He started with examples of key performance indicators (KPIs) that can be used for search – e.g. conversion ratios or average purchase values and how these should be tied to search metrics. They can then be measured both before and after changes are made to the search application: automated unit tests and more complex integration tests should also be used to check that search performance is actually improving. Interestingly for me he included within the umbrella of integration tests such techniques as testing the search with recent queries extracted from logs. He made some good practical points such as ‘think twice before adding complexity’ and that good autocomplete will often ‘cannibalize’ existing search as users simply choose the suggested completion rather than finishing typing the entire query. There were some great tips here for practical business-focused search improvements.
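As a rough illustration of the idea of testing search with recent queries taken from logs, here is a small Python sketch. It is not Scott's method: the function name, the log format (pairs of query and clicked document) and the "clicked document still in the top k" check are all assumptions made for this example.

```python
def replay_logged_queries(search, logged, k=10):
    """Return the fraction of logged queries whose clicked document is still in the top k.

    `search` is any callable taking a query string and returning a ranked list of
    document ids; `logged` is a list of (query, clicked_doc_id) pairs. Both are
    hypothetical stand-ins for a real search client and real log data.
    """
    if not logged:
        return 0.0
    hits = sum(1 for query, clicked in logged if clicked in search(query)[:k])
    return hits / len(logged)


if __name__ == "__main__":
    # Toy "search engines" before and after a tuning change.
    before = {"laptop": ["p1", "p2", "p3"], "phone": ["p9", "p4"]}
    after = {"laptop": ["p2", "p1", "p3"], "phone": ["p4", "p9"]}
    logged = [("laptop", "p2"), ("phone", "p4")]

    score_before = replay_logged_queries(lambda q: before.get(q, []), logged, k=1)
    score_after = replay_logged_queries(lambda q: after.get(q, []), logged, k=1)
    print(score_before, score_after)
```

Run before and after a change, a check like this gives a simple number to compare alongside business KPIs such as conversion ratio, and it can sit in an integration test suite so that a regression fails the build.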

I then went to hear John Kane’s talk about interleaving for relevancy tuning, which covered a method for updating a machine learning model in real-time using feedback from the current ranking powered by this model – simply by interleaving the results from two versions of this model. This isn’t a particularly new technique and the talk was something of a product pitch for 904Labs, but the technique does apparently work and some customers have seen a 30% increase in conversion rate.
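For readers who haven't met interleaving before, the sketch below shows one common variant, team-draft interleaving, in Python. The talk did not say which scheme 904Labs uses, so the choice of team-draft and all the names here are assumptions. The two rankers take turns (with a coin toss when they are level) to add their best not-yet-shown result, and each click is credited to the ranker that contributed the clicked result.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (interleaved_list, team) where team maps doc -> 'A' or 'B'."""
    rng = rng or random.Random()
    interleaved, team = [], {}
    count_a = count_b = 0
    total = len(set(ranking_a) | set(ranking_b))
    while len(interleaved) < total:
        # The ranker with fewer picks goes next; ties are broken by a coin toss.
        a_turn = count_a < count_b or (count_a == count_b and rng.random() < 0.5)
        source, label = (ranking_a, "A") if a_turn else (ranking_b, "B")
        pick = next((d for d in source if d not in team), None)
        if pick is None:
            # This ranker has nothing new left, so the other one contributes instead.
            source, label = (ranking_b, "B") if a_turn else (ranking_a, "A")
            pick = next(d for d in source if d not in team)
        interleaved.append(pick)
        team[pick] = label
        if label == "A":
            count_a += 1
        else:
            count_b += 1
    return interleaved, team


def credit_clicks(team, clicked_docs):
    """Count how many clicked documents were contributed by each ranker."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in team:
            wins[team[doc]] += 1
    return wins


if __name__ == "__main__":
    shown, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d1", "d4"])
    print(shown, credit_clicks(team, clicked_docs=[shown[0]]))
```

Summed over many queries, the click credits indicate which version of the model users prefer, and that preference is the feedback signal that could drive the model update described in the talk.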

The last talk of the day came from Tim Allison on an evaluation platform for Apache Tika, a well-known library for text extraction from a variety of file formats. Interspersed with tales of ‘amusing’ and sometimes catastrophic ways for text extraction to fail, Tim described how tika-eval can be used to test how good Tika is at extracting data and output a set of metrics e.g. how many different MIME file types were found. The tool is now used to run regular regression tests for Tika on a dataset of 3 million files from the CommonCrawl project. We’re regular users of Tika at Flax and it was great to hear how the project is moving forward.

Doug Turnbull finished the conference with a brief summing up and thanks. There was a general feeling in the room that this conference was the start of something big and people were already asking when the next event would be! One of my takeaways from the event was that even though many of the talks used open source tools (perhaps unsurprisingly as it is so much easier to talk about these publicly) the relevance tuning techniques and methods described can be applied to any search engine. The attendees were from a huge variety of companies, large and small, open and closed source based. This was an event about relevance engineering, not technology choices.

Thanks to all at OSC who made the event possible and for inviting us all to your home town – I think most if not all of us would happily visit again.


Haystack, the search relevance conference – day 1

Charlie Hull — Wed, 18 Apr 2018 12:53:41 +0000

Last week I attended the Haystack relevance conference – I’ve already written about my overall impressions but the following are some more notes on the conference sessions. Note that some of the presentations I attended have already been covered in detail by Sujit Pal’s excellent blog. Those presentations I haven’t linked to directly should appear soon on the conference website.

Doug Turnbull of Open Source Connections gave the keynote presentation which led on the idea that we need more open source tools and methods for tuning relevance, including those to gather search analytics. He noted how the Learning to Rank plugins recently developed for both Solr and Elasticsearch have provided commoditized capabilities previously only described by academia and how we also need to build a cohesive community around search relevance. As it turned out, this conference did in my view signal the birth of that community.

Next up was Peter Fries who talked about a business-friendly approach to search quality, a subject close to my heart as I regularly have to discuss relevance tuning with non-technical staff. Peter described how search quality is often presented to business teams as mysterious and ‘not for them’ – without convincing these people of the value of search tuning we will fail to take account of business-related factors (and we’re also unlikely to get full buy-in for a relevance tuning project). He went on to say how important it is to include the marketing and management mindsets in this process and described a method for search tuning involving feedback loops and an ‘iron triangle’ of measurement, data and optimisation. This was a very useful talk.

I then went to hear Chao Han of Lucidworks demonstrate how their product Fusion App Studio allows one to capture various signals and use these for ‘head and tail analysis’ – looking not just at the ‘head’ of popular, often-clicked results but those in the ‘tail’ that attract few clicks, possibly due to problems such as mis-spellings. Interestingly this approach allows automatic tail query rewriting – an example might be spotting a colour word such as ‘red’ in the query and rewriting this into a field query of colour:red. This was a popular talk although the presenter was a little mysterious about the exact methodology used, perhaps unsurprisingly as Fusion is a commercial product.
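The colour example is easy to sketch in Python. The exact methodology behind Fusion was not disclosed, so what follows is only a hypothetical dictionary-based rewrite: the colour list, the function name and the `colour` field name are all assumptions made for illustration.

```python
# Hypothetical list of colour words that exist as values of a `colour` field.
COLOURS = {"red", "blue", "green", "black", "white"}


def rewrite_query(query, colours=COLOURS, field="colour"):
    """Move recognised colour words out of the free text and into field queries."""
    terms, filters = [], []
    for token in query.lower().split():
        if token in colours:
            filters.append(f"{field}:{token}")
        else:
            terms.append(token)
    return " ".join(terms + filters)


if __name__ == "__main__":
    print(rewrite_query("red shoes"))      # shoes colour:red
    print(rewrite_query("running shoes"))  # running shoes
```

A production system would presumably learn which rewrites help from the captured signals rather than rely on a fixed word list, which is the part the talk kept under wraps.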

After a tasty Mexican-themed lunch I took a short break for some meetings, so missed the next set of talks. I then went to Elizabeth Haubert’s talk on Click Analytics. She began with a description of the venerable TREC conference (now in its 27th year!), which has long evaluated relevance judgements, and of how its methods might be applied to real-world situations. For example, the TREC evaluations have shown that how relevance tests are assessed is as important as the tests themselves – the assessors are effectively also users of the system under test. She recommended calibrating both the rankings to a tester and the tester to the rankings, and creating a story around each test to put it in context and to help with disambiguation.

We finished the day with some lightning talks. Sadly I didn’t take notes on these, but check out Sujit’s aforementioned blog for more information. I do remember Tom Burgmans’ visualisation tool for Solr’s Explain debug feature which I’m very much looking forward to seeing as open source. The evening continued with a conference dinner nearby and some excellent local craft beer.

I’ll be covering the second day next.


Haystack, the relevance conference – birth of a new profession?

Charlie Hull — Mon, 16 Apr 2018 15:34:13 +0000

I’ve just returned from Charlottesville, Virginia and the Haystack search relevance conference hosted by our partners Open Source Connections. The venues were their own office and the Random Row brewery next door – added once they realised that the event had outgrown its humble beginnings as a small, informal event for maybe 50 people into a professional conference for over twice that number with attendees from as far afield as the west coast of the US, Poland and of course the UK. I’ll be writing up each day of the event and what I learned from the talks in blogs to follow, but wanted to start with my overall impressions.

I don’t think I’ve been to any other conference with such a strong sense of community or such a high quality of presentations. It was particularly refreshing to be among a group of people with such a level of search expertise and experience that at no point did anything have to be ‘dumbed down’ or over-explained. The attendee list included open source committers from projects including Apache Lucene/Solr and Apache Tika, experts in commercial search, authors of books I’ve long regarded as essential for anyone working in this field, independent consultants and those working for huge global companies. The talks were well programmed, ran exactly to schedule and covered cutting-edge topics. Between these talks the networking was relaxed and friendly and I had a chance to get to know several people in real life that I’ve previously only connected with online.

I think this conference may also have signalled the birth of a new profession of “relevance engineer” – someone who can understand both the business and technical aspects of search relevance, work with a variety of underlying search engines and expertly use the correct tools for the job to drive a continuing process of search quality improvement. Personally, I learnt a huge amount of useful information, made connections with many others in our field and have pages of notes to follow up on.

Last but by no means least, I’d like to extend my personal thanks to all at OSC who created, planned and ran the event – as a veteran of many events in both technical and non-technical fields I understand very well how much work goes into them, especially if you’re not an event planner by profession! You opened your doors to us and made us all feel very welcome and you all worked extremely hard to make this one of the best conferences I’ve ever attended.

More to follow on day 1 and day 2 soon.
