Business – Flax

Little Mermaids, Haystacks and moving on

Charlie Hull — Fri, 15 Feb 2019 09:47:25 +0000

As I announced recently Flax is joining OpenSource Connections, and I recently spent a very pleasant week in Virginia with my new colleagues discussing our plans for the year to come. Without giving too much away I can say that this is a very exciting time to be joining OSC: one thing I will be doing soon is starting to write more about OSC’s proven process for supporting our clients as they move up the search relevance curve.

However before then I’ll be speaking at a few events. At the end of this month I’ll be in Copenhagen to speak on Keeping Search Relevant in a Digital Workplace at the Intrateam conference. This is a fantastic conference on intranets and I’m looking forward to speaking for the second time and joining a very august gathering of speakers. I’m also glad to be returning to both City University and the University of Essex during February and March to talk to students about working in search and information retrieval.

In April I’ll be returning to the US for OSC’s Haystack search relevance conference, which was my favourite event of last year – I liked it so much I brought it to London that October. This year we have a fantastic lineup of talks from speakers representing organisations including LexisNexis, Wikimedia Foundation, Eventbrite and Yelp, a new and more capacious venue in downtown Charlottesville, three training options before the main conference (Think Like A Relevance Engineer for Elasticsearch and Solr, and Learning to Rank) and of course the chance to meet, chat with and get to know some of the best search people in the business. Earlybird tickets are available until the end of February and are already selling well, so make your plans to join us soon!

It’s already shaping up to be a busy year – so do keep an eye on this blog and my new home at www.opensourceconnections.com/blog for further news, and if you’d like to know how OSC can help you empower your search team get in touch.

The post Little Mermaids, Haystacks and moving on appeared first on Flax.

Flax joins OpenSource Connections

Charlie Hull — Fri, 21 Dec 2018 12:09:24 +0000

We have some news!

From February 1st 2019 Flax’s Managing Director Charlie Hull will be joining OpenSource Connections (OSC), Flax’s long-standing US partner, as a senior Managing Consultant. Charlie will manage a new UK division of OSC who will also acquire some of Flax’s assets and brands. OSC are a highly regarded organisation in the world of search and relevance, wrote the seminal book Relevant Search and run the popular Haystack relevance conference. Their clients include the US Patent Office, the Wikimedia Foundation and Under Armour and their services include comprehensive training, Discovery engagements, Trusted Advisor consulting and expert implementation.

Lemur Consulting Ltd., which as most of you will know trades as Flax, will continue to operate and to complete current projects but will not be taking on any new business after January 2019. For any new business we will be forwarding all future Flax enquiries to OSC where Charlie will as ever be very happy to discuss requirements and how OSC’s expert team (which may include some familiar faces!) might help.

We are all very excited about this new development as it will create a larger team of independent search & relevance experts with a global reach. We fully expect to build on Flax’s 17 year history of providing high quality search solutions as part of OSC. We intend to continue managing the London Lucene/Solr Meetup and running, attending and speaking at other events on search related topics.

If you have any questions about the above please do contact us. Merry Christmas and best wishes for the New Year!

The post Flax joins OpenSource Connections appeared first on Flax.

More needles, more Haystacks, more relevance!

Charlie Hull — Wed, 05 Dec 2018 11:28:31 +0000

Those of us who have been working in the search sector for a while know that search tuning isn’t just a matter of installing the default configuration, pointing the engine at some content and starting it up – in fact, if you do just that you’ll probably end up with a search user experience that’s even worse than whatever you’re replacing and certainly a lot worse than your competitors’ solution. It’s also no longer about just knowing how one engine behaves and the magic tweaks to improve it – you need to understand the fundamentals of search and how a range of different products and projects implement them. You also need to understand user requirements and their often entirely subjective views of what is a ‘good’ and ‘bad’ search result, plus how different types of businesses can use search technology for site search, enterprise search, media monitoring, process improvement and a myriad of other uses.

Over the last year or so we’ve seen the emergence of a new profession dedicated to improving how search systems present information to users – Relevance Engineering. Importantly this covers not just the technical aspects of search, but the business aspects – understanding the why as much as the how. Relevance engineers understand that search tuning is a multifaceted problem and there are no magic bullets (or magic AI robots) that will do all the work for you. I’ve started to write about relevance engineering recently to try and define what it means.

One of my favourite events last year was the first Haystack conference run by our partners Open Source Connections, which brought together both experienced relevance engineers and those new to the profession. It was friendly, informal, focused and informative. In fact, I enjoyed it so much that by the second day I was already thinking about how to bring the event to Europe – which we did successfully in October.

I’m very happy to say that Haystack is back in April 2019 and the Call for Papers is open until January 9th. If you’ve got an exciting relevance project or idea to talk about please do submit it. See you there!

The post More needles, more Haystacks, more relevance! appeared first on Flax.

Defining relevance engineering part 3: technical assessment

Charlie Hull — Wed, 11 Jul 2018 09:49:11 +0000

In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

When Flax is working with clients on relevance tuning engagements we aim to gain an overview of the various technology the client uses and how it is obtained, deployed, managed and maintained. This will include not just the search engine but the various systems that supply data to it, host it, monitor it and interface to it to pass results to users. In addition we must understand who is responsible for the various areas, be it in-house staff, consultants, outsourcing or third party suppliers.

We try to answer the following questions in detail, including who supplies, modifies, maintains and supports the various systems concerned, what versions are used and where and how they are hosted and configured. We hope for full access to inspect the systems but this is not always possible – at the least, we need copies of configuration files and settings.

What systems supply the source data for search?
What is the current search technology?
Is the search engine part of another system (such as a content management system or product information system)?
What interface is there between the systems that supply source data and the search engine?
What systems monitor and manage the search engine?
What systems are used to submit queries to the search engine?
What query logging is performed and at what level?
How are development, test, staging and production systems arranged and what access is available to these?
What are the processes used to deploy new software and configuration?
What testing is performed?

It’s common to find flaws in the overall technical landscape – as an example, we’ll often find that there is no effective source control of search engine configuration files, with these having been originally derived from an example setup not intended for production use and since modified ad-hoc as issues arose. In this case it’s quite common that no-one knows why a particular setting has been used!

Without a good overall idea of the technology landscape it will be hard if not impossible to improve relevance. External processes (such as how hard it is to obtain a recent and complete log file from a production system) will also impact how effective these improvements will be.

Finally, as search is often owned by the IT department (and by the time we arrive, search is usually viewed as ‘broken’) we sometimes find a ‘bunker mentality’ – those responsible for the implementation are hunkered down and used to being harried and complained at by others who are unhappy with how search is (not) working. It’s important to communicate that only by being open and honest about the current situation can we all work together to improve things and build better search.

In the next post I’ll cover the tools a relevance engineer can use. In the meantime you can read the free Search Insights 2018 report by the Search Network. Of course, feel free to contact us if you need help with relevance engineering.

The post Defining relevance engineering part 3: technical assessment appeared first on Flax.

Lucene Solr London: Search Quality Testing and Search Procurement

Charlie Hull — Fri, 29 Jun 2018 11:09:34 +0000

Mimecast were our kind hosts for the latest London Lucene/Solr Meetup (and even provided goodie bags). It’s worth repeating that we couldn’t run these events without the help of sponsors and hosts and we’re always very grateful (and keep those offers coming!).

First up was Andrea Gazzarini presenting a brand new framework for search quality testing. Designed for offline measurement, Rated Ranking Evaluator is an open source Java library (although it can be used from other languages). It uses a hierarchical model to arrange queries into query groups (all queries in a query group should be producing the same results). Each test can run across a number of search engine configuration versions and outputs results in JSON format – but these can also be translated into Excel spreadsheets, PDFs or sent to a server that provides a live console showing how search quality is affected by a search engine configuration change. Although aimed at Elasticsearch and Solr, the platform is extensible to any underlying search engine. This is a very useful tool for search developers and joins Quepid and Searchhub’s recently released search analytics acquisition library in the ‘toolbox’ for relevance engineers. You can see Andrea’s slides here.
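To make the idea of query groups and configuration versions a little more concrete, here is a minimal Python sketch of offline search quality measurement. It is not Rated Ranking Evaluator's API (RRE is a Java library with its own configuration format): the function names, the precision-at-k metric, the document ids and the two stand-in 'configurations' are all assumptions made up for illustration.

```python
from statistics import mean


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k result slots filled by a judged-relevant document."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def evaluate(query_groups, search_fn, k=3):
    """Average precision@k per query group for one search configuration."""
    report = {}
    for group_name, group in query_groups.items():
        relevant = set(group["relevant"])
        scores = [precision_at_k(search_fn(q), relevant, k) for q in group["queries"]]
        report[group_name] = mean(scores)
    return report


# Hypothetical judgments: every query in a group shares the same relevant documents.
QUERY_GROUPS = {
    "laptops": {
        "queries": ["laptop", "notebook computer"],
        "relevant": ["d1", "d2", "d3"],
    },
}


def config_v1(query):
    """Stand-in for querying a search engine running configuration version 1."""
    canned = {"laptop": ["d1", "d2", "d9"], "notebook computer": ["d9", "d8", "d1"]}
    return canned.get(query, [])


def config_v2(query):
    """Stand-in for the same engine after a (hypothetical) configuration change."""
    canned = {"laptop": ["d1", "d2", "d3"], "notebook computer": ["d2", "d1", "d9"]}
    return canned.get(query, [])


# One report per configuration version, so a change can be compared offline.
REPORT = {name: evaluate(QUERY_GROUPS, fn) for name, fn in [("v1", config_v1), ("v2", config_v2)]}
print(REPORT)
```

In a real setup the two stand-in functions would call the search engine, and the per-version reports would be stored so that a regression shows up before the configuration reaches production.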

Martin White spoke next on how open source search solutions fare in corporate procurements for enterprise search. This was an engaging talk from Martin, showing the scale of the opportunities for open source platforms with budgets of several million pounds being common for enterprise search projects. However, as he mentioned it can be very difficult for procurement departments to get information from vendors and ‘the last thing you’ll know about a piece of enterprise software is how much it will cost’. He detailed how open source solutions often compare badly against closed source commercial offerings due to it being hard to see the ‘edges’ – e.g. what custom development will be necessary to fulfil enterprise requirements. Although the opportunities are clear, it seems open source based solutions still have a way to go to compete. You can read more from Martin on this subject in the recent free Search Insights report.

Thanks to Mimecast and both speakers – we’ll be back after the summer with another Meetup!

The post Lucene Solr London: Search Quality Testing and Search Procurement appeared first on Flax.

Defining relevance engineering part 2: learning the business

Charlie Hull — Tue, 26 Jun 2018 11:16:57 +0000

In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

Before a relevance engineer can install or configure a search engine they need to understand the business concerned. I’ve called this ‘learning the business’ and it’s something that Flax has to do on a weekly basis. One week we may be talking to a recruitment business that thinks and operates in terms of jobs, skills, candidates and roles; the next week it could be a company that sells specialised products and is more concerned with features, prices, availability, stock levels and pack sizes. Even within a single sector, each business will work in a slightly different way, although there will be some common factors.

Example data is key to learning how a business works, but is next to useless without someone to explain it in context. In some cases the business has lost some of the internal knowledge about how their own systems work: “Jeff built that database, but he left two years ago.” What seems obvious to them may not be obvious to anyone else. Generic terms, e.g. “products”, “location”, “keywords”, can mean completely different things in each business context. If they exist, corporate glossaries, dictionaries or taxonomies are very useful, but again they may need annotating to explain what each entry means. If a glossary doesn’t exist, it’s a good first step to start one.

Finding the right people to talk to is also vital. Although relevance engineers are usually engaged or recruited by the IT department, this may not be the best place to learn about the business. The marketing department may have the best view of how the business interacts with its clients; the CEO or Managing Director will know the overall direction and objectives but may not have time for the detail; the content creators (which could be librarians, web editors or product information managers) will know about the items the search engine will need to find.

In many companies there are hierarchies and structures that sometimes actively prevent the sharing of information: it’s common to discover who blames who for past bad decisions and to be used as a sounding board by those with axes to grind. At Flax we try to make sure we talk to people at all levels in the client organisation: sometimes the most junior employees – and especially those who are customer-facing – have the most useful information as they have to deal with problems on a day-to-day basis. As external consultants one of our most useful skills is being able to listen without making sudden judgements or assumptions.

The end result of these many conversations is an understanding of where source data is created, gathered and stored; what a ‘search result’ is in the context of a particular business (a product on sale? A contract? A CV or resumé?) and how it might be constructed from this data; what a ‘relevant’ result is in this context (a more valuable product to sell? The most recent contract version? The best candidate for a job?) and how good/bad/nonexistent the current search solution is. This is vital information to be gathered before one even begins thinking about how to install, develop and/or configure and test a search solution.

In the next post I’ll cover how a relevance engineer might assess the technical capability of a business with respect to search. In the meantime you can read the free Search Insights 2018 report by the Search Network. Of course, feel free to contact us if you need help with relevance engineering.

The post Defining relevance engineering part 2: learning the business appeared first on Flax.

Defining relevance engineering, part 1: the background

Charlie Hull — Mon, 25 Jun 2018 10:40:12 +0000

Relevance Engineering is a relatively new concept but companies such as Flax and our partners Open Source Connections have been carrying out relevance engineering for many years. So what is a relevance engineer and what do they do? In this series of blog posts I’ll try to explain what I see as a new, emerging and important profession.

Let’s start by turning the clock back a few years. Ten or fifteen years ago search engines were usually closed source, mysterious black boxes, costing five or six-figure sums for even relatively modest installations (let’s say a couple of million documents – small by today’s standards). Huge amounts of custom code were necessary to integrate them with other systems and projects would take many months to demonstrate even basic search functionality. The trick was to get search working at all, even if the eventual results weren’t very relevant. Sadly even this was sometimes difficult to achieve.

Nowadays, search technology has become highly commoditized and many developers can build a functioning index of several million documents in a couple of days with off-the-shelf, open source, freely available software. Even the commercial search firms are using open source cores – after all, what’s the point of developing them from scratch? Relevance is often ‘good enough’ out of the box for non business-critical applications.

A relevance engineer is required when things get a little more complicated and/or when good search is absolutely critical to your business. If you’re trading online, search can be a major driver of revenue and getting it wrong could cost you millions. If you’re worried about complying with the GDPR, MiFID or other regulations then ‘good enough’ simply isn’t if you want to prevent legal issues. If you’re serious about saving the time and money your employees waste looking for information or improving your business’ ability to thrive in a changing world then you need to do search right.

So what search engine should you choose before you find a relevance engineer to help with it? I’m going to go out on a limb here and say it doesn’t actually matter that much. At Flax we’re proponents of open source engines such as Apache Lucene/Solr and Elasticsearch (which have much to recommend them) but the plain fact is that most search engines are the same under the hood. They all use the same basic principles of information retrieval; they all build indexes of some kind; they all have to analyze the source data and user queries in much the same way (ignore ‘cognitive search’ and other ‘AI’ buzzwords for now, most of this is marketing rather than actual substance). If you’re using Microsoft Sharepoint across your business we’re not going to waste your time trying to convince you to move wholesale to a Linux-based open source alternative.

Any modern search engine should allow you the flexibility to adjust how data is ingested, how it is indexed, how queries are processed and how ranking is done. These are the technical tools that the relevance engineer can use to improve search quality. However, relevance engineering is never simply a technical task – in fact, without a business justification, adjusting these levers may make things worse rather than better.

In the next post I’ll cover how a relevance engineer can engage with a business to discover the why of relevance tuning. In the meantime you can read Doug Turnbull’s chapter in the free Search Insights 2018 report by the Search Network (the rest of the report is also very useful) and you might also be interested in the ‘Think like a relevance engineer’ training he is running soon in the USA. Of course, feel free to contact us for details of similar UK or EU-based training or if you need help with relevance engineering.

The post Defining relevance engineering, part 1: the background appeared first on Flax.

Haystack, the search relevance conference – day 2

Charlie Hull — Mon, 23 Apr 2018 15:23:56 +0000

Two weeks ago I attended the Haystack relevance conference – I’ve already written about my overall impressions and on the first day’s talks but the following are some more notes on the conference sessions. Note that some of the presentations I attended have already been covered in detail by Sujit Pal’s excellent blog. Some of the presentations I haven’t linked to directly have now appeared on the conference website.

The second day of the event started for me with the enjoyable job of hosting a ‘fishbowl’ style panel session titled “No, You Don’t Want to Do It Like That! Stories from the search trenches”. The idea was that a rotating panel of speakers would tell us tales of their worst and hopefully most instructive search tuning experiences and we heard some great stories – this was by its nature an informal session and I don’t think anyone kept any notes (probably a good idea in the case of commercial sensitivity!).

The next talk was my favourite of the conference, given by René Kriegler on relevance scoring using product data and image recognition. René is an expert on e-commerce search (he also runs the MICES event in Berlin which I’m looking forward to) and described how this domain is unlike many others: the interests of the consumer (e.g. price or availability) become part of the relevance criteria. One of the interesting questions for e-commerce applications is how ranking can affect profit. Standard TF/IDF models don’t always work well for e-commerce data with short fields, leading to a score that can be almost binary: as he said ‘a laptop can’t be more laptop-ish than another’. Image recognition is a potentially useful technique and he demonstrated a way to take the output of Google’s Inception machine learning model and use it to enrich documents within a search index. However, there can be over 1000 vectors output from this model and he described how a technique called random projection trees can be used to partition the vector space and thus produce simpler data for adding to the index (I think this is basically like slicing up a fruitcake and recording whether a currant was one side of the knife or the other, but that may not be quite how it works!). René has built a Solr plugin to implement this technique.
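For anyone who wants to play with the fruitcake analogy, here is a rough Python sketch of a random projection tree. It is my own simplified illustration and not the implementation in René's Solr plugin: the function names, the median split rule, the tree depth and the randomly generated 'image' vectors are all assumptions. Each cut is a random direction through the vector space, and each vector is reduced to a short string of bits saying which side of each cut it fell on, which is a much simpler value to add to an index than the raw vector.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_tree(vectors, depth, rng):
    """Recursively split the vectors with random hyperplanes (the 'knife cuts')."""
    if depth == 0 or len(vectors) <= 1:
        return None
    dim = len(vectors[0])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    projections = sorted(dot(v, direction) for v in vectors)
    threshold = projections[len(projections) // 2]  # split at the median projection
    left = [v for v in vectors if dot(v, direction) < threshold]
    right = [v for v in vectors if dot(v, direction) >= threshold]
    return {
        "direction": direction,
        "threshold": threshold,
        "left": build_tree(left, depth - 1, rng),
        "right": build_tree(right, depth - 1, rng),
    }


def leaf_code(tree, vector):
    """Walk the tree and record, as one bit per cut, which side the vector fell on."""
    bits = []
    node = tree
    while node is not None:
        on_right = dot(vector, node["direction"]) >= node["threshold"]
        bits.append("1" if on_right else "0")
        node = node["right"] if on_right else node["left"]
    return "".join(bits)


# Hypothetical data: 50 random 16-dimensional vectors standing in for image features.
RNG = random.Random(42)
VECTORS = [[RNG.gauss(0.0, 1.0) for _ in range(16)] for _ in range(50)]
TREE = build_tree(VECTORS, depth=4, rng=RNG)
CODES = [leaf_code(TREE, v) for v in VECTORS]
print(CODES[:5])
```

Vectors that end up with the same code landed in the same slice of the space, so the code can be indexed as an ordinary token and matched like any other term.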

Next I went to Matt Overstreet’s talk on Vespa, a recently open sourced search and Big Data library from Oath (a part of Yahoo! Inc.). Matt described how Vespa could be used to build highly scalable personalised recommendation, search or realtime data display applications and took us through how Vespa is configured through a series of APIs and XML files. Interestingly (and perhaps unsurprisingly) Vespa has very little support for languages other than English at present. Queries are carried out through its own SQL-like language, YQL, and grouping and data aggregation functions are available. He also described how Vespa can use multidimensional arrays of values – tensors, for example from a neural network. Matt recommended we all try out Vespa – but on a cloud service not a low-powered laptop!

Ryan Pedala was up next to talk about named entity recognition (NER) and how it can be used to annotate or label data. He showed his experiments with tools including Prodigy and a custom GUI he had built, compared various NER libraries such as Stanford NLP and OpenNLP, and referenced an interesting paper on NER for travel-related queries. I didn’t learn a whole lot of new information from this talk but it may have been useful to those who haven’t considered using NER before.

Scott Stultz talked next on how to integrate business rules into a search application. He started with examples of key performance indicators (KPIs) that can be used for search – e.g. conversion ratios or average purchase values and how these should be tied to search metrics. They can then be measured both before and after changes are made to the search application: automated unit tests and more complex integration tests should also be used to check that search performance is actually improving. Interestingly for me he included within the umbrella of integration tests such techniques as testing the search with recent queries extracted from logs. He made some good practical points such as ‘think twice before adding complexity’ and that good autocomplete will often ‘cannibalize’ existing search as users simply choose the suggested completion rather than finishing typing the entire query. There were some great tips here for practical business-focused search improvements.

I then went to hear John Kane’s talk about interleaving for relevancy tuning which covered a method for updating a machine learning model in real-time using feedback from the current ranking powered by this model – simply by interleaving the results from two versions of this model. This isn’t a particularly new technique and the talk was somewhat of a product pitch for 904Labs, but the technique does apparently work and some customers have seen a 30% increase in conversion rate.
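As a rough illustration of interleaving, here is a Python sketch of team-draft interleaving, one well-known variant of the technique. I don't know which variant 904Labs actually use, so treat this as an assumption: the function names, document ids and clicks are invented for the example. The two model versions take turns (in random order) picking their best remaining result, and each click is then credited to whichever model contributed the clicked document.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng):
    """Merge two rankings; returns the merged list and which 'team' supplied each doc."""
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved = []
    team_of = {}
    while len(interleaved) < length:
        order = ["A", "B"]
        rng.shuffle(order)  # coin flip for who picks first in this round
        added = False
        for team in order:
            if len(interleaved) >= length:
                break
            # The team's highest-ranked document that has not been picked yet.
            pick = next((d for d in rankings[team] if d not in team_of), None)
            if pick is not None:
                team_of[pick] = team
                interleaved.append(pick)
                added = True
        if not added:  # both rankings are exhausted
            break
    return interleaved, team_of


def credit_clicks(team_of, clicked_ids):
    """Count clicks per team; the model with more credited clicks 'wins' the impression."""
    wins = {"A": 0, "B": 0}
    for doc_id in clicked_ids:
        if doc_id in team_of:
            wins[team_of[doc_id]] += 1
    return wins


# Hypothetical rankings from two versions of a ranking model for the same query.
RNG = random.Random(0)
MODEL_A = ["d1", "d2", "d3", "d4"]
MODEL_B = ["d3", "d1", "d5", "d6"]
INTERLEAVED, TEAM_OF = team_draft_interleave(MODEL_A, MODEL_B, 4, RNG)
WINS = credit_clicks(TEAM_OF, clicked_ids=INTERLEAVED[:1])
print(INTERLEAVED, WINS)
```

Aggregated over many queries, the click credits give a running comparison of the two model versions without splitting users into separate A/B groups.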

The last talk of the day came from Tim Allison on an evaluation platform for Apache Tika, a well-known library for text extraction from a variety of file formats. Interspersed with tales of ‘amusing’ and sometimes catastrophic ways for text extraction to fail, Tim described how tika-eval can be used to test how good Tika is at extracting data and output a set of metrics e.g. how many different MIME file types were found. The tool is now used to run regular regression tests for Tika on a dataset of 3 million files from the CommonCrawl project. We’re regular users of Tika at Flax and it was great to hear how the project is moving forward.

Doug Turnbull finished the conference with a brief summing up and thanks. There was a general feeling in the room that this conference was the start of something big and people were already asking when the next event would be! One of my takeaways from the event was that even though many of the talks used open source tools (perhaps unsurprisingly as it is so much easier to talk about these publicly) the relevance tuning techniques and methods described can be applied to any search engine. The attendees were from a huge variety of companies, large and small, open and closed source based. This was an event about relevance engineering, not technology choices.

Thanks to all at OSC who made the event possible and for inviting us all to your home town – I think most if not all of us would happily visit again.

The post Haystack, the search relevance conference – day 2 appeared first on Flax.

Search Insights 2018 – a free, independent report on search

Charlie Hull — Mon, 26 Mar 2018 08:43:15 +0000

Over the last 17 years of running Flax I’ve met many people who loudly profess to be experts in various aspects of the search business. Some have a new product or service to sell, that promises to change the game forever; quite often this turns out to be snake oil or simply a new name for an old solution. Others seem to have arrived suddenly, fully-fledged, enthusiastic to convince us old hands that everything will be different now if we all sign up to their new idea.

There’s also a small group of people who tend to be quieter about their expertise, perhaps because as independent practitioners or small business owners they’re not supported by the marketing budgets of large companies. These people survive on their reputation, which has been built steadily on a record of solid advice, honesty and neutrality. I’m now lucky enough to be part of this group – an informal network of experts in subjects as diverse as search for Sharepoint, intranet strategy and taxonomy management. Occasionally we collaborate on projects, often we recommend each other to our clients and it’s always hugely enjoyable to meet in person and discuss the latest trends and industry landscape. This informal network means Flax can offer more services to our clients – and if we can’t help, we probably know someone we trust who can.

So I’m very proud to announce that this group – the Search Network – are releasing a joint publication, Search Insights 2018. In this 70-page collection of essays you can learn how to research, procure, choose, budget, plan and run a search project in the best way for your business and your users.

Unlike some other industry reports, we’re not charging for this report, you won’t have to register or give us your email address, and it’s Creative Commons licensed so you can even redistribute it if you like (with attribution). There’s no sponsorship, no plotting of vendors on confusing trend diagrams, no marketing buzzwords or direct recommendations – after all, we’re independent. We welcome any feedback you have of course.

My personal thanks to Martin White who has led this effort and who has also written about the Network and the report.

The post Search Insights 2018 – a free, independent report on search appeared first on Flax.

When even the commercial vendors are using it, has open source search won?

Charlie Hull — Thu, 15 Mar 2018 12:03:32 +0000

There have been some interesting announcements recently which may point to an increasing realisation amongst commercial search firms that an open source model is an essential advantage in today’s search market. Coveo have announced that their enterprise search engine can run on an Elasticsearch core, an interesting move for a previously decidedly closed source company. BA Insight, who have previously provided extensions and enhancements for Microsoft’s decidedly closed-source Sharepoint search facility, have been offering Elasticsearch as a core search engine for quite a while. It is also an open secret that some other commercial search firms (such as Attivio) use Apache Lucene as a core technology.

The commercial search firms will have noticed that Lucidworks (who employ a large proportion of Lucene/Solr committers) have announced Lucidworks Fusion 4, which can be used for site and enterprise search. Elastic, the company behind Elasticsearch, recently acquired Swiftype and have repositioned it as a packaged site search engine (with an enterprise search version in beta and rumoured to appear later this year). Both Lucidworks and Elastic are thus attempting to capture a larger segment of the search market, using their dominance and expertise in the open source world. Note however that all these products are ‘open core’ rather than ‘open source’ (despite Elastic’s attempts to pretend otherwise) – which is not very different from Coveo or BA Insight’s approach – so the distance between the traditionally separate ‘open source’ and ‘closed source’ search vendors is now closing.

The question for any search vendor should be whether there is any point developing and maintaining a closed source search engine core, when Lucene derivatives such as Solr and Elasticsearch are so well established. The race between closed and open source is perhaps over.

Here at Flax we’ve been building open source search engines since 2001 and we’re independent of any vendor – so if you need help with your search project, do let us know.

Note: Enterprise Search is usually defined as a search engine working behind a corporate firewall, indexing different content sources such as flat files, databases and intranets. Site Search is usually visible to non-employees and only indexes websites. However, when site search includes an intranet the boundary becomes a little fuzzy – is this lightweight enterprise search? In most cases this doesn’t hugely matter – the underlying search engine core will be the same, it’s simply a difference in where source data comes from and how it is presented to users. However, these two options are often presented as different products by vendors.

UPDATE: A few days after I posted this blog, commercial vendor Attivio released SUIT, an open source user interface library that can run on their own engine, Elasticsearch or Solr. It seems the trend continues.

The post When even the commercial vendors are using it, has open source search won? appeared first on Flax.