Lucene Solr London: Search Quality Testing and Search Procurement
http://www.flax.co.uk/blog/2018/06/29/lucene-solr-london-search-quality-testing-and-search-procurement/
Fri, 29 Jun 2018

Mimecast were our kind hosts for the latest London Lucene/Solr Meetup (and even provided goodie bags). It’s worth repeating that we couldn’t run these events without the help of sponsors and hosts and we’re always very grateful (and keep those offers coming!).

First up was Andrea Gazzarini presenting a brand new framework for search quality testing. Designed for offline measurement, Rated Ranking Evaluator is an open source Java library (although it can be used from other languages). It uses a hierarchical model to arrange queries into query groups (all queries in a query group are expected to produce the same results). Each test can run across a number of search engine configuration versions and outputs results in JSON format – but these can also be translated into Excel spreadsheets, PDFs or sent to a server that provides a live console showing how search quality is affected by a search engine configuration change. Although aimed at Elasticsearch and Solr, the platform is extensible to any underlying search engine. This is a very useful tool for search developers and joins Quepid and Searchhub’s recently released search analytics acquisition library in the ‘toolbox’ for relevance engineers. You can see Andrea’s slides here.
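
To make the hierarchical model concrete, here is a purely illustrative sketch in Python – this is not RRE’s actual ratings file format, just an illustration of queries nested inside query groups with shared relevance judgements:

```python
# Illustrative only -- not RRE's real ratings format. Queries sit inside
# query groups; every query in a group is expected to produce the same
# results, so they can share one set of relevance judgements.
ratings = {
    "index": "products",
    "query_groups": [
        {
            "name": "laptop synonyms",
            # All of these queries should return the same top results.
            "queries": ["laptop", "notebook", "portable computer"],
            "judgements": {
                "doc_123": 3,  # 3 = highly relevant
                "doc_456": 2,
                "doc_789": 1,
            },
        }
    ],
}
```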

Martin White spoke next on how open source search solutions fare in corporate procurements for enterprise search. This was an engaging talk from Martin, showing the scale of the opportunities for open source platforms, with budgets of several million pounds being common for enterprise search projects. However, as he mentioned, it can be very difficult for procurement departments to get information from vendors and ‘the last thing you’ll know about a piece of enterprise software is how much it will cost’. He detailed how open source solutions often compare badly against closed source commercial offerings due to it being hard to see the ‘edges’ – e.g. what custom development will be necessary to fulfil enterprise requirements. Although the opportunities are clear, it seems open source based solutions still have a way to go to compete. You can read more from Martin on this subject in the recent free Search Insights report.

Thanks to Mimecast and both speakers – we’ll be back after the summer with another Meetup!

Haystack, the search relevance conference – day 2
http://www.flax.co.uk/blog/2018/04/23/haystack-the-search-relevance-conference-day-2/
Mon, 23 Apr 2018

Two weeks ago I attended the Haystack relevance conference – I’ve already written about my overall impressions and on the first day’s talks but the following are some more notes on the conference sessions. Note that some of the presentations I attended have already been covered in detail by Sujit Pal’s excellent blog. Some of the presentations I haven’t linked to directly have now appeared on the conference website.

The second day of the event started for me with the enjoyable job of hosting a ‘fishbowl’ style panel session titled “No, You Don’t Want to Do It Like That! Stories from the search trenches”. The idea was that a rotating panel of speakers would tell us tales of their worst and hopefully most instructive search tuning experiences, and we heard some great stories – this was by its nature an informal session and I don’t think anyone kept any notes (probably wise given the commercial sensitivity of some of the tales!).

The next talk was my favourite of the conference, given by René Kriegler on relevance scoring using product data and image recognition. René is an expert on e-commerce search (he also runs the MICES event in Berlin, which I’m looking forward to) and described how this domain is unlike many others: the interests of the consumer (e.g. price or availability) become part of the relevance criteria. One of the interesting questions for e-commerce applications is how ranking can affect profit. Standard TF/IDF models don’t always work well for e-commerce data with short fields, leading to a score that can be almost binary: as he said, ‘a laptop can’t be more laptop-ish than another’. Image recognition is a potentially useful technique and he demonstrated a way to take the output of Google’s Inception machine learning model and use it to enrich documents within a search index. However, the model’s output can run to over 1,000 dimensions, and he described how a technique called random projection trees can be used to partition the vector space and thus produce simpler data for adding to the index (I think this is basically like slicing up a fruitcake and recording whether a currant was on one side of the knife or the other, but that may not be quite how it works!). René has built a Solr plugin to implement this technique.
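
As a rough sketch of that ‘fruitcake knife’ idea, here is the simplest form of random projection in Python – flat random hyperplane hashing rather than the full tree construction René described, with an invented 2048-dimensional embedding standing in for the Inception output:

```python
import numpy as np

rng = np.random.default_rng(42)

# A made-up 2048-dimensional image embedding for one document.
embedding = rng.normal(size=2048)

# Draw k random hyperplanes (the 'knife cuts') and record which side of
# each the vector falls on. This collapses a large dense vector into a
# k-bit code that is cheap to store in a search index; similar vectors
# tend to produce similar codes.
k = 16
hyperplanes = rng.normal(size=(k, 2048))
bits = (hyperplanes @ embedding) > 0
code = "".join("1" if b else "0" for b in bits)
print(code)  # e.g. '1010...' -- could be indexed as a simple token
```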

Next I went to Matt Overstreet’s talk on Vespa, a recently open-sourced search and big data platform from Oath (the Verizon subsidiary that now includes Yahoo!). Matt described how Vespa could be used to build highly scalable personalised recommendation, search or real-time data display applications, and took us through how Vespa is configured through a series of APIs and XML files. Interestingly (and perhaps unsurprisingly) Vespa has very little support for languages other than English at present. Queries are carried out through its own SQL-like language, YQL, and grouping and data aggregation functions are available. He also described how Vespa can use multidimensional arrays of values – tensors – for example from a neural network. Matt recommended we all try out Vespa – but on a cloud service, not a low-powered laptop!
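
For a flavour of YQL, here is a sketch of querying a local Vespa instance from Python – the endpoint, port and ‘default’ fieldset are assumptions taken from Vespa’s quick-start defaults, not anything from the talk:

```python
import requests

# Assumes a Vespa container on localhost:8080 with a schema whose
# 'default' fieldset covers the text fields -- quick-start defaults.
yql = 'select * from sources * where default contains "laptop";'
response = requests.get("http://localhost:8080/search/",
                        params={"yql": yql, "hits": 10})
for hit in response.json().get("root", {}).get("children", []):
    print(hit.get("relevance"), hit.get("fields", {}).get("documentid"))
```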

Ryan Pedala was up next to talk about named entity recognition (NER) and how it can be used to annotate or label data. He showed his experiments with tools including Prodigy and a custom GUI he had built, compared various NER libraries such as Stanford NLP and OpenNLP, and referenced an interesting paper on NER for travel-related queries. I didn’t learn a whole lot of new information from this talk but it may have been useful to those who haven’t considered using NER before.
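
For anyone who hasn’t tried NER, here is a minimal example using spaCy – my choice for brevity, not one of the libraries compared in the talk – labelling a travel-style query of the kind the cited paper discusses:

```python
import spacy

# Setup: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("cheap flights from London to New York in December")
for ent in doc.ents:
    # Prints each entity with its label, e.g. London GPE, December DATE
    print(ent.text, ent.label_)
```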

Scott Stultz talked next on how to integrate business rules into a search application. He started with examples of key performance indicators (KPIs) that can be used for search – e.g. conversion ratios or average purchase values – and how these should be tied to search metrics. They can then be measured both before and after changes are made to the search application: automated unit tests and more complex integration tests should also be used to check that search performance is actually improving. Interestingly for me, he included under the umbrella of integration tests techniques such as replaying recent queries extracted from logs against the search engine. He made some good practical points, such as ‘think twice before adding complexity’, and noted that good autocomplete will often ‘cannibalize’ existing search as users simply choose the suggested completion rather than finishing typing the entire query. There were some great tips here for practical business-focused search improvements.
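
As a sketch of the kind of automated test Scott described, here is a hypothetical pytest-style relevance regression test – the search_client API and the SKU are invented for illustration:

```python
# Hypothetical example: replay an important query (perhaps mined from
# recent logs) and assert a business rule about the results.
def test_flagship_product_ranks_high(search_client):
    results = search_client.search("wireless headphones", rows=10)
    top_ids = [r["id"] for r in results]
    # Business rule: the current flagship product must be in the top 3.
    assert "SKU-1234" in top_ids[:3], f"flagship missing: {top_ids[:3]}"
```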

I then went to hear John Kane’s talk about interleaving for relevancy tuning, which covered a method for updating a machine learning model in real time by interleaving the results from two versions of the model and using clicks on the combined list as feedback. This isn’t a particularly new technique and the talk was something of a product pitch for 904Labs, but the technique does apparently work, and some customers have seen a 30% increase in conversion rate.
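
For reference, here is a minimal sketch of team-draft interleaving, one common variant of the technique (not necessarily the exact method 904Labs uses):

```python
import random

def team_draft_interleave(results_a, results_b):
    """Interleave two rankings; clicks on the combined list are credited
    to whichever model contributed each document."""
    interleaved, credit = [], {}
    a, b = list(results_a), list(results_b)
    while a or b:
        # Each round, the two models pick in random order.
        for team, pool in random.sample([("A", a), ("B", b)], 2):
            while pool and pool[0] in credit:
                pool.pop(0)  # skip documents already chosen
            if pool:
                doc = pool.pop(0)
                interleaved.append(doc)
                credit[doc] = team
    return interleaved, credit

ranking, credit = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
# If users click mostly on documents credited to "B", promote model B.
```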

The last talk of the day came from Tim Allison on an evaluation platform for Apache Tika, a well-known library for text extraction from a variety of file formats. Interspersed with tales of ‘amusing’ and sometimes catastrophic ways for text extraction to fail, Tim described how tika-eval can be used to test how good Tika is at extracting data, outputting a set of metrics, e.g. how many different MIME file types were found. The tool is now used to run regular regression tests for Tika on a dataset of 3 million files from the CommonCrawl project. We’re regular users of Tika at Flax and it was great to hear how the project is moving forward.
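
As a toy illustration of one metric tika-eval reports – the number of distinct MIME types detected in a corpus – here is a Python sketch using the python-magic library (tika-eval itself is a Java tool and works differently):

```python
from collections import Counter
from pathlib import Path

import magic  # pip install python-magic

# Detect the MIME type of every file under corpus/ and tally the results.
counts = Counter(
    magic.from_file(str(path), mime=True)
    for path in Path("corpus/").rglob("*")
    if path.is_file()
)
print(f"{len(counts)} distinct MIME types found")
for mime, n in counts.most_common(5):
    print(mime, n)
```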

Doug Turnbull finished the conference with a brief summing up and thanks. There was a general feeling in the room that this conference was the start of something big and people were already asking when the next event would be! One of my takeaways from the event was that even though many of the talks used open source tools (perhaps unsurprisingly, as it is so much easier to talk about these publicly) the relevance tuning techniques and methods described can be applied to any search engine. The attendees were from a huge variety of companies, large and small, open and closed source based. This was an event about relevance engineering, not technology choices.

Thanks to all at OSC who made the event possible and for inviting us all to your home town – I think most if not all of us would happily visit again.

No, Elastic X-Pack is not going to be open source – according to Elastic themselves
http://www.flax.co.uk/blog/2018/03/02/no-elastic-x-pack-not-going-open-source-according-elastic/
Fri, 02 Mar 2018

Elastic are the company founded by the creator of Elasticsearch, Shay Banon. At this time of year they have their annual Elasticon conference in San Francisco and as you might expect a lot of announcements are made during the week of the conference. The major ones to appear this time are that Swiftype, which Elastic acquired last year, has reappeared as Elastic Site Search and that Elastic are opening the code for their commercial X-Pack features.

Shay Banon is always keen to relate how Elasticsearch started as open source and will remain true to that heritage, which is always encouraging to hear. However it’s unfortunate to note that the announcement has been reported by many as ‘X-Pack is now open source’ – and the truth is a little more complicated than that.

Firstly, let’s look at the Elasticsearch core code itself. Yes, this is open source under the Apache 2 license, so you can download it, modify it, fork it, even incorporate it into your own products if you like. However most people would like to keep up with the latest and greatest developments so they’ll want to stick with the ‘official’ stream of updates, and what goes into this is entirely up to Elastic employees as they are the only ones allowed to commit to the codebase. Some measure of control of an open source project is essential of course, but this is certainly not ‘open development’ even though it is ‘open source’. Compare this to Apache Lucene/Solr, where those that are allowed to commit code to the official releases are from a wide variety of organisations (and elected as committers by merit, by a group of other longstanding committers). This distinction is important but makes little difference to most adopters.

Elastic have also for some years produced commercial, closed-source software in addition to Elasticsearch – which they call the X-Pack. To use this code you have to license it, although for some of the features the license is free. The announcement this week is that the source code for the X-Pack will be open and available to read under an Elastic license (which hasn’t yet been made available). As Doug Turnbull of our partner company Open Source Connections writes, “Be careful: The ‘open source’ Elastic XPack is very different than what most think of as ‘open source’”. To run some of these features in production – even with the source code in hand – you will still need to pay Elastic for a license. If you spot a problem in the source code and submit a patch, you may still end up paying Elastic for the privilege of running it. This is an ‘open core’ model, where the further you move away from the core, the less open and free things become – and as Shay writes, this is a key part of their business model.

The final word on this comes from Elastic’s own FAQ on the X-Pack: “Open Source licensing maintains a strict definition from the Open Source Initiative (OSI). As of 6.3, the X-Pack code will be opened under an Elastic EULA. However, it will not be ‘Open Source’ as it will not be covered by an OSI approved license.” It’s a shame that this hasn’t been accurately reported.

If you are considering open source search software for your project, contact us for independent and honest advice. We’ve been building open source search applications since 2001.

Inspiring students to work in Open Source Search
http://www.flax.co.uk/blog/2018/01/31/inspiring-students-work-open-source-search/
Wed, 31 Jan 2018

I’ve recently been asked to join the Industrial Advisory Board for the School of Computer Science and Electronic Engineering at the University of Essex and will be talking to students there on Monday 5th February, repeating a similar talk I did last year. The subject is ‘Working in Open Source Search’ and I’ll describe how we founded Flax back in 2001, how we’ve built, tuned and implemented open source search engines and some of the client projects we’ve worked on. It’s been a fascinating journey.

My main motivation for talking at Essex (and at City University a couple of weeks later) will be to inspire students to consider working in the world of open source software and more specifically the commercial applications of what academics call information retrieval – search engines. It’s an interesting field to work in – we have clients in a huge variety of sectors including e-commerce, law, publishing and government; we deal with both small startups and multinational businesses and help build systems indexing a few thousand to several billion items. It’s constantly changing as new requirements, ideas and innovations appear. It’s taken our staff around the world (Singapore, Malaysia, the USA, Denmark as a small sample from the last couple of years) and led to us gaining a global reputation and becoming part of a select group of independent search specialists. From being somewhat of a curiosity when we started, open source search engines have now gained huge acceptance and have changed the search market beyond recognition – no longer can vendors charge six or seven figures for mysterious black boxes (and more to make them actually do something useful).

However our sector needs more people – not just developers, but business-focused search managers who understand how to build search engines that truly deliver value to employees and customers. As I’ll say to the students next week there’s a skill shortage, plainly illustrated by the plaintive slide that ends nearly all search conference and Meetup presentations – “We’re Hiring!”. Time to learn to code, download Lucene/Solr or Elasticsearch, try out the examples, read our book and look forward to a great career in search!

It’s not just about technology – training for search managers is vital
http://www.flax.co.uk/blog/2017/11/07/not-just-technology-training-search-managers-vital/
Tue, 07 Nov 2017

A few weeks ago I sat in on a workshop in London at the Taxonomy Boot Camp conference, run by Jeff Fried of BA Insight. I’ve known Jeff for many years from various events and we share some views on how search systems should be built and managed – using best-of-breed technology and effective management processes.

He was kind enough to ask me to join a recent podcast, where we had a great conversation about open source search, enterprise search, our recent book and whether Cognitive Search actually exists. Listen to the podcast here.

There is a small group of us working in the search business who believe that a major obstacle to the success of search projects is the lack of guidance, support and training for search managers. It’s not just about technology – bear in mind that no matter what their marketing tells you, most search engines are basically the same – but how you apply that technology to business problems. The career path to becoming a search manager is unclear and rocky, with no formal training available, no professional association to join and little peer recognition. Since few at executive level understand much about search technology, and many are swayed by the latest marketing buzzwords, it can be a thankless task. I suspect many search managers have to explain the same things again and again: why you can’t just ‘make it work like Google’, why promises of Artificial Intelligence and Machine Learning by the big search vendors won’t help if your content and metadata are still a mess, and why all the problems with the current search implementation won’t be fixed overnight simply by buying or building a new search engine.

Jeff’s workshop was only scheduled for three hours and we quickly realised there was so much to discuss that the agenda had to be curtailed. He’s running a similar workshop this week in Washington DC at the Enterprise Search & Discovery conference, and I’m looking forward to hearing how it goes. Hopefully this will feed into the ongoing discussions amongst the professional community (comprising independent search experts from across the world with decades of experience, working in varying areas such as Sharepoint, open source, legacy search technology and intranet consultancy – we all think things have to change) around how we can better support search managers with effective training, qualifications, reports and other resources.

Watch this space – and in the meantime, if you need help with either the technical or management aspects of search, do get in touch.

How to build a search relevance team
http://www.flax.co.uk/blog/2017/09/11/build-search-relevance-team/
Mon, 11 Sep 2017

We’ve spent a lot of time working with clients who recognise that their search engine isn’t delivering relevant results to users. Often this is seen as solely a technical problem, which can be resolved simply by changing query parameters or the search engine configuration – but technical teams need clear direction on why a result should or should not appear at a certain position, not just requests for general relevance improvements.

It’s thus important to consider relevance as a business-wide issue, with multiple stakeholders providing input to the tuning process. We recommend the creation of a search relevance team – in a perfect world this should consist of dedicated staff, but even in the largest organisations this can be difficult to resource. It’s possible, however, to create a team that shares the responsibility of improving relevance, with members contributing as they can.

The team should be drawn from the following business areas. Note that in some organisations some of these roles will be shared.

  • Content – the content team create and manage the source data for the search engine, and are responsible for keeping this data clean and consistent with reliable metadata. They may process external data into a database or other repository as well as creating it from scratch. The best search engine in the world can’t give good results if the underlying data is unreliable, inconsistent or badly formatted.
  • Vendor – if the search engine is a commercial product, the vendor must provide sufficient documentation, training and support to the client to allow the engine to be tuned. If the engine is an open source project this information should be openly available and backed up by specialist consultancies who can provide training and technical support (such as Flax).
  • Development – the development team are responsible for integrating the search engine into the client’s systems, indexing the source data, maintaining the configuration, writing the search queries and adding new features. They will make any changes that will improve relevance.
  • Testing – the test team should create a process for test-driven relevance tuning using tools such as Quepid to gather relevance judgements from the business. The test cases themselves can be built up from a combination of query logs, known important query terms (e.g. new products, common industry terms, SEO terms) and those queries deemed most valuable to the business (see the sketch after this list).
  • Operations – this team is responsible for keeping the search engine running at best performance with appropriate server provision and monitoring, plus providing a failover capacity as required.
  • Sales & marketing, product owners – these teams should know why a particular result is more relevant than another to a customer or other user, by gathering online feedback, talking to users and knowing the current business goals. This team can thus help create the test cases discussed above.
  • Management – management support of the relevance tuning process is essential, to commit whatever resources are required to the technical implementation and test process and to lead the search relevance team.
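
As promised in the Testing bullet above, here is a minimal sketch of one way to seed relevance test cases from query logs – the log format (one query per line) is an assumption for illustration:

```python
from collections import Counter

# Tally queries from a plain-text log, one query per line (assumed format).
with open("query_log.txt", encoding="utf-8") as f:
    queries = Counter(line.strip().lower() for line in f if line.strip())

# The most frequent queries are strong candidates for judged test cases
# in a tool such as Quepid.
for query, count in queries.most_common(20):
    print(count, query)
```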

The search relevance team should meet on a regular basis to discuss how to build test cases for important search queries, examine the current position in terms of search relevance and set out objectives for improving relevance. The metrics chosen to measure progress should be available to all of the team.

Search relevance tuning should be seen as a shared responsibility, rather than simply a technical issue or something that can be easily resolved by building or buying a new search engine (a new, un-tuned search engine is unlikely to be as good as the current one). A well structured and resourced search relevance team can make huge strides towards improving search across the business – reducing the time users take to find information and improving responsiveness. For businesses that trade online, relevant search results are simply essential for retaining customers and maintaining a high level of conversion.

Flax regularly visit clients to discuss how to build an effective search team – do get in touch if we can help your business in this way.

A lack of cognition and some fresh FUD from Forrester
http://www.flax.co.uk/blog/2017/06/14/lack-cognition-fresh-fud-forrester/
Wed, 14 Jun 2017

Last night the estimable Martin White, intranet and enterprise search expert and author of many books on the subject, flagged up two surprising articles from Forrester who have declared that Cognitive Search (we’ll define this using their own terms in a little while) is ‘overshadowing’ the ‘outmoded’ Enterprise Search, with a final dig at how much better commercial options are compared to open source.

Let’s start with the definition, helpfully provided in another post from Forrester. Apparently ‘Cognitive search solutions are different because they: Scale to handle a multitude of data sources and types’. Every enterprise search engine promises to index a multiplicity of content both structured and unstructured, so I can’t see why this is anything new. Next we have ‘Employ artificial intelligence technologies….natural language processing (NLP) and machine learning’. Again, NLP has been a feature of closed and open source enterprise search systems for years, be it for entity extraction, sentiment analysis or sentence parsing. Machine learning is a rising star but not always easy to apply to search problems. However I’m not convinced either of these is really ‘artificial intelligence’. Astonishingly, the last point is that Cognitive solutions ‘Enable developers to build search applications…provide SDKs, APIs, and/or visual design tools’. Every search engine needs user applications on top and has APIs of some kind, so this makes little sense to me.

Returning to the first article, we hear that indexing is ‘old fashioned’ (try building a search application without indexing – I’d love to know how you’d manage that!) but luckily a group of closed-source search vendors have managed to ‘out-innovate’ the open source folks. We have the usual hackneyed ‘XX% of knowledge workers can’t find what they need’ phrases plus a sprinkling of ‘wouldn’t it be nice if everything worked like Siri or Amazon or Google’ (yes, it would, but comparing systems built on multi-billion-page Web indexes by Internet giants to enterprise search over at most a few million non-curated, non-hyperlinked business documents is just silly – these are entirely different sets of problems). Again, we have mentions of basic NLP techniques like they’re something new and amazing.

The article mentions a group of closed source vendors who appear in Forrester’s Wave report, which, like Gartner’s Magic Quadrant, attempts to boil down what is in reality a very complex field into some overly simplistic graphics. Finishing with a quick dig at two open source companies (Elastic, who don’t really sell an enterprise search engine anyway, and Lucidworks, whose Fusion 3 product really is a serious contender in this field, integrating Apache Spark for machine learning), it ignores the fact that open source search is developing at a furious rate – there are machine learning features that actually work in practice being built and used by companies such as Bloomberg, and because they’re open source, these are available for anyone else to use.

To be honest, it’s very difficult, if not impossible, to out-innovate thousands of developers across the world working in a collaborative manner. What we see in articles like the above is not analysis but marketing – a promise that shiny magic AI robots will solve your search problems, even if you don’t have a clear specification, an effective search team, clean and up-to-date content and all the many other things that are necessary to make search work well (to research this further read Martin’s books or the one I’m co-authoring at present – out later this year!). One should also bear in mind that marketing has to be paid for – and I’m pretty sure that the various closed-source vendors now providing downloads of Forrester’s report (because of course, they’re mentioned positively in it) don’t get to do so for free.

UPDATE: Martin has written three blog posts in response to both Gartner and Forrester’s recent reports which I urge you (and them) to read if you really want to know how new (or not) Cognitive Search is.

Release 1.0 of Marple, a Lucene index detective
http://www.flax.co.uk/blog/2017/02/24/release-1-0-marple-lucene-index-detective/
Fri, 24 Feb 2017

Back in October at our London Lucene Hackday, Flax’s Alan Woodward started to write Marple, a new open source tool for inspecting Lucene indexes. Since then we have made nearly 240 commits to the Marple GitHub repository, and are now happy to announce its first release.

Marple was envisaged as an alternative to Luke, a GUI tool for introspecting Lucene indexes. Luke is a powerful tool but its Java GUI has not aged well, and development is not as active as it once was. Whereas Luke uses Java widgets, Marple achieves platform independence by using the browser as the UI platform. It has been developed as two loosely-coupled components: a Java and Dropwizard web service with a REST/JSON API, and a UI implemented in React.js. This approach should make development simpler and faster, especially as there are (arguably) many more React experts around these days than native Java UI developers, and will also allow Marple’s index inspection functionality to be easily added to other applications.

Marple is, of course, named in honour of the famous fictional detective created by Agatha Christie.

What is Marple for? We have two broad use cases in mind: the first is as an aid for solving problems with Lucene indexes. With Marple, you can quickly examine fields, terms, doc values, etc. and check whether the index is being created as you expect, and that your search signals are valid. The other main area of use we imagine is as an educational tool. We have made an effort to make the API and UI designs reflect the underlying Lucene APIs and data structures as far as is practical. I have certainly learned a lot more about Lucene from developing Marple, and we hope that other people will benefit similarly.

The current release of Marple is not complete. It omits points (Lucene’s numeric field structures) entirely, and has only a simple UI for viewing documents (stored fields). However, there is reasonably complete handling of terms and doc values. We’ll continue to develop Marple but of course any contributions are welcome.

You can download this first release of Marple here together with a small Lucene index of Project Gutenberg to inspect. Details of how to run Marple (you’ll need Java) are available in the README. Do let us know what you think – bug reports or feature requests can be submitted via Github. We’ll also be demonstrating Marple in London on March 23rd 2017 at the next London Lucene/Solr Meetup.

Making sense of Big Data with open source search
http://www.flax.co.uk/blog/2016/11/11/making-sense-big-data-open-source-search/
Fri, 11 Nov 2016

[Embedded slide deck: ‘Making sense of big data’ by Charlie Hull]

A tale of two cities (and two Lucene Hackdays)
http://www.flax.co.uk/blog/2016/10/21/tale-two-cities-two-lucene-hackdays/
Fri, 21 Oct 2016

To mark Flax’s 15th anniversary we ran two Lucene Hackdays recently, in London and Boston. I even made some Flax cakes! The London event was attended by around 20 people from companies both large and small and kindly hosted by Bloomberg (who are currently very active in the Lucene/Solr community). We split up into a number of groups to work on a range of projects. Erica Sundberg from Blackrock took a group of beginners through installing Solr and indexing their first collection, while also considering how a minimal Solr example could be built (some of the shipped examples being rather complex). Another team led by Christine Poerschke of Bloomberg looked at a way to avoid slightly different statistics being returned from different Solr replicas (which can cause result ordering to appear to ‘jump’) and Diego Ceccarelli looked at adding BM25F ranking to Lucene. Other groups looked at SQL streaming with Solr (committer Joel Bernstein dialed in via Skype to help) and Flax’s Alan Woodward worked on Marple, a browser-based explorer for Lucene indexes. The day finished with a curry dinner kindly sponsored by Alfresco.

Several days later we ran a similar Hackday in Boston, as many Lucene people were in town for Lucene Revolution. Many more Lucene/Solr committers attended this time and enjoyed a chance to work on their own projects or to continue some of the work we’d started in London. Doug Turnbull came up with a way to do BM25F ranking with existing Lucene features, while Alexandre Rafalovitch and I had a long conversation about minimal Solr examples and improving the way beginners can start with Solr. Other projects included new field types for Lucene, improved highlighters and DocValues. BA Insight were kind enough to provide the venue and Lucidworks sponsored drinks and snacks later in the pub downstairs.

We’ve gathered notes on what we worked on with links to some of the software we developed here – please do get involved if you can! In particular the Marple project is attracting further contributions (and interest from those who developed and maintain the existing Luke Lucene index inspector).

I’d like to thank everyone who came to the Hackdays, our generous sponsors for providing venues, food and drink and to those who helped organise the events. The feedback has been excellent (and do let us know if you have any further comments) and people seem keen for this to be a regular event before the annual Lucene Revolution conference – a chance to work on Lucene-based projects outside of regular work, to meet, network and spend time with other contributors and to enjoy being part of a great open source community. We’ll be back!
