E-commerce – Flax: The Open Source Search Specialists

Catching MICES – a focus on e-commerce search (19 June 2018)

The second event I attended in Berlin last week was the Mix Camp on e-commerce search (MICES), a small and focused event now in its second year and kindly hosted by Mytoys at their offices. Slides for the talks are available here and I hope videos will appear soon.

The first talk was given by Karen Renshaw of Grainger, who Flax worked with at RS Components (she also wrote a great series of blog posts for us on improving relevancy). Karen’s talk drew on her long experience of managing search teams from a business standpoint – this wasn’t about technology but about combining processes, targets and objectives to improve search quality. She showed how to get started by examining customer feedback, known issues, competitors and benchmarks; how to understand and categorise query types; how to create a test plan within a cross-functional team; and how to plan for incremental change. Testing was covered, including how to score search quality and how to examine the impact of search changes, with the message that “all aspects of search should work together to help customers through their journey”. She concluded with the clear point that there are no silver bullets, and that expectations must be managed during an ongoing, iterative process of improvement. This was a talk to set the scene for the day, containing lessons for every search manager (and a good few search technologists who often ignore the business factors!).

Next up were Christine Bellstedt & Jens Kürsten from Otto, Germany’s second biggest online retailer with over 850,000 search queries a day. Their talk focused on bringing together the users and business perspective to create a search quality testing cycle. They quoted Peter Freis’ graphic from his excellent talk at Haystack to illustrate how they created an offline system for experimentation with new ranking methods based on linear combinations of relevance scores from Solr, business performance indicators and product availability. They described how they learnt how hard it can be to select ranking features, create test query sets with suitable coverage and select appropriate metrics to measure. They also talked about how the experimentation cycle can be used to select ‘challengers’ to the current ‘champion’ ranking method, which can then be A/B tested online.

Pavel Penchev of SearchHub was next and presented their new search event collector library – a Javascript SDK which can be used to collect all kinds of metrics around user behaviour and submit them directly to a storage or analytics system (which could even be a search engine itself – e.g. Elasticsearch/Kibana). This is a very welcome development – only a couple of months ago at Haystack I heard several people bemoaning the lack of open source tools for collecting search analytics. We’ll certainly be trying out this open source library.

Andreas Brückner of e-commerce search vendor Fredhopper talked about the best way to optimise search quality in a business context. His ten headings included “build a dedicated search team” – although 14% of Fredhopper’s own customers have no dedicated search staff – “build a measurement framework” – how else can you see how revenue might be improved? – and “start with user needs, not features”. Much to agree with in this talk from someone with long experience of the sector from a vendor viewpoint.

Johannes Peter of MediaMarktSaturn described an implementation of a ‘semantic’ search platform which attempts to understand queries such as ‘MyMobile 7 without contract’, recognising this is a combination of a product name, a Boolean operator and an attribute. He described how an ontology (perhaps showing a family of available products and their variants) can be used in combination with various rules to create a more focused query, e.g. title:("MyMobile7") AND NOT (flag:contract). He also mentioned machine learning and term co-occurrence as useful methods but stressed that these experimental techniques should be treated with caution and one should ‘fail early’ if they are not producing useful results.
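As a toy sketch of the idea (the product names, fields and rules below are invented for illustration and are not MediaMarktSaturn’s implementation), a rule-based rewrite might look like this:

# Map known product names and 'without <attribute>' phrases onto Lucene clauses
PRODUCTS = {'mymobile 7': 'MyMobile7'}
ATTRIBUTES = {'contract': 'flag:contract'}

def rewrite(query):
    q = query.lower()
    clauses = []
    for name, title in PRODUCTS.items():
        if name in q:
            clauses.append('title:("%s")' % title)
            q = q.replace(name, '')
    for attr, field in ATTRIBUTES.items():
        if 'without ' + attr in q:
            clauses.append('NOT (%s)' % field)
    return ' AND '.join(clauses) or query

print(rewrite('MyMobile 7 without contract'))
# -> title:("MyMobile7") AND NOT (flag:contract)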

Ashraf Aaref & Felipe Besson described their journey using Learning to Rank to improve search at GetYourGuide, a marketplace for activities (e.g. tours and holidays). Using Elasticsearch and the LtR plugin recently released by our partners OpenSourceConnections they tried to improve the results for their ‘location pages’ (e.g. for Paris) but their first iteration actually gave worse results than the current system and was thus rejected by their QA process. They hope to repeat the process using what they have learned about how difficult it is to create good judgement data. This isn’t the first talk I’ve seen that honestly admits that ML approaches to improving search aren’t a magic silver bullet and the work itself is difficult and requires significant investment.

Duncan Blythe of Zalando gave what was the most forward-looking talk of the event, showing a pure Deep Learning approach to matching search queries to results – no query parsing, language analysis, ranking or anything, just a system that tries to learn what queries match which results for a product search. This reminded me of Doug & Tommaso’s talk at Buzzwords a couple of days before, using neural networks to learn the journey between query and document. Duncan did admit that this technique is computationally expensive and in no way ready for production, but it was exciting to hear about such cutting-edge (and well funded) research.

Doug Turnbull was the last speaker with a call to arms for more open source tooling, datasets and relevance judgements to be made available so we can all build better search technology. He gave a similar talk to keynote the Haystack event two months ago and you won’t be surprised to hear that I completely agree with his viewpoint – we all benefit from sharing information.

Unfortunately I had to leave MICES at this point and missed the more informal ‘bar camp’ event to follow, but I would like to thank all the hosts and organisers especially René Kriegler for such an interesting day. There seems to be a great community forming around e-commerce search which is highly encouraging – after all, this is one of the few sectors where one can draw a clear line between improving relevance and delivering more revenue.

Boosts Considered Harmful – adventures with badly configured search (19 August 2016)

During a recent client visit we encountered a common problem in search – over-application of ‘boosts’, which can be used to weight the influence of matches in one particular field. For example, you might sensibly use this to make results that match a query on their title field come higher in search results. However in this case we saw huge boost values used (numbers in the hundreds) which were probably swamping everything else – and it wasn’t at all clear where the values had come from, be it experimentation or simply wild guesses. As you might expect, the search engine wasn’t performing well.
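To illustrate (the field names and numbers here are invented, not taken from the client’s configuration), compare these two edismax field weightings:

# Boosts in the hundreds: a single title match swamps every other signal
qf=title^400 description^150 brand^100

# Modest boosts: a title match helps, but other factors still contribute
qf=title^3 description brand^0.5

With the first configuration, almost nothing a document does elsewhere can overcome a title match; with the second, the underlying relevance scoring still matters.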

A problem with Solr, Elasticsearch and other search engines alike is that so many factors can affect the ordering of results – the underlying relevance algorithms, how source data is processed before it is indexed, how queries are parsed, boosts, sorting, fuzzy search, wildcards… it’s very easy to end up with a confusing picture and configuration files full of conflicting settings. Often these settings are left over from example files or previous configurations or experiments, without any real idea of why they were used. There are so many dials to adjust and switches to flick, many of which are unnecessary. The problem is compounded by embedding the search engine within another system (e.g. a content management platform or e-commerce engine), so it can be hard to see which control panel or file controls the configuration. Generally, this embedding has not been done by those with deep experience of search engines, so the defaults chosen are often wrong.

The balance of relevance versus recency is another setting which is often difficult to get right. At a news site we were asked to bias the order of results heavily in favour of recency (as the saying goes, yesterday’s newspaper is today’s chip wrapper) – the result being, as we had warned, that whatever the query, today’s news would appear highest – even if it wasn’t relevant! Luckily, by working with the client we managed to achieve a sensible balance before the site was launched.
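A gentler alternative to a hard recency sort in Solr is a reciprocal date-decay boost, for example as an edismax bf parameter – the field name and constants below are illustrative (recip(x,m,a,b) computes a/(m*x+b), so with these values a document published today scores about 1.0 and a year-old document about 0.5):

bf=recip(ms(NOW/HOUR,published_date),3.16e-11,1,1)

This lets recency nudge the ordering without letting it override relevance entirely.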

Our approach is to strip back the configuration to a very basic one and to build on this, but only with good reason. Take out all the boosts and clever features and see how good the results are with the underlying algorithms (which have been developed based on decades of academic research – so don’t just break them with over-boosting). Create a process of test-based relevancy tuning where you can clearly relate a configuration setting to improving the result of a defined test. Be clear about which part of your system influences a setting and whose responsibility it is to change it, and record the changes in source control.

Boosts are a powerful tool – when used correctly – but you should start by turning them off, as they may well be doing more harm than good. Let us know if you’d like us to help tune your search!

Measuring search relevance scores (19 April 2016)

A series of blogs by Karen Renshaw on improving site search:

  1. How to get started on improving Site Search Relevancy
  2. A suggested approach to running a Site Search Tuning Workshop
  3. Auditing your site search performance
  4. Developing ongoing search tuning processes
  5. Measuring search relevance scores


In my last blog I talked about creating a framework for measuring search relevancy scores. In this blog I’ll show how this measurement can be done with a new tool, Quepid.

As I discussed, it’s necessary to record scores assigned to each search result based on how well that result answers the original query. Having this framework in place ensures you avoid the ‘see-saw’ effect of fixing one query but breaking many others further down the chain.
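As a rough illustration of the kind of calculation a scorer performs (a generic sketch, not Quepid’s own implementation), per-result judgements can be rolled up into a single score per query:

import math

def dcg_at_10(judgements):
    """judgements: graded relevance scores (e.g. 0-3) for the top results, in rank order."""
    return sum(score / math.log2(rank + 2) for rank, score in enumerate(judgements[:10]))

# A configuration change is a regression for this query if the score drops
before = dcg_at_10([3, 2, 0, 1, 0, 0, 2, 0, 0, 0])
after = dcg_at_10([3, 3, 2, 1, 0, 0, 2, 0, 0, 0])
print(round(before, 2), round(after, 2))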

The challenge with this is the time taken to re-score queries once configuration changes have been made – especially given you could be testing thousands of queries.

That’s why it’s great to see a tool like Quepid now available. Quepid sits on top of the open source search engines Apache Solr and Elasticsearch (it can also incorporate scores from other engines, which is useful for comparison purposes if you are migrating) and it automatically recalculates scores when configuration changes are made, thus reducing the time taken to understand the impact of your changes.

Business and technical teams benefit

Quepid is easy to get going with. Once you have set up and scored an initial set of search queries (known as cases), developers can tweak configurations within the Quepid Sandbox (without pushing to live) and relevancy scores are automatically recalculated enabling business users to see changes in scores immediately.

This score, combined with the feedback from search testers, provides the insight into how effective the change has been – removing uncertainty about whether you should publish the changes to your live site.

Improved stakeholder communication

Having figures that show how search relevancy is improving is also a powerful tool for communicating search performance to stakeholders (and helps to overcome those HIPPO and LIPPO challenges I’ve mentioned before too). Whilst a relevancy score itself doesn’t translate to a conversion figure, understanding how your queries are performing could support business cases and customer metric scores.

Test and Learn

As the need to manually re-score queries is removed, automated search testing becomes possible; combined with greater collaboration and understanding across the entire search team, this improves the test and learn process.

Highly Customisable

Every organisation has a different objective when it comes to improving search, but Quepid is designed so that it can support your organisation and requirements:

  • Choose from a range of available scorers or create your own
  • Set up multiple cases so that you can quickly understand how different types of queries perform
  • Share cases amongst users for review and auditing
  • Download and export cases and scores
  • Assist with a ‘deep dive’ into low scoring queries
  • Identify if there are particular trends or patterns you need to focus on as part of your testing
  • Create a dashboard to share with category managers and other stakeholders

Flax are the UK resellers for Quepid, built by our partners OpenSource Connections – contact us for a demo and free 30-day trial.


Karen Renshaw is an independent On Site Search consultant and an associate of Flax. Karen was previously Head of On Site Search at RS Components, the world’s largest electronic component distributor.

Flax can offer a range of consulting, training and support, provide tools for test-driven relevancy tuning and we also run Search Workshops. If you need advice or help please get in touch.

Developing ongoing search tuning processes (13 April 2016)

A series of blogs by Karen Renshaw on improving site search:

  1. How to get started on improving Site Search Relevancy
  2. A suggested approach to running a Site Search Tuning Workshop
  3. Auditing your site search performance
  4. Developing ongoing search tuning processes
  5. Measuring search relevance scores

 


In my last blog I wrote about how to create an audit of your current site search performance. In this blog I cover how to develop search tuning processes.

Once started on your search tuning journey, developing ongoing processes is a must. Search tuning is an iterative process and must be treated as such. In the same way that external search traffic – PPC and SEO – is continually reviewed and optimised, so must on site search be: otherwise you have invested a lot of time and money to get people to your site, but then leave them wandering aimlessly in the aisles wondering if you have the product or information you so successfully advertised!

There are 2 key areas to focus on when developing search processes:

  1. Ongoing review of search performance
  2. Dedicated resource

1. Ongoing review of search performance

Develop a framework for measuring relevancy scores

It’s good practice to benchmark how search queries are performing by creating a search relevancy framework. Simply put, this is a score assigned to each search result based on how well that result answers the original query.

You can customise the scoring system you use to score your search results. Whatever you choose, the key is to ensure that your search analysts are consistent in their approach; the best way to achieve that is to provide documented guidelines.

Understanding how query scores change with different configurations is an integral part of the search tuning process, but you should also run regular reviews of how queries are performing. This way you’ll know the impact that loading new documents and products into your site is having on overall relevancy, and can highlight changes you need to feed into your product backlog.

Process for manually optimising important or problematic queries

Even with a search tuning test and learn plan in place there will be some queries that don’t do as well as expected, or for which a manually built custom response provides a better customer experience.

Whilst manually tuning a search can sometimes be viewed in a negative light – after all search should ‘just work’ – it shouldn’t be seen as such. Manually optimising important search queries means that you can provide a tailored response for your customer. The queries you optimise will be dependent on your metrics and what you deem as being a good or bad experience.

With manual optimisation you should also build in continual reviews and take the opportunity to test different landing pages.

Competitive review

I’ve talked about this in a few of my other blogs, but it is especially important for eCommerce sites to understand how your competitors are answering your customers’ queries. As you create a search relevancy framework for your site it’s easy to score the same queries on your competitors’ sites to draw out comparisons and understand opportunities for improvement.

2. Dedicated Resource

Creating and maintaining the above reviews needs resource. Ideally you would have a staff member dedicated to reviewing search and responsible for updating the product backlog with configuration changes, working alongside developers to ensure changes are tested and deployed successfully.

If you don’t have a dedicated person responsible, the right skills will undoubtedly exist within your organisation. You will have teams who understand your product / information set, and within that team you will find a sub-set of individuals who have problem solving skills combined with a passion to improve the customer experience. Once you’ve found them, providing them with some light search knowledge will be enough to get you started.

Whether it’s a full-time role or part-time, having someone focus on reviewing search queries should be part of your plan.

What’s next?

Now you have processes and a team in place it’s time to consider what to measure (and how). In my next blog I’ll cover how to measure search relevancy scores.

Karen Renshaw is an independent On Site Search consultant and an associate of Flax. Karen was previously Head of On Site Search at RS Components, the world’s largest electronic component distributor.

Flax can offer a range of consulting, training and support, provide tools for test-driven relevancy tuning and we also run Search Workshops. If you need advice or help please get in touch.

Auditing your site search performance (8 April 2016)

A series of blogs by Karen Renshaw on improving site search:

  1. How to get started on improving Site Search Relevancy
  2. A suggested approach to running a Site Search Tuning Workshop
  3. Auditing your site search performance
  4. Developing ongoing search tuning processes
  5. Measuring search relevance scores


In my last blog I wrote in depth about how to run a search workshop. In this blog I cover how to create an audit of your current site search performance.

When starting on your journey to improve on site search relevancy, investing time in understanding current performance is essential. Whilst the level of detail available to you will vary by industry and business, there are multiple sources of information you will have access to that provide insight. Combining these will ensure you have a holistic view of your customers’ experience.

The main sources of information to consider are:

  • Web Analytics
  • Current Relevancy Scores
  • Customer Feedback
  • Known Areas of Improvement
  • Competitors

Web Analytics

The metrics you use will be dependent upon your business, the role that search plays on your site and what you currently measure. What is key is to develop a view of how core search queries are performing. Classifying and creating an aggregated view of performance for different search queries allows you to identify any differences by search type, which you might want to focus on as part of your testing.

This approach also helps to prevent reacting to the opinions of HIPPOs and LIPPOs (the Highest Paid Person’s and Loudest Paid Person’s Opinions) when constructing test matrices.

Another measure to consider is zero results – what percentage of your search queries lead customers to see the dreaded ‘no results found’ message. Don’t react to the overall percentage as a figure per se (a low percentage could mean too many irrelevant results are being returned, a high percentage that you don’t have the product / information your customers are looking for). Again, what you’re trying to get to is an understanding of the root cause so you can build changes into your overall test plan. It’s a manual process, but even a review of the top 200 zero results will throw up meaningful insights.
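As a sketch of how such a review might start (the log file and its query/result_count columns here are hypothetical, standing in for whatever your analytics package exports):

import csv
from collections import Counter

zero_hits = Counter()
with open('search_log.csv', newline='') as f:  # hypothetical export: one row per search
    for row in csv.DictReader(f):
        if int(row['result_count']) == 0:
            zero_hits[row['query'].strip().lower()] += 1

# Reviewing the most frequent zero-result queries usually surfaces the main themes
for query, count in zero_hits.most_common(200):
    print(count, query)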

Current Relevancy Scores

Very closely linked to Web Analytics is a view of current search relevancy scores. It’s good practice to benchmark how search queries are performing by creating a search relevancy framework. Simply put, this is a score assigned to each search result based on how well that result answers the original query.

Use queries from your search logs so you know you are scoring the queries important to your customers (and not just those important to those HIPPOs and LIPPOs). And whilst scoring will always be subjective, providing guidelines to your testers helps mitigate this.

Tools like Quepid, which can sit on top of open source search engines Apache Solr and Elasticsearch (and also incorporate scores from other engines) and automatically recalculate scores when configuration changes are made, can support ongoing search tuning processes.

Customer Feedback

Whether in the form of structured or unstructured feedback, with site search critical to a successful customer experience, your customers will undoubtedly be providing you with a wealth of feedback.

Take the time to read through as much of it as possible. Even better, walk through some of the journeys yourself to understand the experience from the eyes of your customers. Whilst the feedback might be vague you’ll quickly find you can classify and pull out key themes.

Internal customer service departments can also provide you with customer logs and real life scenarios. Involving them up front to identify problem areas can help in the long term as they can be an invaluable resource when testing different search set ups.

Known Areas of Improvement

You’ve probably already got a list of search configurations on your backlog you want to review and test. Your developers will too, as will multiple teams across the organisation. Pulling together all these different views can provide a useful perspective on how to tackle problem areas.

Whilst you need to develop your search strategy based on customers’ needs (not just what other people like), it’s always useful to have sight of what existing search functionality has helped them to find the right product, so capture these ideas as you go along.

Competitor Review

A very important question for e-commerce sites is how are your competitors answering common queries? As you have for your own site, scoring common search queries across multiple sites provides a view of how you fare compared to your competitors.

Getting Started!

Now you have all this insight you can start to build out your search test plans with your developers. In my next blog I’ll cover how to start developing search tuning processes.

Karen Renshaw is an independent On Site Search consultant and an associate of Flax. Karen was previously Head of On Site Search at RS Components, the world’s largest electronic component distributor.

Flax can offer a range of consulting, training and support, provide tools for test-driven relevancy tuning and we also run Search Workshops. If you need advice or help please get in touch.

A suggested approach to running a Site Search Tuning Workshop (24 March 2016)

A series of blogs by Karen Renshaw on improving site search:

  1. How to get started on improving Site Search Relevancy
  2. A suggested approach to running a Site Search Tuning Workshop
  3. Auditing your site search performance
  4. Developing ongoing search tuning processes
  5. Measuring search relevance scores


In my last blog I talked about getting started on improving site search relevancy, including the idea of running a two-day initial workshop. In this blog I cover more detail around what the workshop looks like in terms of structure.

Your reason for improving on site search could be driven by migration to a new platform or a need to improve ‘business as usual’ performance. As such, the exact structure should be tailored to you. It’s also worth remembering that whilst the workshop is the starting point, to get the most from it you will need to spend time in advance to gather all the relevant information you’ll need.

Workshop Overview

Objectives: Spend 30 mins at the start of the day to ensure that the objectives (for the workshop and the overall project) are communicated and agreed across the entire project team.

Review the current search set up

It might seem wasteful to spend time reviewing your current set up – especially if you are moving to a new search platform – but ensuring everyone understands what set up you have today, and why, is essential when designing the future state.

It’s useful to break this session further into a Technical Set Up and Business Process. This helps to uncover if there are:

  • Particular search cases that you have developed workarounds for and which you need to protect revenue for – your intent will be to remove these workarounds, but you do need to be aware they exist
  • Changes to your content model or content systems that you need to take into consideration
  • Technical constraints that you had in the past that are now gone

Ensuring a common level of understanding helps as the project moves forward.

Review current performance

Ensuring that the team knows how search queries are currently performing again increases buy in and engagement and provides a benchmark against which changes can be measured.

Your metrics will be dependent upon your business and what you currently measure (if you aren’t measuring anything – this would also be a good time to plan out what you should).

Classifying the types of search queries your customers are using is also important: do customers search predominately for single keywords, lengthy descriptors or part numbers? Whilst getting to this level of detail involves manual processes, it not only provides a real insight into how your customers formulate queries but also helps to avoid the ‘see-saw’ impact of focusing on fixes for some whilst unknowingly breaking others further down the tail.
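A rough sketch of this kind of classification (the rules below are invented for illustration and would need adapting to your own query logs and part-number formats):

import re

def classify(query):
    tokens = query.split()
    if re.fullmatch(r'[A-Za-z0-9\-/]*\d{3,}[A-Za-z0-9\-/]*', query.strip()):
        return 'part number'
    if len(tokens) == 1:
        return 'single keyword'
    if len(tokens) >= 4:
        return 'long descriptor'
    return 'short phrase'

for q in ('1756-L61', 'resistor', 'red led 5mm 20ma'):
    print(q, '->', classify(q))

Aggregating relevancy scores and web analytics by bucket then shows whether, say, part-number searches behave very differently from natural-language ones.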

Develop a search testing methodology

With the information to hand around current search set up and performance, now comes the fun part – figuring out the configuration set ups and tests you want to include as part of that new set up.

If you are migrating to a new platform, new approaches are possible, but if you’re working with existing technology there are opportunities to review and test current assumptions.

Search tuning is an iterative process: impacts of configuration changes are only understood once you start testing and determine if the results are as you expected, so build this into the plan from the start.

Dependent upon timescales and objectives you might choose to make wholesale changes immediately, or you might decide to make a series of small changes to be able to test and measure each of them independently. Whichever option is best for you, measuring and tracking changes to your search relevancy scores is critical; tools such as Quepid make this possible (it’s also a great tool for building those collaborative working practices which are so important).

Whilst the focus is around improving search relevancy, excellent search experiences are achieved as a result of the holistic user experience, so remember to consider your UX strategy alongside your search relevancy strategy.

Making plans

Alongside clearly defined objectives you should aim to end the workshop with clearly defined action plans. The level of detail you capture and maintain again depends on your needs but as a minimum you should have mapped out:

  • Initial Configuration Tests
  • Test Search Queries
  • Test Team
  • Ongoing project management (Stand Ups / Project Reviews)

In my next blog I’ll write in more detail about how to audit your current and future search performance.

Karen Renshaw is an independent On Site Search consultant and an associate of Flax. Karen was previously Head of On Site Search at RS Components, the world’s largest electronic component distributor.

Flax can offer a range of consulting, training and support, provide tools for test-driven relevancy tuning and we also run Search Workshops. If you need advice or help please get in touch.

 

How to get started on improving Site Search Relevancy (18 March 2016)

A series of blogs by Karen Renshaw on improving site search:

  1. How to get started on improving Site Search Relevancy
  2. A suggested approach to running a Site Search Tuning Workshop
  3. Auditing your site search performance
  4. Developing ongoing search tuning processes
  5. Measuring search relevance scores

 


You know your search experience isn’t working – your customers, your colleagues, your bosses are telling you – you know you need to fix it, fix something but where do you start?

Understanding and improving search relevancy can often feel like a never ending journey and it’s true – tuning search is not a one-off hit – it’s an iterative ongoing process that needs investment. But the resources, companies and tools needed to support you are available.

Here, I’ll take a quick look at how to get started on your search tuning journey. I’ll be following up in subsequent blog posts with more details of each step.

Getting Started

Like any project, to be successful you need to understand what you want to achieve. The best way is to kick off the process with a multi-functional Search Workshop.

Typically run over 2 days, this workshop is designed to identify what to focus on and how. It becomes the key to developing ongoing search tuning processes and driving collaborative working across teams.

Workshop Agenda

Whilst the agenda can be adapted to be specific to your organisation, in the main there are 4 key stages to it:

  1. Audit
  2. Define
  3. Testing Approach
  4. Summary

1. Audit – Where are we now?

Spend time understanding in depth what the issues are. There are many sources of information you can call on:

  • Web Analytics – How are queries performing today?
  • Customer Feedback – What are the key areas that your customers complain about?
  • Known Areas of Improvement – What’s already on your product backlog?
  • Competitive Review – Very important for eCommerce sites – how are your competitors responding to your customers’ queries?

2. Define – Where do we want to be?

As a team agree what the objectives for the project are:

  • What are the issues you want to address?
  • Are there specific types of search queries you want to focus on?
  • Is an overhaul of all search queries something you want to achieve?
  • What are the technical opportunities you haven’t yet exploited?

3. Testing Approach – What’s the plan of attack?

This is the time to plan out what changes you will make and what methodology for testing and deployment you are going to use.

  • What order should you make your configuration changes in?
  • Are there any constraints / limitations you need to plan around?
  • What resources do you need to support search configuration testing?
  • How are you going to measure and track your changes so you know they are successful?
  • Do you need to build in a communication plan for stakeholders?

4. Summary

Ensure that all actions are captured in a project plan with clear owners and timescales.

Workshop Attendees

Within an organisation multiple teams have responsibility for making search better, so at a minimum a subject matter expert from each team should attend.

Key attendees:

  • Business Owner
  • Search Developer
  • Content Owner
  • Web Analyst

Benefits of the workshop

There are practical and cultural benefits to approaching search in this way:

  • Collaborative working practices across the different disciplines are improved
  • Shared objectives and issues lead to better engagement and understanding of the approach
  • A test and learn approach can be developed with the time between testing iterations reduced
  • The workshop itself is an indicator to the wider business that search is now a key strategic priority and that it is getting the love and attention it needs

In my next blog I’ll cover how to run the workshop in more detail.

Karen Renshaw is an independent On Site Search consultant and an associate of Flax. Karen was previously Head of On Site Search at RS Components, the world’s largest electronic component distributor.

Flax can offer a range of consulting, training and support, provide tools for test-driven relevancy tuning and we also run Search Workshops. If you need advice or help please get in touch.

XJoin for Solr, part 2: a click-through example (29 January 2016)

In my last blog post, I demonstrated how to set up and configure Solr to use the new XJoin search components we’ve developed for the BioSolr project, using an example from an e-commerce setting. This time, I’ll show how to use XJoin to make use of user click-through data to influence the score of products in searches.

I’ll step through things a bit quicker this time around and I’ll be using code from the last post so reading that first is highly recommended. I’ll assume that the prerequisites from last time have been installed and set up in the same directories.

The design

Suppose we have a web page for searching a collection of products, and when a user clicks on a product listing in the result set (or perhaps, when they subsequently go on to buy that product – or both) we insert a record in an SQL database, storing the product id, the query terms they used, and an arbitrary weight value (which will depend on whether they merely clicked on a result, or if they went on to buy it, or some other behaviour such as mouse pointer tracking). We then want to use the click-through data stored in that database to boost products in searches that use those query terms again.

We could use the sum of the weights of all occurrences of a product id/query term combination as the product score boost, but then we might start to worry about a feedback process occurring. Alternatively, we might take the maximum or average weight across the occurrences. In the code below, we’ll use the maximum.

The advantage of this design over storing the click-through information in Solr is that you don’t have to update the Solr index every time there is user activity, which could become costly. An SQL database is much more suited to this task.

The external click-through API

Again, we’ll be using Python 3 (using the flask and sqlite3 modules) to implement the external API. I’ll be using this API to update the click-through database (by hand, for this example) as well as having Solr query it using XJoin. Here’s the code (partly based on code taken from here for caching the database connection in the Flask application context, and see here if you’re interested in more details about sqlite3’s support for full text search). Again, all the code written for this example is also available in the BioSolr GitHub repository:

from flask import Flask, request, g
import json
import sqlite3 as sql

# flask application context attribute for caching database connection
DB_APP_KEY = '_database'

# default weight for storing against queries
DEFAULT_WEIGHT = 1.0

app = Flask(__name__)

def get_db():
  """ Obtain a (cached) DB connection and return a cursor for it.
  """
  db = getattr(g, DB_APP_KEY, None)
  if db is None:
    db = sql.connect('click.db')
    setattr(g, DB_APP_KEY, db)
    c = db.cursor()
    c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS click USING fts4 ("
                "id VARCHAR(256),"
                "q VARCHAR(256),"
                "weight FLOAT"
              ")")
    c.close()
  return db

@app.teardown_appcontext
def teardown_db(exception):
  db = getattr(g, DB_APP_KEY, None)
  if db is not None:
    db.close()

@app.route('/')
def main():
  return 'click-through API'

@app.route('/click/<path:id>', methods=["PUT"])
def click(id):
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  try:
    w = float(request.args.get('weight', DEFAULT_WEIGHT))
  except ValueError:
    return 'Could not parse weight', 400

  # do the DB update
  db = get_db()
  try:
    c = db.cursor()
    c.execute("INSERT INTO click (id, q, weight) VALUES (?, ?, ?)", (id, q, w))
    db.commit()
    return 'OK'
  finally:
    c.close()

@app.route('/ids')
def ids():
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  
  # do the DB lookup
  try:
    c = get_db().cursor()
    c.execute("SELECT id, MAX(weight) FROM click WHERE q MATCH ? GROUP BY id", (q, ))
    return json.dumps([{ 'id': id, 'weight': w } for id, w in c])
  finally:
    c.close()

if __name__ == "__main__":
  app.run(port=8001, debug=True)

This web API exposes two end-points. First we have PUT /click/[id] which is used when we want to update the SQL database after a user click. For the purposes of this demonstration, we’ll be hitting this end-point by hand using curl to avoid having to write a web UI. The other end-point, GET /ids?[query terms], is used by our XJoin component and returns a JSON-formatted array of id/weight objects where the query terms from the database match those given in the query string.

Java glue code

Now we just need the Java glue code that sits between the XJoin component and our external API. Here’s an implementation of XJoinResultsFactory that does what we need:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class ClickXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
  }

  /**
   * Use 'click' REST API to fetch current click data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    String q = URLEncoder.encode(params.get("q"), "UTF-8");
    String apiUrl = url + "?q=" + q;
    try (HttpConnection http = new HttpConnection(apiUrl)) {
      JsonArray products = (JsonArray)http.getJson();
      return new ClickResults(products);
    }
  }
    
  public class ClickResults implements XJoinResults {
    private Map<String, Click> clickMap;
    
    public ClickResults(JsonArray products) {
      clickMap = new HashMap<>();
      for (JsonValue product : products) {
        JsonObject object = (JsonObject)product;
        String id = object.getString("id");
        double weight = object.getJsonNumber("weight").doubleValue();
        clickMap.put(id, new Click(id, weight));
      }
    }
    
    public int getCount() {
      return clickMap.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      return clickMap.keySet();
    }

    @Override
    public Object getResult(String id) {
      return clickMap.get(id);
    }      
  }
  
  public class Click {
    
    private String id;
    private double weight;
    
    public Click(String id, double weight) {
      this.id = id;
      this.weight = weight;
    }
    
    public String getId() {
      return id;
    }
    
    public double getWeight() {
      return weight;
    } 
  }
}

Unlike the previous example, this time getResults() does depend on the SolrParams argument, so that the user’s query, q, is passed to the external API. Store this Java source in blog/src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java and compile it into a JAR (again, we also need the HttpConnection class from the last blog post as well as javax.json-1.0.4.jar):

blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java
blog$ jar cvf click.jar -C bin .

Solr configuration

Starting with a fresh version of solrconfig.xml, insert these lines near the start to import the XJoin and user JARs (substitute /XXX with the full path to the parent of the blog directory):

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/click.jar" />

And our query parser, value source parser, XJoin search component and request handler configuration:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="weight" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">weight</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_click" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.ClickXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8001/ids</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_click">false</bool>
    <str name="x_click.results">count</str>
    <str name="x_click.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_click</str>
  </arr>
  <arr name="last-components">
    <str>x_click</str>
  </arr>
</requestHandler>

Reload the Solr core (products) to get the new config in place.

Putting the pieces together

The following query will verify our Solr setup (remembering to escape curly brackets):

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&fl=id,name,score&rows=4' | jq .

I’ve used Solr parameter substitution with the q/qq parameters which will simplify later queries (this has been in Solr since 5.1). This query returns:

{
  "responseHeader": {
    "status": 0,
    "QTime": 25
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 2.9939778,
    "docs": [
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.8712361
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 1.8712361
      }
    ]
  }
}

Some repeat products in the data, but so far, so good. Next, get the click-through API running:

blog$ python3 click.py

And check it’s working (this should return [] whatever query is chosen because the click-through database is empty):

curl localhost:8001/ids?q=software | jq .

Now, let’s populate the click-through database by simulating user activity. Suppose, given the above product results, the user goes on to click through to the fourth product (or even buy it). Then, the UI would update the click web API to indicate this has happened. Let’s do this by hand, specifying the product id, the user’s query, and a weight score (here, I’ll use the value 3, supposing the user bought the product in the end):

curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/9200068133591804002?q=excel&weight=3'

Now, we can check the output that XJoin will see when using the click-through API:

blog$ curl localhost:8001/ids?q=excel | jq .

giving:

[
  {
    "weight": 3,
    "id": "http://www.google.com/base/feeds/snippets/9200068133591804002"
  }
]

Using the bf edismax parameter and the weight function set up in solrconfig.xml to extract the weight value from the external results stored in the x_click XJoin search component, we can boost product scores when they appear in the click-through database for the user’s query:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

which gives:

{
  "responseHeader": {
    "status": 0,
    "QTime": 13
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 3.2224145,
    "docs": [
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 3.2224145
      },
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.5559989
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
          "weight": 3
        }
      }
    ]
  }
}

Lo and behold, the product the user clicked on now appears top of the Solr results for that query. Have a play with the API, generate some more user activity and see how this affects subsequent queries. It will cope fine with multiple-word queries; for example, suppose a user searches for ‘games software’:

curl 'http://localhost:8983/solr/products/xjoin?qq=games+software&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

There being no relevant queries in the click-through database, this has the same results as for a query without the XJoin, and as we can see, the value of response.x_click.count is 0:

{
  "responseHeader": {
    "status": 0,
    "QTime": 15
  },
  "response": {
    "numFound": 1158,
    "start": 0,
    "maxScore": 0.91356516,
    "docs": [
      {
        "name": "encore software 10568 - encore hoyle puzzle & board games 2005 - complete product - puzzle game - 1 user - complete product - standard - pc",
        "id": "http://www.google.com/base/feeds/snippets/4998847858583359731",
        "score": 0.91356516
      },
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 0.8699497
      },
      {
        "name": "encore software 10027 - hoyle board games (win 98 me 2000 xp)",
        "id": "http://www.google.com/base/feeds/snippets/8664755713112971171",
        "score": 0.85982025
      },
      {
        "name": "encore software 11253 - brain food games: cranium collection 2006 sb cs by encore",
        "id": "http://www.google.com/base/feeds/snippets/15401280256033043239",
        "score": 0.78744644
      }
    ]
  },
  "x_click": {
    "count": 0,
    "external": []
  }
}

Now let’s simulate the same user clicking on the second product (with default weight):

blog$ curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/826668451451666270?q=games+software'

Next, suppose another user then searches for just ‘games’:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=games&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

In the results, we see the ‘wild games’ product boosted to the top:

{
  "responseHeader": {
    "status": 0,
    "QTime": 60
  },
  "response": {
    "numFound": 212,
    "start": 0,
    "maxScore": 1.3652229,
    "docs": [
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 1.3652229
      },
      {
        "name": "xbox 360: ddr universe",
        "id": "http://www.google.com/base/feeds/snippets/16659259513615352372",
        "score": 0.95894843
      },
      {
        "name": "south park chef's luv shack",
        "id": "http://www.google.com/base/feeds/snippets/11648097795915093399",
        "score": 0.95894843
      },
      {
        "name": "egames. inc casual games pack",
        "id": "http://www.google.com/base/feeds/snippets/16700933768709687512",
        "score": 0.89483213
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
          "weight": 1
        }
      }
    ]
  }
}

Extensions

Of course, this approach can be extended to add in more sophisticated weighting and boosting strategies, or include more data about the user activity than just a simple weight score, which could be used to augment the display of the product in the UI (for example, “ten customers in the UK bought this product in the last month”).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.). We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

XJoin for Solr, part 1: filtering using price discount data (25 January 2016)

The post XJoin for Solr, part 1: filtering using price discount data appeared first on Flax.

]]>
In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

The XJoin component can be used when you want values from some source external to Solr to filter or influence the score of hits in your Solr result set.  It is currently available as a Solr patch on the XJoin JIRA ticket SOLR-7341, so to use it, you’ll need to check out a version of Apache Lucene/Solr using Subversion, then patch and build it (see below for details).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

Patching Solr

I’m going to be using Solr version 5.3 for this blog. If you’re following along, check out a clean copy using Subversion:

$ svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3

Download the XJoin patch (find the one corresponding to this version of Solr on the JIRA ticket) into the newly checked-out directory, and apply it:

lucene_solr_5_3$ svn patch SOLR-7341.patch-5_3

And then build Solr from the solr sub-directory:

lucene_solr_5_3/solr$ ant server

We should now be able to start the patched Solr server:

lucene_solr_5_3/solr$ bin/solr start

Indexing a sample product data set

I’ll be using a sample Google product feed, GoogleProducts.csv, which I got from here. Create a new directory called blog (mine has the same parent as my Solr check-out) and download the sample into it. It’s in CSV format, with columns for product id, name, description, manufacturer and price. Indexing this will be a piece of cake!

We’ll begin with a copy of the sample Solr config directory:

blog$ cp -r ../lucene_solr_5_3/solr/server/solr/configsets/basic_configs/conf .

Modify conf/schema.xml so that our Solr documents have fields corresponding to those in the CSV file:

<field name="id" type="string" indexed="true" /> 
<field name="name" type="text_en" indexed="true" />
<field name="description" type="text_en" indexed="true" />
<field name="manufacturer" type="string" indexed="true" />
<field name="price" type="float" indexed="true" />

Naturally, the product id will serve as the Solr unique key:

<uniqueKey>id</uniqueKey>

We can use the sample solrconfig.xml as is for now. Add a core called products using the Solr core admin UI (as you started a Solr server above, this should be available at   http://localhost:8983/solr/#/~cores). The values for instanceDir and dataDir will both be the full path of the blog directory.
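If you prefer to script this step, the core admin UI is just a front end for Solr’s CoreAdmin API, so the same core can be created with a short Python snippet (an alternative to the UI, not something the original setup requires; substitute the full path of your blog directory):

import requests

blog_dir = "/full/path/to/blog"  # the full path of the blog directory created above

requests.get("http://localhost:8983/solr/admin/cores", params={
    "action": "CREATE",
    "name": "products",
    "instanceDir": blog_dir,
    "dataDir": blog_dir,
})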

I’ll be using Python to index the product data. The code is written for Python 3, and won’t work in Python 2.x because of character encoding issues in the csv module, but you can fix it by using a UTF8Recoder as described in the module documentation. Here’s my indexing script (note that all the code written for this example is also available in the BioSolr GitHub repository):

import sys
import csv
import json
import requests

def value(k, v):
    return k, v.strip() if k != 'price' else float(v.split()[0])

def read(path):
    with open(path, encoding='iso-8859-1') as f:
        reader = csv.DictReader(f)
        for doc in reader:
            yield dict(value(k, v) for k, v in doc.items()
                       if len(v.strip()) > 0)

def index(url, docs):
    print("Sending {0} documents to {1}".format(len(docs), url))
    data = json.dumps(docs)
    headers = { 'content-type': 'application/json' }
    r = requests.post(url, data=data, headers=headers)
    if r.status_code != 200:
      raise IOError("Bad SOLR update")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {0} <Solr update URL> <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    docs = list(read(sys.argv[2]))
    index(sys.argv[1], docs)

The script tidies up the prices because they aren’t consistently formatted, converting them to float values. Save the script in index.py and use it to index the Google product data into Solr (let’s force commits, just to be sure):

blog$ python3 index.py http://localhost:8983/solr/products/update?commit=true GoogleProducts.csv
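As a quick aside, you can sanity-check the price handling by trying the value() helper in a Python shell (these sample strings are my own, not taken from the feed):

>>> from index import value
>>> value('price', '38.99 usd')
('price', 38.99)
>>> value('name', ' apple ilife 06 ')
('name', 'apple ilife 06')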

And, lo and behold, we can see our data in Solr using cURL (I like to pipe the output through jq to get nicely formatted JSON):

curl 'localhost:8983/solr/products/select?wt=json&q=*' | jq .

So, using Solr we’ve now built a full text product search in only a few minutes, with potentially all the add-ons Solr provides out of the box. However, suppose there is supplementary information about the products, available from an external source (which might not be under our control).

I will now demonstrate how to configure Solr so that during a product search, the external source is also queried (either with the same user query or something different) and the resulting external data used to influence the result set. Each external result is ‘joined’ against a Solr document via a ‘join field’ or ‘join id’, which doesn’t have to be the Solr unique id (in the examples below I use the product id and manufacturer as the join fields). To get an ‘inner join’ I will use the XJoinQParserPlugin to turn the external ids into a filter query, but it’s also possible to build boost queries or use the XJoinValueSourceParser to use external values in a boost function. You can see all this implemented below.

Product discount offers example

In the first of my examples, I’ll set up filtering and score boosting based on discount offers, the external source for which is going to be a web service that I’ll make available locally at http://localhost:8000 (with /products and /manufacturers end-points). Again, I’ll implement this in Python, using the popular Flask web micro-framework and the requests module. Install both of these using pip (I need sudo, but you might not):

blog$ sudo pip install flask requests

Creating the external source

Here’s my code for the product offers web API:

from flask import Flask
from index import read
import json
import random
import sys

app = Flask(__name__)

@app.route('/')
def main():
    return json.dumps({ 'info': 'product offers API' })

@app.route('/products')
def products():
    offer = lambda doc: {
                'id': doc['id'],
                'discountPct': random.randint(1, 80)
            }
    return json.dumps([offer(doc) for doc
                       in random.sample(app.docs, 64)])

@app.route('/manufacturers')
def manufacturer():
    manufacturers = set(doc['manufacturer'] for doc in app.docs
                        if 'manufacturer' in doc)
    deal = lambda m: {
                'manufacturer': m,
                'discountPct': random.randint(1, 10) * 5
            }
    return json.dumps([deal(m) for m
                       in random.sample(list(manufacturers), 3)])

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: {0} <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    app.docs = list(read(sys.argv[1]))
    app.run(port=8000, debug=True)

The code generates discounts for a random selection of products and manufacturers. Save it to blog/offer.py and start the server, supplying the Google products CSV file on the command line:

blog$ python3 offer.py GoogleProducts.csv

Now, test it out using cURL (again, I like to pipe through jq to get nicely formatted JSON):

$ curl -s localhost:8000/products | jq .

You should see a list of objects, each with a product id and a discount percentage, something like:

[
  {
    "discountPct": 41,
    "id": "http://www.google.com/base/feeds/snippets/18100341066456401733"
  },
  {
    "discountPct": 63,
    "id": "http://www.google.com/base/feeds/snippets/16969493842479402672"
  },
  {
    "discountPct": 13,
    "id": "http://www.google.com/base/feeds/snippets/10357785197400989441"
  },
  {
    "discountPct": 35,
    "id": "http://www.google.com/base/feeds/snippets/2813321165033737171"
  },
  {
    "discountPct": 27,
    "id": "http://www.google.com/base/feeds/snippets/15203735208016659510"
  },
  ...
]

You get similar output if you use the /manufacturers endpoint:

$ curl -s localhost:8000/manufacturers | jq .

This time, we get a shorter list, of manufacturers each with a discount percentage, for example:

[
  {
    "discountPct": 15,
    "manufacturer": "freeverse software"
  },
  {
    "discountPct": 5,
    "manufacturer": "pinnacle systems"
  },
  {
    "discountPct": 50,
    "manufacturer": "destineer inc"
  }
]

Creating XJoin glue code

To bridge the gap between Solr and our external data source, XJoin requires some glue code, written in Java, to query the source and return the results. First, I’ll create a quick utility class to help with HTTP connections:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.json.Json;
import javax.json.JsonReader;
import javax.json.JsonStructure;

public class HttpConnection implements AutoCloseable {
  private HttpURLConnection http;
  
  public HttpConnection(String url) throws IOException {
    http = (HttpURLConnection)new URL(url).openConnection();
  }
  
  public JsonStructure getJson() throws IOException {
    http.setRequestMethod("GET");
    http.setRequestProperty("Accept", "application/json");
    try (InputStream in = http.getInputStream();
         JsonReader reader = Json.createReader(in)) {
      return reader.read();
    }
  }
  
  @Override
  public void close() {
    http.disconnect();
  }
}

Save this as blog/src/java/uk/co/flax/examples/xjoin/HttpConnection.java (the src/java directory matches the -sourcepath used in the compile step below). The glue code we need is fairly simple, and can be written as a single class, implementing the XJoinResultsFactory interface:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class OfferXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  private String field;
  private String discountField;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
    field = (String)args.get("field");
    discountField = (String)args.get("discountField");
  }

  /**
   * Use 'offers' REST API to fetch current offer data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    try (HttpConnection http = new HttpConnection(url)) {
      JsonArray offers = (JsonArray)http.getJson();
      return new OfferResults(offers);
    }
  }
   
  /**
   * Results of the external search - methods like getXXX() are used
   * to expose the property XXX in the SOLR results.
   */
  public class OfferResults implements XJoinResults {
    private JsonArray offers;
    
    public OfferResults(JsonArray offers) {
      this.offers = offers;
    }
    
    public int getCount() {
      return offers.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      List ids = new ArrayList<>();
      for (JsonValue offer : offers) {
        ids.add(((JsonObject)offer).getString(field));
      }
      return ids;
    }

    @Override
    public Object getResult(String joinIdStr) {
      for (JsonValue offer : offers) {
        String id = ((JsonObject)offer).getString(field);
        if (id.equals(joinIdStr)) {
          return new Offer(offer);
        }
      }
      return null;
    }
  }
  
  /**
   * A discount offer - methods like getXXX() are used to expose
   * properties that can be joined with each Solr result via the join
   * id field.
   */
  public class Offer {
    private JsonValue offer;
    
    public Offer(JsonValue offer) {
      this.offer = offer;
    }
    
    public double getDiscount() {
      return ((JsonObject)offer).getInt(discountField) * 0.01d;
    }
  }
}

Here, the init() method initialises the URL for the external API and the names of the values we want to pick out from the external data. The getResults() method connects to the external API – since in this example the discounts do not depend on the user’s query, we don’t use the SolrParams argument at all. It returns an implementation of XJoinResults, which must be able to return a collection of join ids (that is, the value of the join id field for each external result), and also be able to return an external result object given a join id. Together, the XJoinResults object and each external result object contain the results of the external search, exposed via getXXX() methods (which are mapped to properties called XXX) and (once everything is plumbed in) available to Solr for filtering, for boosting document scores, or for inclusion in the results set.

Save the above as blog/src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java. You’ll also need javax.json-1.0.4.jar, which you can download from here if you don’t already have it – place it in the blog directory. Compile the two Java source files, and create a JAR to contain the resulting .class files:

blog$ mkdir bin
blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java
blog$ jar cvf offer.jar -C bin .
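If it helps to see the overall data flow without the Java plumbing, the join this glue code supports is roughly equivalent to the following Python sketch (my own code, not part of XJoin): fetch the external results once per query, key them by join id, then match each Solr document’s join field value against that map.

import requests

def fetch_offers(url='http://localhost:8000/products', field='id'):
    # Fetch the external results once per query and key them by join id
    return {offer[field]: offer for offer in requests.get(url).json()}

def join(solr_docs, offers, join_field='id'):
    # Attach the matching external result (if any) to each Solr document
    for doc in solr_docs:
        offer = offers.get(doc.get(join_field))
        if offer is not None:
            doc['discount'] = offer['discountPct'] * 0.01
    return solr_docs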

Configuring XJoin

So now – at last! – I’ll configure a Solr query handler that uses the XJoin Solr plugin components to add filters and boost queries based on the external data.

I’ll be working with blog/conf/solrconfig.xml now. The first thing to do is include the contrib JARs for XJoin and our glue code JAR (offer.jar) in <lib> directives near the top of the config file. To do that, add in the following snippet just under the <dataDir> directive:

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/offer.jar" />

Here, you need to substitute /XXX with the full path to the parent of the blog directory.  (We need to include javax.json-1.0.4.jar because it’s a dependency of our offer.jar.) Now for the request handler config – I’ll include everything we’re going to need even though it won’t all be used straightaway:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="discount" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">discount</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_product_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8000/products</str>
    <str name="field">id</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<searchComponent name="x_manufacturer_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">manufacturer</str>
  <lst name="external">
    <str name="url">http://localhost:8000/manufacturers</str>
    <str name="field">manufacturer</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_product_offers">false</bool>
    <str name="x_product_offers.results">count</str>
    <str name="x_product_offers.fl">*</str>

    <bool name="x_manufacturer_offers">false</bool>
    <str name="x_manufacturer_offers.results">count</str>
    <str name="x_manufacturer_offers.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
  <arr name="last-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
</requestHandler>

Insert this request handler config somewhere near the bottom of solrconfig.xml.
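One thing to bear in mind (my own note, easy to forget): Solr only reads solrconfig.xml when a core is loaded, so after saving these changes reload the products core (or simply restart Solr), for example:

import requests

# Reload the products core so the new <lib> directives and request handler are picked up
requests.get("http://localhost:8983/solr/admin/cores",
             params={"action": "RELOAD", "core": "products"})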

Using XJoin in a query

Let’s quickly get a query working, then I’ll explain what all the components that I’ve included do. Try this (remembering to escape curly brackets on the command line):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .

You should see output like this (I’ve edited responseHeader.params for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 22,
    "params": {
      "x_product_offers": "true", 
      "x_product_offers.results": "count",
      "x_product_offers.fl": "*", 
      "q": "*", 
      "fq": "{!xjoin}x_product_offers", 
      "fl": "id,name",
      "rows": "4"
    }
  },
  "response": {
    "numFound": 64,
    "start": 0,
    "docs": [
      {
        "name": "did0480p-m311 plasmon additional maintenance 24x7 - plasmon diamond technical support - consul",
        "id": "http://www.google.com/base/feeds/snippets/13522752516373728128"
      },
      {
        "name": "apple ilife '06 family pack",
        "id": "http://www.google.com/base/feeds/snippets/10939909441298262260"
      },
      {
        "name": "adobe cs3 web standard upsell",
        "id": "http://www.google.com/base/feeds/snippets/8042583218932085904"
      },
      {
        "name": "the richard friedman trio motown hits - *(for the tg-100)*",
        "id": "http://www.google.com/base/feeds/snippets/17853905518738313346"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13522752516373728128",
        "doc": {
          "discount": 0.11
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/10939909441298262260",
        "doc": {
          "discount": 0.76
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/8042583218932085904",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17853905518738313346",
        "doc": {
          "discount": 0.05
        }
      }
    ]
  }
}

Here you can see the usual Solr output with our product documents in the response.docs array. Notice the value of response.numFound is only 64 out of a possible 3226. Additionally, we have an extra section, response.x_product_offers, that gives us results from the external offers API – count tells us the total number of external results found, and there is an external result object with a join id matching each hit in the Solr results.

The query we made to get these results is a combination of the parameters in the request handler and those in the URL’s query string – I’ve left the pertinent ones in responseHeader.params. The first parameter, x_product_offers=true, turns on the XJoin component that talks to the offers API, so that at query time it will make a connection and retrieve external results (note that in this case no parameters are passed to the external API – the following blog post will demonstrate that). The next two parameters control which fields are output from the external results: the .results option is a field list controlling the fields returned from the OfferResults object (our implementation of XJoinResults – see the code above – there is one OfferResults object per external request, and it acts as a collection of the returned external results), and the .fl option is another field list controlling the fields returned for each external result object – these values can be used for filtering, boosting and so on (more on which below).

The parameters q=*, fl=id,name and rows=4 have their usual effects. The really interesting parameter is the filter query:

fq={!xjoin}x_product_offers

This uses Solr local parameters “short-form” syntax to reference the XJoinQParserPlugin that was set up in solrconfig.xml (it doesn’t take any initialisation parameters). This component uses the join ids from the referenced XJoin component to create a query that ORs together terms like join_field:join_id (one for each external result). It is based on the Solr built-in TermsQParserPlugin and supports the same method parameter (but this can usually be omitted). So, here, it makes a filter based on the join ids returned by the offers API – thus, only the products which have a current offer are returned.
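To illustrate (this is just a sketch of the idea, not the plugin’s actual implementation), with the join ids returned in the response above the generated filter is equivalent to something like:

join_ids = [
    'http://www.google.com/base/feeds/snippets/13522752516373728128',
    'http://www.google.com/base/feeds/snippets/10939909441298262260',
    # ... one entry per external result
]
fq = ' OR '.join('id:"{0}"'.format(join_id) for join_id in join_ids)
# i.e. id:"http://...3728128" OR id:"http://...8262260" OR ...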

Note that we could have used the same syntax in the q parameter to achieve the same effect, but it’s more usual for the user’s full-text query to go in q and for the ‘join’ to be applied as a filter query.

Using the XJoinValueSourceParser

The XJoinValueSourceParser component that we have configured in solrconfig.xml provides us with a function, discount, that we can use in a function query. I configured the component to extract the value of discount from external results, and we supply an XJoin component name as the argument – this is a reference to a set of external results.

This opens up lots of possibilities, for example, a search in which each product’s score is  boosted by a reciprocal function of the price including discount (so cheaper products, after discounting, are boosted higher):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,price,score&rows=4' | jq .

which results in a response something like (again, with responseHeader.params edited for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
       "x_product_offers": "true", 
       "x_product_offers.results": "count", 
       "x_product_offers.fl": "*", 
       "q": "*", 
       "bf": "recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2",
       "fl": "id,price,score",
       "rows": "4"
     }
   },
  "response": {
    "numFound": 3226,
    "start": 0,
    "maxScore": 1.3371909,
    "docs": [
      {
        "id": "http://www.google.com/base/feeds/snippets/549551716004314019",
        "price": 0.5,
        "score": 1.3371909
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "price": 8.49,
        "score": 1.325241
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "price": 9.9,
        "score": 1.3166784
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/18427513736767114578",
        "price": 2.99,
        "score": 1.3156738
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "doc": {
          "discount": 0.71
        }
      }
    ]
  }
}

This time, because we haven’t applied a filter based on the external join ids, we still have the full set of documents in the results set (3226 in total). Note that although there are 4 results in response.docs (as requested by rows=4), there are only 2 external results in x_product_offers.external – this is because only 2 of those 4 Solr documents have matching external results (that is, external results with the same join id value as the join field, which in this case is the product id). In other words, only 2 of the 4 products returned have discounts offered.

To achieve the price boost, instead of a filter query, we have a boost function:

bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2

For each Solr document in the results set, the value of the expression discount(x_product_offers) is found by calling getDiscount() on the matching external result  in the x_product_offers XJoin search component. When there is no matching external result, the default value 0.0 is used, as configured for the value source parser in solrconfig.xml, which is equivalent to a 0% discount.
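To make the arithmetic concrete: Solr’s recip(x, m, a, b) function evaluates to a / (m*x + b), so the boost works out as follows (values taken from the response above):

def boost(price, discount):
    # recip(product(price, sub(1, discount)), 1, 100, 100)
    effective_price = price * (1 - discount)
    return 100.0 / (1.0 * effective_price + 100.0)

print(boost(8.49, 0.78))  # ~0.98: heavily discounted, so boosted strongly
print(boost(8.49, 0.0))   # ~0.92: same price with no discount is boosted less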

Of course, instead of the match-all q=* query, we can do an actual product search with our price boost, for example, q=apple. To be more sophisticated, we can also use the edismax parameter qf to query across both the name and description fields and weight them as we desire, for example, qf=name^4 description^2 or similar.
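Putting that together, such a search might look like the following (a hypothetical example using the handler and parameters defined earlier; requests takes care of the URL encoding):

import requests

params = {
    'q': 'apple',
    'qf': 'name^4 description^2',
    'x_product_offers': 'true',
    'bf': 'recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2',
    'fl': 'id,name,price,score',
    'rows': 4,
}
resp = requests.get('http://localhost:8983/solr/products/xjoin', params=params)
print(resp.json()['response']['docs'])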

Joining on a field other than the unique id field

The join field does not have to correspond to the Solr unique id field. As seen above, the offers web API also returns discounts based on manufacturer (using the /manufacturers end-point). I configured another XJoin search component in solrconfig.xml called x_manufacturer_offers; the only differences from x_product_offers are the join field, which is now manufacturer, and the field taken from the external results as the join value, which is (of course) also manufacturer.

So now, for example, we can run a weighted query for ‘software’ (boosting matches in the product name over the description), restricted to products whose manufacturer currently has a discount of at least 20%:

blog$ curl 'localhost:8983/solr/products/xjoin?q=software&qf=name^4+description^2&x_manufacturer_offers=true&fq=\{!frange+l=0.2\}discount(x_manufacturer_offers)&fl=*&rows=4' | jq .

See FunctionRangeQParserPlugin for details of the filter query used in this search. This gives something like (responseHeader.params omitted this time):

{
  "responseHeader": {
    "status": 0,
    "QTime": 4
  },
  "response": {
    "numFound": 25,
    "start": 0,
    "maxScore": 1.1224447,
    "docs": [
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/7436299398173390476",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17001745805951209994",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/10584509515076384561",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329559531500
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17283219592038470822",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329681166300
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "freeverse software",
        "doc": {
          "discount": 0.2
        }
      }
    ]
  }
}

In this case, there was only one manufacturer represented in the requested top 4 rows of the Solr results set.

Using two XJoin components in the same query

It’s worth noting that you can use more than one XJoin component in the same query. You can come up with more complicated examples, but this one shows how to query for all products that have a manufacturer discount as well as a product discount:

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_manufacturer_offers=true&fq=\{!xjoin\}x_product_offers&fq=\{!xjoin\}x_manufacturer_offers&fl=id,name,manufacturer&rows=4&wt=json' | jq .

You might have to try again a few times before you get a non-empty result set – here’s one I got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "name": "apple software m8789z/a webobjects 5.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/4776201646741876078"
      },
      {
        "name": "apple software m9301z/b soundtrack v1.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/16537637847870148950"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/4776201646741876078",
        "doc": {
          "discount": 0.59
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/16537637847870148950",
        "doc": {
          "discount": 0.22
        }
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "apple software",
        "doc": {
          "discount": 0.3
        }
      }
    ]
  }
}

So you can see that there are two external results sections, one for product offers and one for manufacturer offers, and how the offers are matched to the products by the join ids (which is either the product id, or the manufacturer).

Next time…

In my next blog post, I’ll dive in to another demonstration of XJoin, in which I show how to use click-through data to influence the score of subsequent searches.

The post XJoin for Solr, part 1: filtering using price discount data appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/feed/ 5
Out and about in search & monitoring – Autumn 2015 http://www.flax.co.uk/blog/2015/12/16/search-monitoring-autumn-2015/ http://www.flax.co.uk/blog/2015/12/16/search-monitoring-autumn-2015/#respond Wed, 16 Dec 2015 10:24:42 +0000 http://www.flax.co.uk/?p=2857 It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress … More

The post Out and about in search & monitoring – Autumn 2015 appeared first on Flax.

]]>
It’s been a very busy few months for events – so busy that it’s quite a relief to be back in the office! Back in late November I travelled to Vienna to speak at the FIBEP World Media Intelligence Congress with our client Infomedia about how we’ve helped them to migrate their media monitoring platform from the elderly, unsupported and hard to scale Verity software to an open source system based on our own Luwak library. We also replaced Autonomy IDOL with Apache Solr and helped Infomedia develop their own in-house query language, to prevent them becoming locked-in to any particular search technology. Indexing over 75 million news stories and running over 8000 complex stored queries over every new story as it appears, the new system is now in production and Infomedia were kind enough to say that ‘Flax’s expert knowledge has been invaluable’ (see the slides here). We celebrated after our talk at a spectacular Bollywood-themed gala dinner organised by Ninestars Global.

The week after I spoke at the Elasticsearch London Meetup with our client Westcoast on how we helped them build a better product search. Westcoast are the UK’s largest privately owned IT supplier and needed a fast and scalable search engine they could easily tune and adjust – we helped them build administration systems allowing boosts and editable synonym lists and helped them integrate Elasticsearch with their existing frontend systems. However, integrating with legacy systems is never a straightforward task and in particular we had to develop our own custom faceting engine for price and stock information. You can find out more in the slides here.

Search Solutions, my favourite search event of the year, was the next day and I particularly enjoyed hearing about Google’s powerful voice-driven search capabilities, our partner UXLab’s research into complex search strategies and Digirati and Synaptica’s complementary presentations on image search and the International Image Interoperability Framework (a standard way to retrieve images by URL). Tessa Radwan of our client NLA media access spoke about some of the challenges in measuring similar news articles (for example, slightly rewritten for each edition of a daily newspaper) as part of the development of the new version of their Clipshare system, a project we’ve carried out over the last year or so. I also spoke on Test Driven Relevance, a theme I’ll be expanding on soon: how we could improve how search engines are tested and measured (slides here).

Thanks to the organisers of all these events for all their efforts and for inviting us to talk: it’s great to be able to share our experiences building search engines and to learn from others.

The post Out and about in search & monitoring – Autumn 2015 appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2015/12/16/search-monitoring-autumn-2015/feed/ 0