Tom Winch – Flax http://www.flax.co.uk The Open Source Search Specialists Thu, 10 Oct 2019 09:03:26 +0000 en-GB hourly 1 https://wordpress.org/?v=4.9.8 XJoin for Solr, part 2: a click-through example http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/ http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/#respond Fri, 29 Jan 2016 09:39:00 +0000 http://www.flax.co.uk/?p=2941 In my last blog post, I demonstrated how to set up and configure Solr to use the new XJoin search components we’ve developed for the BioSolr project, using an example from an e-commerce setting. This time, I’ll show how to use … More

The post XJoin for Solr, part 2: a click-through example appeared first on Flax.

]]>
In my last blog post, I demonstrated how to set up and configure Solr to use the new XJoin search components we’ve developed for the BioSolr project, using an example from an e-commerce setting. This time, I’ll show how to use XJoin to make use of user click-through data to influence the score of products in searches.

I’ll step through things a bit quicker this time around and I’ll be using code from the last post so reading that first is highly recommended. I’ll assume that the prerequisites from last time have been installed and set up in the same directories.

The design

Suppose we have a web page for searching a collection of products, and when a user clicks on product listing in the result set (or perhaps, when they subsequently go on to buy that product – or both) we insert a record in an SQL database, storing the product id, the query terms they used, and an arbitrary weight value (which will depend on whether they merely clicked on a result, or if they went on to buy it, or some other behaviour such as mouse pointer tracking). We then want to use the click-through data stored in that database to boost products in searches that use those query terms again.

We could use the sum of the weights of all occurrences of a product id/query term combination as the product score boost, but then we might start to worry about a feedback process occurring. Alternatively, we might take the maximum or average weight across the occurrences. In the code below, we’ll use the maximum.

The advantage of this design over storing the click-through information in Solr is that you don’t have to update the Solr index every time there is user activity, which could become costly. An SQL database is much more suited to this task.

The external click-through API

Again, we’ll be using Python 3 (using the flask and sqlite3 modules) to implement the external API. I’ll be using this API to update the click-through database (by hand, for this example) as well as having Solr query it using XJoin. Here’s the code (partly based on code taken from here for caching the database connection in the Flask application context, and see here if you’re interested in more details about sqlite3’s support for full text search). Again, all the code written for this example is also available in the BioSolr GitHub repository:

from flask import Flask, request, g
import json
import sqlite3 as sql

# flask application context attribute for caching database connection
DB_APP_KEY = '_database'

# default weight for storing against queries
DEFAULT_WEIGHT = 1.0

app = Flask(__name__)

def get_db():
  """ Obtain a (cached) DB connection and return a cursor for it.
  """
  db = getattr(g, DB_APP_KEY, None)
  if db is None:
    db = sql.connect('click.db')
    setattr(g, DB_APP_KEY, db)
    c = db.cursor()
    c.execute("CREATE VIRTUAL TABLE IF NOT EXISTS click USING fts4 ("
                "id VARCHAR(256),"
                "q VARCHAR(256),"
                "weight FLOAT"
              ")")
    c.close()
  return db

@app.teardown_appcontext
def teardown_db(exception):
  db = getattr(g, DB_APP_KEY, None)
  if db is not None:
    db.close()

@app.route('/')
def main():
  return 'click-through API'

@app.route('/click/<path:id>', methods=["PUT"])
def click(id):
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  try:
    w = float(request.args.get('weight', DEFAULT_WEIGHT))
  except ValueError:
    return 'Could not parse weight', 400

  # do the DB update
  db = get_db()
  try:
    c = db.cursor()
    c.execute("INSERT INTO click (id, q, weight) VALUES (?, ?, ?)", (id, q, w))
    db.commit()
    return 'OK'
  finally:
    c.close()

@app.route('/ids')
def ids():
  # validate request
  if 'q' not in request.args:
    return 'Missing q parameter', 400
  q = request.args['q']
  
  # do the DB lookup
  try:
    c = get_db().cursor()
    c.execute("SELECT id, MAX(weight) FROM click WHERE q MATCH ? GROUP BY id", (q, ))
    return json.dumps([{ 'id': id, 'weight': w } for id, w in c])
  finally:
    c.close()

if __name__ == "__main__":
  app.run(port=8001, debug=True)

This web API exposes two end-points. First we have PUT /click/[id] which is used when we want to update the SQL database after a user click. For the purposes of this demonstration, we’ll be hitting this end-point by hand using curl to avoid having to write a web UI. The other end-point, GET /ids?[query terms], is used by our XJoin component and returns a JSON-formatted array of id/weight objects where the query terms from the database match those given in the query string.

Java glue code

Now we just need the Java glue code that sits between the XJoin component and our external API. Here’s an implementation of XJoinResultsFactory that does what we need:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class ClickXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
  }

  /**
   * Use 'click' REST API to fetch current click data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    String q = URLEncoder.encode(params.get("q"), "UTF-8");
    String apiUrl = url + "?q=" + q;
    try (HttpConnection http = new HttpConnection(apiUrl)) {
      JsonArray products = (JsonArray)http.getJson();
      return new ClickResults(products);
    }
  }
    
  public class ClickResults implements XJoinResults {
    private Map<String, Click> clickMap;
    
    public ClickResults(JsonArray products) {
      clickMap = new HashMap<>();
      for (JsonValue product : products) {
        JsonObject object = (JsonObject)product;
        String id = object.getString("id");
        double weight = object.getJsonNumber("weight").doubleValue();
        clickMap.put(id, new Click(id, weight));
      }
    }
    
    public int getCount() {
      return clickMap.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      return clickMap.keySet();
    }

    @Override
    public Object getResult(String id) {
      return clickMap.get(id);
    }      
  }
  
  public class Click {
    
    private String id;
    private double weight;
    
    public Click(String id, double weight) {
      this.id = id;
      this.weight = weight;
    }
    
    public String getId() {
      return id;
    }
    
    public double getWeight() {
      return weight;
    } 
  }
}

Unlike the previous example, this time getResults() does depend on the SolrParams argument, so that the user’s query, q, is passed to the external API. Store this Java source in blog/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java and compile into a JAR (again, we also need the HttpConnection class from the last blog post as well as javax.json-1.0.4.jar):

blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/ClickXJoinResultsFactory.java
blog$ jar cvf click.jar -C bin .

Solr configuration

Starting with a fresh version of solrconfig.xml, insert these lines near the start to import the XJoin and user JARs (substitute /XXX with the full path to the parent of the blog directory):

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/click.jar" />

And our request handler configuration:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="weight" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">weight</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_click" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.ClickXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8001/ids</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_click">false</bool>
    <str name="x_click.results">count</str>
    <str name="x_click.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_click</str>
  </arr>
  <arr name="last-components">
    <str>x_click</str>
  </arr>
</requestHandler>

Reload the Solr core (products) to get the new config in place.

Putting the pieces together

The following query will verify our Solr setup (remembering to escape curly brackets):

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&fl=id,name,score&rows=4' | jq .

I’ve used Solr parameter substitution with the q/qq parameters which will simplify later queries (this has been in Solr since 5.1). This query returns:

{
  "responseHeader": {
    "status": 0,
    "QTime": 25
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 2.9939778,
    "docs": [
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.9939778
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.8712361
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 1.8712361
      }
    ]
  }
}

Some repeat products in the data, but so far, so good. Next, get the click-through API running:

blog$ python3 click.py

And check it’s working (this should return [] whatever query is chosen because the click-through database is empty):

curl localhost:8001/ids?q=software | jq .

Now, let’s populate the click-through database by simulating user activity. Suppose, given the above product results, the user goes on to click through to the fourth product (or even buy it). Then, the UI would update the click web API to indicate this has happened. Let’s do this by hand, specifying the product id, the user’s query, and a weight score (here, I’ll use the value 3, supposing the user bought the product in the end):

curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/9200068133591804002?q=excel&weight=3'

Now, we can check the output that XJoin will see when using the click-through API:

blog$ curl localhost:8001/ids?q=excel | jq .

giving:

[
  {
    "weight": 3,
    "id": "http://www.google.com/base/feeds/snippets/9200068133591804002"
  }
]

Using the bf edismax parameter and the weight function set up in solrconfig.xml to extract the weight value from the external results stored in the x_click XJoin search component, we can boost product scores when they appear in the click-through database for the user’s query:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=excel&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

which gives:

{
  "responseHeader": {
    "status": 0,
    "QTime": 13
  },
  "response": {
    "numFound": 21,
    "start": 0,
    "maxScore": 3.2224145,
    "docs": [
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "score": 3.2224145
      },
      {
        "name": "individual software professor teaches excel and word",
        "id": "http://www.google.com/base/feeds/snippets/13017887935047670097",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/7197668762339216420",
        "score": 2.4895983
      },
      {
        "name": "individual software prm-xw3 professor teaches excel & word",
        "id": "http://www.google.com/base/feeds/snippets/16702106469790828707",
        "score": 1.5559989
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/9200068133591804002",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/9200068133591804002",
          "weight": 3
        }
      }
    ]
  }
}

Lo and behold, the product the user clicked on now appears top of the Solr results for the that query. Have a play with the API, generate some more user activity and see how this effects subsequent queries. It will cope fine with multiple-word queries, for example, suppose a user searches for ‘games software’:

curl 'http://localhost:8983/solr/products/xjoin?qq=games+software&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

There being no relevant queries in the click-through database, this has the same results as for a query without the XJoin, and as we can see, the value of response.x_click.count is 0:

{
  "responseHeader": {
    "status": 0,
    "QTime": 15
  },
  "response": {
    "numFound": 1158,
    "start": 0,
    "maxScore": 0.91356516,
    "docs": [
      {
        "name": "encore software 10568 - encore hoyle puzzle & board games 2005 - complete product - puzzle game - 1 user - complete product - standard - pc",
        "id": "http://www.google.com/base/feeds/snippets/4998847858583359731",
        "score": 0.91356516
      },
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 0.8699497
      },
      {
        "name": "encore software 10027 - hoyle board games (win 98 me 2000 xp)",
        "id": "http://www.google.com/base/feeds/snippets/8664755713112971171",
        "score": 0.85982025
      },
      {
        "name": "encore software 11253 - brain food games: cranium collection 2006 sb cs by encore",
        "id": "http://www.google.com/base/feeds/snippets/15401280256033043239",
        "score": 0.78744644
      }
    ]
  },
  "x_click": {
    "count": 0,
    "external": []
  }
}

Now let’s simulate the same user clicking on the second product (with default weight):

blog$ curl -XPUT 'localhost:8001/click/http://www.google.com/base/feeds/snippets/826668451451666270?q=games+software'

Next, suppose another user then searches for just ‘games’:

blog$ curl 'http://localhost:8983/solr/products/xjoin?qq=games&q=$\{qq\}&x_click=true&x_click.external.q=$\{qq\}&bf=weight(x_click)^4&fl=id,name,score&rows=4' | jq .

In the results, we see the ‘wild games’ product boosted to the top:

{
  "responseHeader": {
    "status": 0,
    "QTime": 60
  },
  "response": {
    "numFound": 212,
    "start": 0,
    "maxScore": 1.3652229,
    "docs": [
      {
        "name": "encore software 11141 - fate sb cs by wild games",
        "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "score": 1.3652229
      },
      {
        "name": "xbox 360: ddr universe",
        "id": "http://www.google.com/base/feeds/snippets/16659259513615352372",
        "score": 0.95894843
      },
      {
        "name": "south park chef's luv shack",
        "id": "http://www.google.com/base/feeds/snippets/11648097795915093399",
        "score": 0.95894843
      },
      {
        "name": "egames. inc casual games pack",
        "id": "http://www.google.com/base/feeds/snippets/16700933768709687512",
        "score": 0.89483213
      }
    ]
  },
  "x_click": {
    "count": 1,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/826668451451666270",
        "doc": {
          "id": "http://www.google.com/base/feeds/snippets/826668451451666270",
          "weight": 1
        }
      }
    ]
  }
}

Extensions

Of course, this approach can be extended to add in more sophisticated weighting and boosting strategies, or include more data about the user activity than just a simple weight score, which could be used to augment the display of the product in the UI (for example, “ten customers in the UK bought this product in the last month”).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.). We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

The post XJoin for Solr, part 2: a click-through example appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/01/29/xjoin-solr-part-2-click-example/feed/ 0
XJoin for Solr, part 1: filtering using price discount data http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/ http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/#comments Mon, 25 Jan 2016 10:04:28 +0000 http://www.flax.co.uk/?p=2928 In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an … More

The post XJoin for Solr, part 1: filtering using price discount data appeared first on Flax.

]]>
In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

The XJoin component can be used when you want values from some source external to Solr to filter or influence the score of hits in your Solr result set.  It is currently available as a Solr patch on the XJoin JIRA ticket SOLR-7341, so to use it, you’ll need to check out a version of Apache Lucene/Solr using Subversion, then patch and build it (see below for details).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.). We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

Patching SOLR

I’m going to be using Solr version 5.3 for this blog. If you’re following along, check out a clean copy using Subversion:

$ svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3

Download the XJoin patch (find the one corresponding to this version of Solr on the JIRA ticket) into the newly checked-out directory, and apply it:

lucene_solr_5_3$ svn patch SOLR-7341.patch-5_3

And then build Solr from the solr sub-directory:

lucene_solr_5_3/solr$ ant server

We should now be able to start the patched Solr server:

lucene_solr_5_3/solr$ bin/solr start

Indexing a sample product data set

I’ll be using a sample Google product feed, GoogleProducts.csv, which I got from here. Create a new directory called blog (mine has the same parent as my Solr check-out) and download the sample into it. It’s in CSV format, with columns for product id, name, description, manufacturer and price. Indexing this will be a piece of cake!

We’ll begin with a copy of the sample Solr config directory:

blog$ cp -r ../lucene_solr_5_3/solr/server/solr/configsets/basic_configs/conf .

Modify conf/schema.xml so that our Solr documents have fields corresponding to those in the CSV file:

<field name="id" type="string" indexed="true" /> 
<field name="name" type="text_en" indexed="true" />
<field name="description" type="text_en" indexed="true" />
<field name="manufacturer" type="string" indexed="true" />
<field name="price" type="float" indexed="true" />

Naturally, the product id will serve as the Solr unique key:

<uniqueKey>id</uniqueKey>

We can use the sample solrconfig.xml as is for now. Add a core called products using the Solr core admin UI (as you started a Solr server above, this should be available at   http://localhost:8983/solr/#/~cores). The values for instanceDir and dataDir will both be the full path of the blog directory.

I’ll be using Python to index the product data. The code is written for Python 3, and won’t work in Python 2.x because of character encoding issues in the csv module, but you can fix it by using a UTF8Recoder as described in the module documentation. Here’s my indexing script (note that all the code written for this example is also available in the BioSolr GitHub repository):

import sys
import csv
import json
import requests

def value(k, v):
    return k, v.strip() if k != 'price' else float(v.split()[0])

def read(path):
    with open(path, encoding='iso-8859-1') as f:
        reader = csv.DictReader(f)
        for doc in reader:
            yield dict(value(k, v) for k, v in doc.items()
                       if len(v.strip()) > 0)

def index(url, docs):
    print("Sending {0} documents to {1}".format(len(docs), url))
    data = json.dumps(docs)
    headers = { 'content-type': 'application/json' }
    r = requests.post(url, data=data, headers=headers)
    if r.status_code != 200:
      raise IOError("Bad SOLR update")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {0} <Solr update URL> <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    docs = list(read(sys.argv[2]))
    index(sys.argv[1], docs)

The script tidies up the prices because they aren’t consistently formatted, converting them to float values. Save the script in index.py and use it to index the Google product data into Solr (let’s force commits, just to be sure):

blog$ python3 index.py http://localhost:8983/solr/products/update?commit=true GoogleProducts.csv

And, lo and behold, we can see our data in Solr using cURL (I like to pipe the output through jq to get nicely formatted JSON):

curl 'localhost:8983/solr/products/select?wt=json&q=*' | jq .

So, using Solr we’ve now built a full text product search in only a few minutes, with potentially all the add-ons Solr provides out of the box. However, suppose there is supplementary information about the products, available from an external source (which might not be under our control).

I will now demonstrate how to configure Solr so that during a product search, the external source is also queried (either with the same user query or something different) and the resulting external data used to influence the result set. Each external result is ‘joined’ against a Solr document via a ‘join field’ or ‘join id’, which doesn’t have to be the Solr unique id (in the examples below I use the product id and manufacturer as the join fields). To get an ‘inner join’ I will use the XJoinQParserPlugin to turn the external ids into a filter query, but it’s also possible to build boost queries or use the XJoinValueSourceParser to use external values in a boost function. You can see all this implemented below.

Product discount offers example

In the first of my examples, I’ll set up filtering and score boosting based on discount offers, the external source for which is going to be a web service, which I’m going to make available locally on the URL http://localhost:8000/offers.  Again, I’ll implement this in Python, using the popular Flask web server micro-framework and the module requests.  Install both of these using pip (I need sudo, but you might not):

blog$ sudo pip install flask requests

Creating the external source

Here’s my code for the product offers web API:

from flask import Flask
from index import read
import json
import random
import sys

app = Flask(__name__)

@app.route('/')
def main():
    return json.dumps({ 'info': 'product offers API' })

@app.route('/products')
def products():
    offer = lambda doc: {
                'id': doc['id'],
                'discountPct': random.randint(1, 80)
            }
    return json.dumps([offer(doc) for doc
                       in random.sample(app.docs, 64)])

@app.route('/manufacturers')
def manufacturer():
  manufacturers = set(doc['manufacturer'] for doc in app.docs
                      if 'manufacturer' in doc)
  deal = lambda m: {
             'manufacturer': m,
             'discountPct': random.randint(1, 10) * 5
         }
  return json.dumps([deal(m) for m
                     in random.sample(manufacturers, 3)])

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print("Usage: {0} <CSV file>".format(sys.argv[0]))
    sys.exit(1)

  app.docs = list(read(sys.argv[1]))
  app.run(port=8000, debug=True)

The code generates discounts for a random selection of products and manufacturers. Save it to blog/offer.py and start the server, supplying the Google products CSV file on the command line:

blog$ python3 offer.py GoogleProducts.csv

Now, test it out using cURL (again, I like to pipe through jq to get nicely formatted JSON):

$ curl -s localhost:8000/products | jq .

You should see a list of objects, each with a product id and a discount percentage, something like:

[
  {
    "discountPct": 41,
    "id": "http://www.google.com/base/feeds/snippets/18100341066456401733"
  },
  {
    "discountPct": 63,
    "id": "http://www.google.com/base/feeds/snippets/16969493842479402672"
  },
  {
    "discountPct": 13,
    "id": "http://www.google.com/base/feeds/snippets/10357785197400989441"
  },
  {
    "discountPct": 35,
    "id": "http://www.google.com/base/feeds/snippets/2813321165033737171"
  },
  {
    "discountPct": 27,
    "id": "http://www.google.com/base/feeds/snippets/15203735208016659510"
  },
  ...
]

You get similar output if you use the /manufacturers endpoint:

$ curl -s localhost:8000/manufacturers | jq .

This time, we get a shorter list, of manufacturers each with a discount percentage, for example:

[
  {
    "discountPct": 15,
    "manufacturer": "freeverse software"
  },
  {
    "discountPct": 5,
    "manufacturer": "pinnacle systems"
  },
  {
    "discountPct": 50,
    "manufacturer": "destineer inc"
  }
]

Creating XJoin glue code

To bridge the gap between Solr and our external data source, XJoin requires some glue code, written in Java, to query the source and return the results. First, I’ll create a quick utility class to help with HTTP connections:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.json.Json;
import javax.json.JsonReader;
import javax.json.JsonStructure;

public class HttpConnection implements AutoCloseable {
  private HttpURLConnection http;
  
  public HttpConnection(String url) throws IOException {
    http = (HttpURLConnection)new URL(url).openConnection();
  }
  
  public JsonStructure getJson() throws IOException {
    http.setRequestMethod("GET");
    http.setRequestProperty("Accept", "application/json");
    try (InputStream in = http.getInputStream();
         JsonReader reader = Json.createReader(in)) {
      return reader.read();
    }
  }
  
  @Override
  public void close() {
    http.disconnect();
  }
}

Save this as blog/java/uk/co/flax/examples/xjoin/HttpConnection.java. The glue code we need is fairly simple, and can be written as a single class, implementing the XJoinResultsFactory interface:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class OfferXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  private String field;
  private String discountField;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
    field = (String)args.get("field");
    discountField = (String)args.get("discountField");
  }

  /**
   * Use 'offers' REST API to fetch current offer data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    try (HttpConnection http = new HttpConnection(url)) {
      JsonArray offers = (JsonArray)http.getJson();
      return new OfferResults(offers);
    }
  }
   
  /**
   * Results of the external search - methods like getXXX() are used
   * to expose the property XXX in the SOLR results.
   */
  public class OfferResults implements XJoinResults {
    private JsonArray offers;
    
    public OfferResults(JsonArray offers) {
      this.offers = offers;
    }
    
    public int getCount() {
      return offers.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      List ids = new ArrayList<>();
      for (JsonValue offer : offers) {
        ids.add(((JsonObject)offer).getString(field));
      }
      return ids;
    }

    @Override
    public Object getResult(String joinIdStr) {
      for (JsonValue offer : offers) {
        String id = ((JsonObject)offer).getString(field);
        if (id.equals(joinIdStr)) {
          return new Offer(offer);
        }
      }
      return null;
    }
  }
  
  /**
   * A discount offer - methods like getXXX() are used to expose
   * properties that can be joined with each Solr result via the join
   * id field.
   */
  public class Offer {
    private JsonValue offer;
    
    public Offer(JsonValue offer) {
      this.offer = offer;
    }
    
    public double getDiscount() {
      return ((JsonObject)offer).getInt(discountField) * 0.01d;
    }
  }
}

Here, the init() method initialises the URL for the external API and the names of the values we want to pick out from the external data. The getResults() method connects to the external API – since in this example, the discounts do not depend on the user’s query, we don’t use the SolrParams argument at all. It returns an implementation of XJoinResults, which must be able to return a collection of join ids (so, the value of the join id field for each external result), and also be able to return an external result object given a join id. Together, the XJoinResults object and each external result object contain the results of the external search, exposed via getXXX() methods (which are mapped to properties called XXX) and (once everything is plumbed in) available to Solr for filtering, affected the scores of documents, or for inclusion in the results set.

Save the above as blog/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java. You’ll also need javax.json-1.0.4.jar, which you can download from here if you don’t already have it – place it in the blog directory. Compile the two Java source files, and create a JAR to contain the resulting .class files:

blog$ mkdir bin
blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java
blog$ jar cvf offer.jar -C bin .

Configuring XJoin

So now – at last! – I’ll configure a Solr query handler that uses the XJoin Solr plugin components to add filters and boost queries based on the external data.

I’ll be working with blog/conf/solrconfig.xml now. The first thing to do is include the contrib JARs for XJoin and our glue code JAR (offer.jar) in <lib> directives near the top of the config file. To do that, add in the following snippet just under the <dataDir> directive:

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/offer.jar" />

Here, you need to substitute /XXX with the full path to the parent of the blog directory.  (We need to include javax.json-1.0.4.jar because it’s a dependency of our offer.jar.) Now for the request handler config – I’ll include everything we’re going to need even though it won’t all be used straightaway:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="discount" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">discount</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_product_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8000/products</str>
    <str name="field">id</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<searchComponent name="x_manufacturer_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">manufacturer</str>
  <lst name="external">
    <str name="url">http://localhost:8000/manufacturers</str>
    <str name="field">manufacturer</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_product_offers">false</bool>
    <str name="x_product_offers.results">count</str>
    <str name="x_product_offers.fl">*</str>

    <bool name="x_manufacturer_offers">false</bool>
    <str name="x_manufacturer_offers.results">count</str>
    <str name="x_manufacturer_offers.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
  <arr name="last-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
</requestHandler>

Insert this request handler config somewhere near the bottom of solrconfig.xml.

Using XJoin in a query

Let’s quickly get a query working, then I’ll explain what all the components that I’ve included do. Try this (remembering to escape curly brackets on the command line):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .

You should see output like this (I’ve edited responseHeader.params for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 22,
    "params": {
      "x_product_offers": "true", 
      "x_product_offers.results": "count",
      "x_product_offers.fl": "*", 
      "q": "*", 
      "fq": "{!xjoin}x_product_offers", 
      "fl": "id,name",
      "rows": "4"
    }
  },
  "response": {
    "numFound": 64,
    "start": 0,
    "docs": [
      {
        "name": "did0480p-m311 plasmon additional maintenance 24x7 - plasmon diamond technical support - consul",
        "id": "http://www.google.com/base/feeds/snippets/13522752516373728128"
      },
      {
        "name": "apple ilife '06 family pack",
        "id": "http://www.google.com/base/feeds/snippets/10939909441298262260"
      },
      {
        "name": "adobe cs3 web standard upsell",
        "id": "http://www.google.com/base/feeds/snippets/8042583218932085904"
      },
      {
        "name": "the richard friedman trio motown hits - *(for the tg-100)*",
        "id": "http://www.google.com/base/feeds/snippets/17853905518738313346"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13522752516373728128",
        "doc": {
          "discount": 0.11
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/10939909441298262260",
        "doc": {
          "discount": 0.76
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/8042583218932085904",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17853905518738313346",
        "doc": {
          "discount": 0.05
        }
      }
    ]
  }
}

Here you can see the usual Solr output with our product documents in the response.docs array. Notice the value of response.numFound is only 64 out of a possible 3226. Additionally, we have an extra section, response.x_product_offers, that gives us results from the external offers API – count tells us the total number of external results found, and there is an external result object with a join id matching each hit in the Solr results.

The query we made to get these results is a combination of the parameters in the request handler, and those in the URL’s query string – I’ve left the pertinent ones in responseHeader.paramsThe first parameter, x_product_offers=true, turns on the XJoin component that talks to the offers API, so that at query time, it will make a connection and retrieve external results (note that in this case, no parameters are passed to the external API – the following blog post will demonstrate this). The following two parameters control which fields are output from the external results – the .results option is a field list which controls the fields returned from the OfferResults object (that’s our implementation of XJoinResults – see the code above – there is one OfferResults object per external request and it acts as a collection of the returned external results). Then the .fl option is another field list which controls the fields returned for each external result object – these values can be used for filtering, boosting, and so on (for more on which, see below).

The parameters q=*, fl=id,name and rows=4 have their usual effects. The really interesting parameter is the filter query:

fq={!xjoin}x_product_offers

This uses Solr local parameters “short-form” syntax to reference the XJoinQParserPlugin that was set up in solrconfig.xml (it doesn’t take any initialisation parameters). This component uses the join ids from the referenced XJoin component to create a query that ORs together terms like join_field:join_id (one for each external result). It is based on the Solr built-in TermsQParserPlugin and supports the same method parameter (but this can usually be omitted). So, here, it makes a filter based on the join ids returned by the offers API – thus, only the products which have a current offer are returned.

Note that we could have used the same syntax in just the q parameter to achieve the same effect, but it’s more usual that a user full text query is specified in and a ‘join’ created using a filter query.

Using the XJoinValueSourceParser

The XJoinValueSourceParser component that we have configured in solrconfig.xml provides us with a function, discount, that we can use in a function query. I configured the component to extract the value of discount from external results, and we supply an XJoin component name as the argument – this is a reference to a set of external results.

This opens up lots of possibilities, for example, a search in which each product’s score is  boosted by a reciprocal function of the price including discount (so cheaper products, after discounting, are boosted higher):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,price,score&rows=4' | jq .

which results in a response something like (again, with responseHeader.params edited for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
       "x_product_offers": "true", 
       "x_product_offers.results": "count", 
       "x_product_offers.fl": "*", 
       "q": "*", 
       "bf": "recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2",
       "fl": "id,price,score",
       "rows": "4"
     }
   },
  "response": {
    "numFound": 3226,
    "start": 0,
    "maxScore": 1.3371909,
    "docs": [
      {
        "id": "http://www.google.com/base/feeds/snippets/549551716004314019",
        "price": 0.5,
        "score": 1.3371909
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "price": 8.49,
        "score": 1.325241
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "price": 9.9,
        "score": 1.3166784
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/18427513736767114578",
        "price": 2.99,
        "score": 1.3156738
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "doc": {
          "discount": 0.71
        }
      }
    ]
  }
}

This time, because we haven’t applied on a filter based on the external join ids, we still have the full set of documents in the results set (3226 in total). Note that although there are 4 results in response.docs (as requested by rows=4), there are only 2 external results in x_product_offers.external – this is because only 2 of those 4 Solr documents have matching external results (in that they have the same value of join id in the join field, which in this case is the product id). In other words, only 2 out of the 4 products returned have discounts offered.

To achieve the price boost, instead of a filter query, we have a boost function:

bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2

For each Solr document in the results set, the value of the expression discount(x_product_offers) is found by calling getDiscount() on the matching external result  in the x_product_offers XJoin search component. When there is no matching external result, the default value 0.0 is used, as configured for the value source parser in solrconfig.xml, which is equivalent to a 0% discount.

Of course, instead of the match-all q=* query, we can do an actual product search with our price boost, for example, q=apple. To be more sophisticated, we can also use the edismax parameter qf to query across both the name and description fields and weight them as we desire, for example, qf=name^4 description^2 or similar.

Joining on a field other than the unique id field

The join field does not have to correspond to the Solr unique id field. As seen above, the offers web API also returns discounts based on manufacturer (using the /manufacturers end-point). I configured another XJoin search component in solrconfig.xml called x_manufacturer_offers, the only differences from x_product_offers being the join field, which is now manufacturer, and the field which is taken from the external results to be the join value, which is of course the same, manufacturer.

So, now for example we can do a weighted query for “games software”, but restricting to products that have a manufacturer discount of at least 20%:

blog$ curl 'localhost:8983/solr/products/xjoin?q=software&qf=name^4+description^2&x_manufacturer_offers=true&fq=\{!frange+l=0.2\}discount(x_manufacturer_offers)&fl=*&rows=4' | jq .

See FunctionRangeQParserPlugin for details of the filter query used in this search. This gives something like (responseHeader.params omitted this time):

{
  "responseHeader": {
    "status": 0,
    "QTime": 4
  },
  "response": {
    "numFound": 25,
    "start": 0,
    "maxScore": 1.1224447,
    "docs": [
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/7436299398173390476",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17001745805951209994",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/10584509515076384561",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329559531500
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17283219592038470822",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329681166300
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "freeverse software",
        "doc": {
          "discount": 0.2
        }
      }
    ]
  }
}

In this case, there was only one manufacturer represented in the requested top 4 rows of the Solr results set.

Using two XJoin components in the same query

It’s worth noting that you can use more than one XJoin component in the same query. You can come up with more complicated examples, but this one shows how to query for all products that have a manufacturer discount as well as a product discount:

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_manufacturer_offers=true&fq=\{!xjoin\}x_product_offers&fq=\{!xjoin\}x_manufacturer_offers&fl=id,name,manufacturer&rows=4&wt=json' | jq .

You might have to try again a few times before you get a non-empty result set – here’s one I got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "name": "apple software m8789z/a webobjects 5.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/4776201646741876078"
      },
      {
        "name": "apple software m9301z/b soundtrack v1.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/16537637847870148950"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/4776201646741876078",
        "doc": {
          "discount": 0.59
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/16537637847870148950",
        "doc": {
          "discount": 0.22
        }
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "apple software",
        "doc": {
          "discount": 0.3
        }
      }
    ]
  }
}

So you can see that there are two external results sections, one for product offers and one for manufacturer offers, and how the offers are matched to the products by the join ids (which is either the product id, or the manufacturer).

Next time…

In my next blog post, I’ll dive in to another demonstration of XJoin, in which I show how to use click-through data to influence the score of subsequent searches.

The post XJoin for Solr, part 1: filtering using price discount data appeared first on Flax.

]]>
http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/feed/ 5