London Lucene/Solr Meetup – Java 9 & 1 Beeelion Documents with Alfresco
8 February 2018

This time Pivotal were our kind hosts for the London Lucene/Solr Meetup, providing a range of goodies including some frankly enormous pizzas – thanks Costas and colleagues, we couldn’t have done it without you!

Our first talk was from Uwe Schindler, Lucene committer, who started with some history of how early Java 7 releases broke Apache Lucene in somewhat spectacular fashion. After this incident the Oracle JDK team and the Lucene PMC worked closely together to improve both communications and testing, with regular builds of Java 8 being tested against Lucene using Jenkins. The Oracle team later publicly thanked the Lucene committers for their help in finding Java issues. Uwe told us how Java 9 introduced a module system named ‘Jigsaw’ which tidied up various inconsistencies in how Java keeps certain APIs private (but not actually private) – this caused some problems with Solr. Uwe also mentioned how Lucene’s MMapDirectory should be used on 64-bit platforms (there’s a lot more detail on his blog) and various intrinsic bounds-checking features which can be used to simplify Lucene code. The three main advantages of Java 9 he mentioned were lower garbage collection pause times (with the new G1GC collector), more security features and in some cases better query performance. Going forward, Uwe is already looking at Java 10 and future versions and how they impact Lucene – but for now he’s been kind enough to share his slides from the Meetup.
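
For the curious, opening an index with MMapDirectory is a one-liner – here’s a minimal sketch (assuming Lucene 5.x or later on a 64-bit JVM; the index path is hypothetical):

import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;

public class DirectoryExample {
  public static void main(String[] args) throws IOException {
    // FSDirectory.open() already picks MMapDirectory on 64-bit platforms...
    Directory auto = FSDirectory.open(Paths.get("/path/to/index"));
    // ...but it can also be requested explicitly
    Directory mmap = new MMapDirectory(Paths.get("/path/to/index"));
    System.out.println(auto.getClass().getSimpleName());
    auto.close();
    mmap.close();
  }
}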

Our second speaker was Andy Hind, head of search at Alfresco. His presentation included the obligatory Austin Powers references, of course! He described the architecture Alfresco uses for search (a recent blog post also shows this – interestingly, although Solr is used, Zookeeper is not: Alfresco uses its own method to coordinate many Solr servers in a cluster). The test system described ran on the Amazon EC2 cloud with 10 Alfresco nodes and 20 Solr nodes and indexed around 1.168 billion items. The source data was synthetically generated to simulate real-world conditions with a certain amount of structure – this allowed queries to be built to hit particular areas of the data. 5,000 users were set up, with around 500 of them assumed to be searching concurrently. The system indexed the content in around 5 days, at a rate of roughly 1,000 documents a second, which is impressive.

Thanks to both our speakers and we’ll return soon – if you have a talk for our group (or can host a Meetup) do please get in touch.

Working with Hadoop, Kafka, Samza and the wider Big Data ecosystem
3 March 2016

We’ve been working on a number of projects recently involving open source software often described as ‘Big Data’ solutions – here’s a quick overview of them.

The grandfather of them all of course is Apache Hadoop, now not so much a single project as an ecosystem including storage and processing for potentially huge amounts of data, spread across clusters of machines. Interestingly Hadoop was originally created by Doug Cutting, who also wrote Lucene (the search library used by Apache Solr and Elasticsearch) and the Nutch web crawler. We’ve been helping clients distribute processing tasks using Hadoop’s MapReduce algorithm and also to speed up their indexing from Hadoop into Elasticsearch. Other projects we’ve used in the Hadoop ecosystem include Apache Zookeeper (used to coordinate lots of Solr servers into a distributed SolrCloud) and Apache Spark (for distributed processing).

We’re increasingly using Apache Kafka (a message broker) for handling large volumes of streaming data, for example log files. Kafka provides persistent storage of these streams, which might be ingested and pre-processed using Logstash, then indexed with Elasticsearch and visualised with Kibana to build high-performance monitoring systems. Throughput of thousands of items a second is not uncommon, and these open source systems can easily match the performance of proprietary monitoring engines such as Splunk at a far lower cost. Apache Samza, a stream processing framework, is based on Kafka, and we’ve used it to build a powerful system for full-text search over streams. Note that Elasticsearch has a similar ‘stored search’ feature called Percolator, but this is quite a lot slower (as others have confirmed).
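
To give a flavour of how simple the Kafka side can be, here’s a minimal sketch of a producer pushing log lines to a topic (the broker address and the ‘logs’ topic are assumptions for illustration):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Assumed local broker; plain String keys and values
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Keying by host keeps each host's log lines ordered on one partition
      producer.send(new ProducerRecord<>("logs", "web-01", "GET /index.html 200"));
    }
  }
}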

Most of the above systems are written in Java or, if not, run on the Java Virtual Machine (JVM), so our experience building large, performant and resilient systems on this platform has been invaluable. We’ll be writing in more detail about these projects soon. I’ve always said that search experts have been dealing with Big Data since well before it gained popularity as a concept – so if you’re serious about Big Data, ask us how we could help!

XJoin for Solr, part 1: filtering using price discount data
25 January 2016

In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

The XJoin component can be used when you want values from some source external to Solr to filter or influence the score of hits in your Solr result set.  It is currently available as a Solr patch on the XJoin JIRA ticket SOLR-7341, so to use it, you’ll need to check out a version of Apache Lucene/Solr using Subversion, then patch and build it (see below for details).

The XJoin patch was developed as part of the BioSolr project but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

Patching Solr

I’m going to be using Solr version 5.3 for this blog. If you’re following along, check out a clean copy using Subversion:

$ svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3

Download the XJoin patch (find the one corresponding to this version of Solr on the JIRA ticket) into the newly checked-out directory, and apply it:

lucene_solr_5_3$ svn patch SOLR-7341.patch-5_3

And then build Solr from the solr sub-directory:

lucene_solr_5_3/solr$ ant server

We should now be able to start the patched Solr server:

lucene_solr_5_3/solr$ bin/solr start

Indexing a sample product data set

I’ll be using a sample Google product feed, GoogleProducts.csv, which I got from here. Create a new directory called blog (mine has the same parent as my Solr check-out) and download the sample into it. It’s in CSV format, with columns for product id, name, description, manufacturer and price. Indexing this will be a piece of cake!

We’ll begin with a copy of the sample Solr config directory:

blog$ cp -r ../lucene_solr_5_3/solr/server/solr/configsets/basic_configs/conf .

Modify conf/schema.xml so that our Solr documents have fields corresponding to those in the CSV file:

<field name="id" type="string" indexed="true" /> 
<field name="name" type="text_en" indexed="true" />
<field name="description" type="text_en" indexed="true" />
<field name="manufacturer" type="string" indexed="true" />
<field name="price" type="float" indexed="true" />

Naturally, the product id will serve as the Solr unique key:

<uniqueKey>id</uniqueKey>

We can use the sample solrconfig.xml as is for now. Add a core called products using the Solr core admin UI (as you started a Solr server above, this should be available at http://localhost:8983/solr/#/~cores). The values for instanceDir and dataDir will both be the full path of the blog directory.
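
If you prefer the command line, the core admin API does the same job – a sketch, where /XXX should be replaced with the full path to the parent of the blog directory:

blog$ curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=products&instanceDir=/XXX/blog&dataDir=/XXX/blog'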

I’ll be using Python to index the product data. The code is written for Python 3, and won’t work in Python 2.x because of character encoding issues in the csv module, but you can fix it by using a UTF8Recoder as described in the module documentation. Here’s my indexing script (note that all the code written for this example is also available in the BioSolr GitHub repository):

import sys
import csv
import json
import requests

def value(k, v):
    # Tidy each field: strip whitespace, and convert prices (which may
    # carry trailing currency text) to a float from the leading number
    return k, v.strip() if k != 'price' else float(v.split()[0])

def read(path):
    # Yield one dict per CSV row, omitting empty fields
    with open(path, encoding='iso-8859-1') as f:
        reader = csv.DictReader(f)
        for doc in reader:
            yield dict(value(k, v) for k, v in doc.items()
                       if len(v.strip()) > 0)

def index(url, docs):
    # POST the documents as JSON to the Solr update endpoint
    print("Sending {0} documents to {1}".format(len(docs), url))
    data = json.dumps(docs)
    headers = { 'content-type': 'application/json' }
    r = requests.post(url, data=data, headers=headers)
    if r.status_code != 200:
        raise IOError("Bad SOLR update")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {0} <Solr update URL> <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    docs = list(read(sys.argv[2]))
    index(sys.argv[1], docs)

The script tidies up the prices because they aren’t consistently formatted, converting them to float values. Save the script in index.py and use it to index the Google product data into Solr (let’s force commits, just to be sure):

blog$ python3 index.py http://localhost:8983/solr/products/update?commit=true GoogleProducts.csv

And, lo and behold, we can see our data in Solr using cURL (I like to pipe the output through jq to get nicely formatted JSON):

curl 'localhost:8983/solr/products/select?wt=json&q=*' | jq .

So, using Solr we’ve now built a full text product search in only a few minutes, with potentially all the add-ons Solr provides out of the box. However, suppose there is supplementary information about the products, available from an external source (which might not be under our control).

I will now demonstrate how to configure Solr so that during a product search, the external source is also queried (either with the same user query or something different) and the resulting external data used to influence the result set. Each external result is ‘joined’ against a Solr document via a ‘join field’ or ‘join id’, which doesn’t have to be the Solr unique id (in the examples below I use the product id and manufacturer as the join fields). To get an ‘inner join’ I will use the XJoinQParserPlugin to turn the external ids into a filter query, but it’s also possible to build boost queries or use the XJoinValueSourceParser to use external values in a boost function. You can see all this implemented below.

Product discount offers example

In the first of my examples, I’ll set up filtering and score boosting based on discount offers, the external source for which is going to be a web service that I’m going to make available locally at http://localhost:8000 (with end-points /products and /manufacturers, described below). Again, I’ll implement this in Python, using the popular Flask web micro-framework and the requests module. Install both of these using pip (I need sudo, but you might not):

blog$ sudo pip install flask requests

Creating the external source

Here’s my code for the product offers web API:

from flask import Flask
from index import read
import json
import random
import sys

app = Flask(__name__)

@app.route('/')
def main():
    return json.dumps({ 'info': 'product offers API' })

@app.route('/products')
def products():
    # Random discounts of 1-80% for a random sample of 64 products
    offer = lambda doc: {
                'id': doc['id'],
                'discountPct': random.randint(1, 80)
            }
    return json.dumps([offer(doc) for doc
                       in random.sample(app.docs, 64)])

@app.route('/manufacturers')
def manufacturer():
    # Random discounts of 5-50% (in steps of 5) for 3 random manufacturers;
    # list() is needed because random.sample rejects sets on Python 3.11+
    manufacturers = set(doc['manufacturer'] for doc in app.docs
                        if 'manufacturer' in doc)
    deal = lambda m: {
               'manufacturer': m,
               'discountPct': random.randint(1, 10) * 5
           }
    return json.dumps([deal(m) for m
                       in random.sample(list(manufacturers), 3)])

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: {0} <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    app.docs = list(read(sys.argv[1]))
    app.run(port=8000, debug=True)

The code generates discounts for a random selection of products and manufacturers. Save it to blog/offer.py and start the server, supplying the Google products CSV file on the command line:

blog$ python3 offer.py GoogleProducts.csv

Now, test it out using cURL (again, I like to pipe through jq to get nicely formatted JSON):

$ curl -s localhost:8000/products | jq .

You should see a list of objects, each with a product id and a discount percentage, something like:

[
  {
    "discountPct": 41,
    "id": "http://www.google.com/base/feeds/snippets/18100341066456401733"
  },
  {
    "discountPct": 63,
    "id": "http://www.google.com/base/feeds/snippets/16969493842479402672"
  },
  {
    "discountPct": 13,
    "id": "http://www.google.com/base/feeds/snippets/10357785197400989441"
  },
  {
    "discountPct": 35,
    "id": "http://www.google.com/base/feeds/snippets/2813321165033737171"
  },
  {
    "discountPct": 27,
    "id": "http://www.google.com/base/feeds/snippets/15203735208016659510"
  },
  ...
]

You get similar output if you use the /manufacturers endpoint:

$ curl -s localhost:8000/manufacturers | jq .

This time, we get a shorter list, of manufacturers each with a discount percentage, for example:

[
  {
    "discountPct": 15,
    "manufacturer": "freeverse software"
  },
  {
    "discountPct": 5,
    "manufacturer": "pinnacle systems"
  },
  {
    "discountPct": 50,
    "manufacturer": "destineer inc"
  }
]

Creating XJoin glue code

To bridge the gap between Solr and our external data source, XJoin requires some glue code, written in Java, to query the source and return the results. First, I’ll create a quick utility class to help with HTTP connections:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.json.Json;
import javax.json.JsonReader;
import javax.json.JsonStructure;

public class HttpConnection implements AutoCloseable {
  private HttpURLConnection http;
  
  public HttpConnection(String url) throws IOException {
    http = (HttpURLConnection)new URL(url).openConnection();
  }
  
  public JsonStructure getJson() throws IOException {
    http.setRequestMethod("GET");
    http.setRequestProperty("Accept", "application/json");
    try (InputStream in = http.getInputStream();
         JsonReader reader = Json.createReader(in)) {
      return reader.read();
    }
  }
  
  @Override
  public void close() {
    http.disconnect();
  }
}

Save this as blog/src/java/uk/co/flax/examples/xjoin/HttpConnection.java (matching the source path used in the compile step below). The glue code we need is fairly simple, and can be written as a single class, implementing the XJoinResultsFactory interface:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class OfferXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  private String field;
  private String discountField;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
    field = (String)args.get("field");
    discountField = (String)args.get("discountField");
  }

  /**
   * Use 'offers' REST API to fetch current offer data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    try (HttpConnection http = new HttpConnection(url)) {
      JsonArray offers = (JsonArray)http.getJson();
      return new OfferResults(offers);
    }
  }
   
  /**
   * Results of the external search - methods like getXXX() are used
   * to expose the property XXX in the SOLR results.
   */
  public class OfferResults implements XJoinResults {
    private JsonArray offers;
    
    public OfferResults(JsonArray offers) {
      this.offers = offers;
    }
    
    public int getCount() {
      return offers.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      List ids = new ArrayList<>();
      for (JsonValue offer : offers) {
        ids.add(((JsonObject)offer).getString(field));
      }
      return ids;
    }

    @Override
    public Object getResult(String joinIdStr) {
      for (JsonValue offer : offers) {
        String id = ((JsonObject)offer).getString(field);
        if (id.equals(joinIdStr)) {
          return new Offer(offer);
        }
      }
      return null;
    }
  }
  
  /**
   * A discount offer - methods like getXXX() are used to expose
   * properties that can be joined with each Solr result via the join
   * id field.
   */
  public class Offer {
    private JsonValue offer;
    
    public Offer(JsonValue offer) {
      this.offer = offer;
    }
    
    public double getDiscount() {
      return ((JsonObject)offer).getInt(discountField) * 0.01d;
    }
  }
}

Here, the init() method initialises the URL for the external API and the names of the values we want to pick out from the external data. The getResults() method connects to the external API – since in this example, the discounts do not depend on the user’s query, we don’t use the SolrParams argument at all. It returns an implementation of XJoinResults, which must be able to return a collection of join ids (so, the value of the join id field for each external result), and also be able to return an external result object given a join id. Together, the XJoinResults object and each external result object contain the results of the external search, exposed via getXXX() methods (which are mapped to properties called XXX) and (once everything is plumbed in) available to Solr for filtering, affecting the scores of documents, or for inclusion in the results set.

Save the above as blog/src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java. You’ll also need javax.json-1.0.4.jar, which you can download from Maven Central if you don’t already have it – place it in the blog directory. Compile the two Java source files, and create a JAR to contain the resulting .class files:

blog$ mkdir bin
blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java
blog$ jar cvf offer.jar -C bin .

Configuring XJoin

So now – at last! – I’ll configure a Solr query handler that uses the XJoin Solr plugin components to add filters and boost queries based on the external data.

I’ll be working with blog/conf/solrconfig.xml now. The first thing to do is include the contrib JARs for XJoin and our glue code JAR (offer.jar) in <lib> directives near the top of the config file. To do that, add in the following snippet just under the <dataDir> directive:

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/offer.jar" />

Here, you need to substitute /XXX with the full path to the parent of the blog directory.  (We need to include javax.json-1.0.4.jar because it’s a dependency of our offer.jar.) Now for the request handler config – I’ll include everything we’re going to need even though it won’t all be used straightaway:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="discount" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">discount</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_product_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8000/products</str>
    <str name="field">id</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<searchComponent name="x_manufacturer_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">manufacturer</str>
  <lst name="external">
    <str name="url">http://localhost:8000/manufacturers</str>
    <str name="field">manufacturer</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_product_offers">false</bool>
    <str name="x_product_offers.results">count</str>
    <str name="x_product_offers.fl">*</str>

    <bool name="x_manufacturer_offers">false</bool>
    <str name="x_manufacturer_offers.results">count</str>
    <str name="x_manufacturer_offers.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
  <arr name="last-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
</requestHandler>

Insert this request handler config somewhere near the bottom of solrconfig.xml.

Using XJoin in a query

Let’s quickly get a query working, then I’ll explain what all the components that I’ve included do. Try this (remembering to escape curly brackets on the command line):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .

You should see output like this (I’ve edited responseHeader.params for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 22,
    "params": {
      "x_product_offers": "true", 
      "x_product_offers.results": "count",
      "x_product_offers.fl": "*", 
      "q": "*", 
      "fq": "{!xjoin}x_product_offers", 
      "fl": "id,name",
      "rows": "4"
    }
  },
  "response": {
    "numFound": 64,
    "start": 0,
    "docs": [
      {
        "name": "did0480p-m311 plasmon additional maintenance 24x7 - plasmon diamond technical support - consul",
        "id": "http://www.google.com/base/feeds/snippets/13522752516373728128"
      },
      {
        "name": "apple ilife '06 family pack",
        "id": "http://www.google.com/base/feeds/snippets/10939909441298262260"
      },
      {
        "name": "adobe cs3 web standard upsell",
        "id": "http://www.google.com/base/feeds/snippets/8042583218932085904"
      },
      {
        "name": "the richard friedman trio motown hits - *(for the tg-100)*",
        "id": "http://www.google.com/base/feeds/snippets/17853905518738313346"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13522752516373728128",
        "doc": {
          "discount": 0.11
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/10939909441298262260",
        "doc": {
          "discount": 0.76
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/8042583218932085904",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17853905518738313346",
        "doc": {
          "discount": 0.05
        }
      }
    ]
  }
}

Here you can see the usual Solr output with our product documents in the response.docs array. Notice the value of response.numFound is only 64 out of a possible 3226. Additionally, we have an extra section, response.x_product_offers, that gives us results from the external offers API – count tells us the total number of external results found, and there is an external result object with a join id matching each hit in the Solr results.

The query we made to get these results is a combination of the parameters in the request handler, and those in the URL’s query string – I’ve left the pertinent ones in responseHeader.params. The first parameter, x_product_offers=true, turns on the XJoin component that talks to the offers API, so that at query time, it will make a connection and retrieve external results (note that in this case, no parameters are passed to the external API – the following blog post will demonstrate this). The following two parameters control which fields are output from the external results – the .results option is a field list which controls the fields returned from the OfferResults object (that’s our implementation of XJoinResults – see the code above – there is one OfferResults object per external request and it acts as a collection of the returned external results). Then the .fl option is another field list which controls the fields returned for each external result object – these values can be used for filtering, boosting, and so on (for more on which, see below).
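
These defaults can be overridden per query – for example, to return only the discount property for each external result, add a field-list parameter like this (a sketch; any field list will do):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_product_offers.fl=discount&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .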

The parameters q=*, fl=id,name and rows=4 have their usual effects. The really interesting parameter is the filter query:

fq={!xjoin}x_product_offers

This uses Solr local parameters “short-form” syntax to reference the XJoinQParserPlugin that was set up in solrconfig.xml (it doesn’t take any initialisation parameters). This component uses the join ids from the referenced XJoin component to create a query that ORs together terms like join_field:join_id (one for each external result). It is based on the Solr built-in TermsQParserPlugin and supports the same method parameter (but this can usually be omitted). So, here, it makes a filter based on the join ids returned by the offers API – thus, only the products which have a current offer are returned.
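
Conceptually, the resulting filter behaves as if we had ORed together one term per external result over the join field – something like the following (using two of the join ids from the output above; illustrative only):

fq=id:("http://www.google.com/base/feeds/snippets/13522752516373728128" OR "http://www.google.com/base/feeds/snippets/10939909441298262260" OR ...)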

Note that we could have used the same syntax in the q parameter to achieve the same effect, but it’s more usual for the user’s full-text query to be specified in q and a ‘join’ created using a filter query.

Using the XJoinValueSourceParser

The XJoinValueSourceParser component that we have configured in solrconfig.xml provides us with a function, discount, that we can use in a function query. I configured the component to extract the value of discount from external results, and we supply an XJoin component name as the argument – this is a reference to a set of external results.

This opens up lots of possibilities, for example, a search in which each product’s score is  boosted by a reciprocal function of the price including discount (so cheaper products, after discounting, are boosted higher):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,price,score&rows=4' | jq .

which results in a response something like (again, with responseHeader.params edited for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
       "x_product_offers": "true", 
       "x_product_offers.results": "count", 
       "x_product_offers.fl": "*", 
       "q": "*", 
       "bf": "recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2",
       "fl": "id,price,score",
       "rows": "4"
     }
   },
  "response": {
    "numFound": 3226,
    "start": 0,
    "maxScore": 1.3371909,
    "docs": [
      {
        "id": "http://www.google.com/base/feeds/snippets/549551716004314019",
        "price": 0.5,
        "score": 1.3371909
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "price": 8.49,
        "score": 1.325241
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "price": 9.9,
        "score": 1.3166784
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/18427513736767114578",
        "price": 2.99,
        "score": 1.3156738
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "doc": {
          "discount": 0.71
        }
      }
    ]
  }
}

This time, because we haven’t applied a filter based on the external join ids, we still have the full set of documents in the results set (3226 in total). Note that although there are 4 results in response.docs (as requested by rows=4), there are only 2 external results in x_product_offers.external – this is because only 2 of those 4 Solr documents have matching external results (that is, their join field holds a join id returned by the external source, which in this case is the product id). In other words, only 2 out of the 4 products returned have discounts offered.

To achieve the price boost, instead of a filter query, we have a boost function:

bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2

For each Solr document in the results set, the value of the expression discount(x_product_offers) is found by calling getDiscount() on the matching external result  in the x_product_offers XJoin search component. When there is no matching external result, the default value 0.0 is used, as configured for the value source parser in solrconfig.xml, which is equivalent to a 0% discount.

Of course, instead of the match-all q=* query, we can do an actual product search with our price boost, for example, q=apple. To be more sophisticated, we can also use the edismax parameter qf to query across both the name and description fields and weight them as we desire, for example, qf=name^4 description^2 or similar.
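
Putting that together, a boosted full-text search might look like this (a sketch combining the parameters shown above – tune the weights to taste):

blog$ curl 'localhost:8983/solr/products/xjoin?q=apple&qf=name^4+description^2&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,name,price,score&rows=4' | jq .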

Joining on a field other than the unique id field

The join field does not have to correspond to the Solr unique id field. As seen above, the offers web API also returns discounts based on manufacturer (using the /manufacturers end-point). I configured another XJoin search component in solrconfig.xml called x_manufacturer_offers, the only differences from x_product_offers being the join field (now manufacturer) and the field taken from the external results as the join value (also, of course, manufacturer).

So now, for example, we can run a weighted query for “software”, restricting the results to products with a manufacturer discount of at least 20%:

blog$ curl 'localhost:8983/solr/products/xjoin?q=software&qf=name^4+description^2&x_manufacturer_offers=true&fq=\{!frange+l=0.2\}discount(x_manufacturer_offers)&fl=*&rows=4' | jq .

See FunctionRangeQParserPlugin for details of the filter query used in this search. This gives something like (responseHeader.params omitted this time):

{
  "responseHeader": {
    "status": 0,
    "QTime": 4
  },
  "response": {
    "numFound": 25,
    "start": 0,
    "maxScore": 1.1224447,
    "docs": [
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/7436299398173390476",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17001745805951209994",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/10584509515076384561",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329559531500
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17283219592038470822",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329681166300
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "freeverse software",
        "doc": {
          "discount": 0.2
        }
      }
    ]
  }
}

In this case, there was only one manufacturer represented in the requested top 4 rows of the Solr results set.

Using two XJoin components in the same query

It’s worth noting that you can use more than one XJoin component in the same query. You can come up with more complicated examples, but this one shows how to query for all products that have a manufacturer discount as well as a product discount:

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_manufacturer_offers=true&fq=\{!xjoin\}x_product_offers&fq=\{!xjoin\}x_manufacturer_offers&fl=id,name,manufacturer&rows=4&wt=json' | jq .

You might have to try again a few times before you get a non-empty result set – here’s one I got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "name": "apple software m8789z/a webobjects 5.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/4776201646741876078"
      },
      {
        "name": "apple software m9301z/b soundtrack v1.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/16537637847870148950"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/4776201646741876078",
        "doc": {
          "discount": 0.59
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/16537637847870148950",
        "doc": {
          "discount": 0.22
        }
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "apple software",
        "doc": {
          "discount": 0.3
        }
      }
    ]
  }
}

So you can see that there are two external results sections, one for product offers and one for manufacturer offers, and how the offers are matched to the products by the join ids (which is either the product id, or the manufacturer).

Next time…

In my next blog post, I’ll dive in to another demonstration of XJoin, in which I show how to use click-through data to influence the score of subsequent searches.

Introducing Luwak, a library for high-performance stored queries
6 December 2013

A few weeks ago we spoke in Dublin at Lucene Revolution 2013 on our work in the media monitoring sector for various clients including Gorkana and Australian Associated Press. These organisations handle a huge number (sometimes hundreds of thousands) of news articles every day and need to apply tens of thousands of stored expressions to each one, which would be extremely inefficient if done with standard search engine libraries. We’ve developed a much more efficient way to achieve the same result, by pre-filtering the expressions before they’re even applied: effectively we index the expressions and use the news article itself as a query, which led to the presentation title ‘Turning Search Upside Down’.

We’re pleased to announce that the core of this process, a Java library we’ve called Luwak, is now available as open source software for your own projects. Here’s how you might use it:

// Create a new monitor, with a presearcher that filters stored queries by their terms
Monitor monitor = new Monitor(new TermFilteredPresearcher());

// Create and load a stored query with a single term
// (textfield is the name of the field to match against, defined elsewhere)
MonitorQuery mq = new MonitorQuery("query1", new TermQuery(new Term(textfield, "test")));
monitor.update(mq);

// Load a document (which could be a news article): 'document' holds its text,
// and WHITESPACE is the analyzer used to tokenise it
InputDocument doc = InputDocument.builder("doc1")
    .addField(textfield, document, WHITESPACE)
    .build();

// Retrieve which queries it matches
DocumentMatches matches = monitor.match(doc);

The library is based on our own fork of the Apache Lucene library (as Lucene doesn’t yet have a couple of features we need, although we expect these to end up in a release version of Lucene very soon). Our own tests have produced speeds of up to 70,000 stored queries applied to an article in around a second on modest hardware. Do let us know any feedback you have on Luwak – we think it may be useful for various monitoring and classification tasks where high throughput is necessary.

Strange bedfellows? The rise of cloud based search
8 June 2012

Last night our US partners Lucid Imagination announced that LucidWorks, their packaged and supported version of Apache Lucene/Solr, is available on Microsoft’s Azure cloud computing service. It seems like only a few weeks since Amazon announced their own CloudSearch system, and no doubt other ‘search as a service’ providers are waiting in the wings (we’re going to need a new acronym, as SaaS is already taken!). At first the combination of a search platform based on open source Java code with Microsoft hosting might seem strange, and it raises some interesting questions about the future of Microsoft’s own FAST search technology – is this final proof that FAST will only ever be part of SharePoint and never a standalone product? However, with search technology becoming more and more of a commodity, this is a great option for customers looking for search over relatively small numbers of documents.

Lucid’s offering is considerably more flexible and full-featured than Amazon’s, which we hear is pretty basic, lacking standard search features like contextual snippets and with a number of bugs in the client software. You can see some of these issues in action at Runar Buvik’s excellent OpenTestSearch website. With prices for the Lucid service starting at free for small indexes, this is certainly an option worth considering.

NOT WITHIN queries in Lucene
22 February 2012

A guest post from Alan Woodward who has joined the Flax team recently:

I’ve been working on migrating a client from a legacy dtSearch platform to a new system based on Lucene, part of which involves writing a query parser to translate their existing dtSearch queries into Lucene Query objects. dtSearch allows you to perform proximity searches – find documents with term A within X positions of term B – which can be reproduced using Lucene SpanQueries (a good introduction to span queries can be found on the Lucid Imagination blog). SpanQueries match Spans – positional ranges within a document, defined by a start and an end position. So to search for “fish” within two positions of “chips”, you’d create a SpanNearQuery, passing in the terms “fish” and “chips” and an edit distance of 2.

You can also search for terms that are not within X positions of another term. This too is possible to achieve with SpanQueries, with a bit of trickery.

Let’s say we have the following document:

fish and chips is nicer than fish and jam

We want to match documents that contain the term ‘fish’, but not if it’s within two positions of the term ‘chips’ – the relevant dtSearch syntax here is “fish” NOT WITHIN/2 “chips”. A query of this type should return the document above, as the second instance of the term ‘fish’ matches our criteria. We can’t just negate a normal “fish” WITHIN/2 “chips” query, as that won’t match our document. We need to somehow distinguish between tokens within a document based on their context.

Enter the SpanNotQuery. A SpanNotQuery takes two SpanQueries, and returns all documents that have instances of the first Span that do not overlap with instances of the second. The Lucid Imagination post linked above gives the example of searching for “George Bush” – say you wanted documents relating to George W Bush, but not to George H W Bush. You could create a SpanNotQuery that looked for “George” within 2 positions of “Bush”, not overlapping with “H”.

In our specific case, we want to find instances of “fish” that do not overlap with Spans of “fish” within/2 “chips”. So to create our query, we need the following:

int distance = 2;
boolean ordered = true;

// Span queries matching each term on its own (SpanTermQuery takes a plain Term)
SpanQuery fish = new SpanTermQuery(new Term(FIELD, "fish"));
SpanQuery chips = new SpanTermQuery(new Term(FIELD, "chips"));

// Spans of "fish" within two positions of "chips"
SpanQuery fishnearchips = new SpanNearQuery(new SpanQuery[] { fish, chips },
                                            distance, ordered);

// Spans of "fish" that do NOT overlap with the spans above
Query q = new SpanNotQuery(fish, fishnearchips);

It’s a bit verbose, but that’s Java for you.

Enterprise Search London – Financial applications, SBA book and Solr searching 120m documents
10 February 2011

Another excellent evening as part of the Enterprise Search London Meetup series; very busy as usual.

Amir Dotan started us off with details of his work in designing user interfaces for the financial services sector, describing some of the challenges involved in designing for a high-pressure and highly regulated environment. Although he didn’t talk about search specifically we heard a lot about how to design useful interfaces. Two quotes stood out: “The right user interface can help make billions”, and as a way to get feedback “find someone nice in the business and never let them go”.

Gregory Grefenstette of Exalead was next, talking about his new book on Search Based Applications. He explained how SBAs have advantages over traditional databases in the three areas of agility, usability and performance and went on to show some examples, before an unfortunate combination of a broken slide deck and a failing laptop battery brought him to a halt: in retrospect a great advertisement for a physical book over a computer!

Upayavira of Sourcesense was next with details of a new search built for online news aggregator Moreover. This dealt with scaling Lucene/Solr to cope with indexing 2 million new documents a day, for a rolling 2 month index. He showed how some initial memory and performance problems had been solved with a combination of pre-warming caches, tweaks to the JVM and Java garbage collector and eventually profiling of their custom code. Particularly interesting was how they had developed a system for spinning up a complete copy of the searchable database (for load balancing purposes) on the Amazon EC2 cloud – from a standing start they can allocate servers, install software and copy across searchable indexes in around 40 minutes. This was a great demonstration of the power of the open source model – no more licenses to buy! Search performance over this large collection is pretty good as well, with faceted queries returning in a second or two and unfaceted in half a second.

We also heard from Martin White about an exciting new search-related conference to be held in October this year in London in association with Information Today, Inc., and I managed a quick plug for our inaugural Cambridge Enterprise Search Meetup on Wednesday 16th February.

Flax partners with Lucid Imagination
4 October 2010

We’re very happy to announce that we’ve been selected as an Authorized Partner by Lucid Imagination, the commercial company for Lucene and Solr. You can read the press release as a PDF here.

Apache Lucene and Solr, available as open source software from the Apache Software Foundation, are powerful, scalable, reliable and fully-featured search technologies. Solr is the Lucene Search Server, making it easy to build search applications for the enterprise.

With our long experience of customising, installing and supporting open source search engines, this partnership is a natural fit for us, and we’re excited by the opportunities it presents. In addition to our current offerings, Flax will now offer installation, integration and commercial support packages for Lucene and Solr, backed by Lucid Imagination.

Open source search engines and programming languages
3 September 2010

So you’re writing a search-related application in your favourite language, and you’ve decided to choose an open source search engine to power it. So far, so good – but how are the two going to communicate?

Let’s look at two engines, Xapian and Lucene, and compare how this might be done. Lucene is written in Java, Xapian in C/C++ – so if you’re using those languages respectively, everything should be relatively simple – just download the source code and get on with it. However if this isn’t the case, you’re going to have to work out how to interface to the engine.

The Lucene project has been rewritten in several other languages: for C/C++ there’s Lucy (which includes Perl and Ruby bindings), for Python there’s PyLucene, and there’s even a .Net version called, not surprisingly, Lucene.NET. Some of these ‘ports’ of Lucene are ‘looser’ than others (i.e. they may not share the same API or feature set), and they may not be updated as often as Lucene itself. There are also versions in Perl, Ruby, Delphi or even Lisp (scary!) – there’s a full list available. Not all are currently active projects.

Xapian takes a different approach, with only one core project, but a sheaf of bindings to other languages. Currently these bindings cover C#, Java, Perl, PHP, Python, Ruby and Tcl – but interestingly these are auto-generated using the Simplified Wrapper and Interface Generator or SWIG. This means that every time Xapian’s API changes, the bindings can easily be updated to reflect this (it’s actually not quite that simple, but SWIG copes with the vast majority of code that would otherwise have to be manually edited). SWIG actually supports other languages as well (according to the SWIG website, “Common Lisp (CLISP, Allegro CL, CFFI, UFFI), Lua, Modula-3, OCAML, Octave and R. Also several interpreted and compiled Scheme implementations (Guile, MzScheme, Chicken)”) so in theory bindings to these could also be built relatively easily.

There’s also another way to communicate with both engines: using a search server. Solr is the search server for Lucene, whereas for Xapian there is the Flax Search Service. In this case, any language that supports web services (you’d be hard pressed to find a modern language that doesn’t) can communicate with the engine, simply by passing data over the HTTP protocol.
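
As a minimal illustration (assuming a Solr instance running at localhost:8983 with a default core – adjust the URL for your own setup), here’s a Java client querying the search server over HTTP; any language with an HTTP library can do the same:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SolrHttpSearch {
  public static void main(String[] args) throws Exception {
    String q = URLEncoder.encode("open source search", "UTF-8");
    HttpURLConnection http = (HttpURLConnection) new URL(
        "http://localhost:8983/solr/select?wt=json&q=" + q).openConnection();
    http.setRequestProperty("Accept", "application/json");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(http.getInputStream(), "UTF-8"))) {
      // Print the raw JSON response from the search server
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      http.disconnect();
    }
  }
}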
