python – Flax (http://www.flax.co.uk) – The Open Source Search Specialists

Better search for life sciences at the BioSolr Workshop, day 2 – Elasticsearch & others
15 February 2016 – http://www.flax.co.uk/blog/2016/02/15/better-search-life-sciences-biosolr-workshop-day-2-elasticsearch-others/

Over the last 18 months we’ve been working closely with the European Bioinformatics Institute on a project to improve their use of open source search engines, funded by the BBSRC. The project was originally named BioSolr but has since grown to encompass Elasticsearch. Last week we held a two-day workshop on the Wellcome Genome Campus near Cambridge to showcase our achievements and hear from others working in the same field, focused on Solr on the first day and Elasticsearch and other solutions on the second. Attendees included both bioinformaticians and search experts, as the project has very much been about collaboration and learning from each other. Read about the first day here.

The second day started with Eric Pugh’s second talk on The (Unofficial) State of Elasticsearch, bringing us all up to date on the meteoric rise of this technology and the opportunities it opens up, especially in analytics and visualisation. Eric foresees Elasticsearch continuing to specialise in this area, with Solr sticking closer to its roots in information retrieval. Giovanni Tumarello followed with a fast-paced demonstration of Kibi, a platform built on Elasticsearch and Kibana. Kibi allows one to very quickly join, visualise and explore different data sets, and I was impressed with the range of potential applications, including in the life sciences.

Evan Bolton of the US-based NCBI was next, talking about the massive PubChem dataset (80 million unique chemical structures, 200 million chemical substance descriptions, and 230 million biological activities, all heavily crosslinked). Although both Solr and CLucene had been considered, they eventually settled on the Sphinx engine with its great support for SQL queries and JOINs, although Evan admitted this was not a cloud-friendly solution. His team are now considering knowledge graphs and how to present up to 100 billion RDF triples. Andrea Pierleoni of the Centre for Therapeutic Target Validation then talked about an Elasticsearch cluster he has developed to index ‘evidence strings’ (which relate targets to diseases using evidence). This is a relatively small collection of 2.1 million association objects, pre-processed using Python and stored in Redis before indexing.

Next up was Nikos Marinos from the EBI Literature Services team, talking about their recent migration from Lucene to Solr. As he explained, most of this was a straightforward task, with one wrinkle being the use of DIH Transformers where array data was used. Rafael Jimenez then talked about projects he has worked on using both Elasticsearch and Solr, and stressed the importance of adhering to open standards and re-using software where possible – key strengths of open source, of course. Michal Nowotka then talked about a proposed system to replace the current ChEMBL search using Solr and django-haystack (the latter allows one to use a variety of underlying search engines from Django). Finally, Nicola Buso talked about EBISearch, based on Lucene.

We then concluded with another hands-on session, this time aimed more at Elasticsearch. As you can probably tell, we had been shown a huge variety of search needs and solutions using a range of technologies over the two days, and it was clear to me that the BioSolr project is only a small first step towards improving the software available – we have applied for further funding and hope to have good news soon! Working with life science data, often at significant scale, has been fascinating.

Most of the presentations are now available for download. Thanks to all the presenters (especially those who travelled from abroad), the EBI for kindly hosting the event and in particular to Dr Sameer Velankar who has been the driving force behind this project.

XJoin for Solr, part 1: filtering using price discount data
25 January 2016 – http://www.flax.co.uk/blog/2016/01/25/xjoin-solr-part-1-filtering-using-price-discount-data/

In this blog post I want to introduce you to a new Apache Solr plugin component called XJoin. I’ll show how we can use this to solve a common problem in e-commerce – how to use price discount data, provided by an external web API, to either filter the results of a product search or boost scores. A further post will show another example, using click-through data to influence the score of subsequent searches.

What is XJoin?

The XJoin component can be used when you want values from some source external to Solr to filter or influence the score of hits in your Solr result set.  It is currently available as a Solr patch on the XJoin JIRA ticket SOLR-7341, so to use it, you’ll need to check out a version of Apache Lucene/Solr using Subversion, then patch and build it (see below for details).

The XJoin patch was developed as part of the BioSolr project, but it is not specific to bioinformatics and can be used in any situation where you want to use data from an external source to influence the results of a Solr search. (Other joins, including cross-core joins, are available – but you need XJoin if the data you are joining against is not in Solr.) We’ll be talking about XJoin and the other features we’ve developed for both Solr and Elasticsearch, including powerful ontology indexing, at a workshop at the European Bioinformatics Institute next week.

Patching SOLR

I’m going to be using Solr version 5.3 for this blog. If you’re following along, check out a clean copy using Subversion:

$ svn co https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_5_3

Download the XJoin patch (find the one corresponding to this version of Solr on the JIRA ticket) into the newly checked-out directory, and apply it:

lucene_solr_5_3$ svn patch SOLR-7341.patch-5_3

And then build Solr from the solr sub-directory:

lucene_solr_5_3/solr$ ant server

We should now be able to start the patched Solr server:

lucene_solr_5_3/solr$ bin/solr start

Indexing a sample product data set

I’ll be using a sample Google product feed, GoogleProducts.csv, which I got from here. Create a new directory called blog (mine has the same parent as my Solr check-out) and download the sample into it. It’s in CSV format, with columns for product id, name, description, manufacturer and price. Indexing this will be a piece of cake!

We’ll begin with a copy of the sample Solr config directory:

blog$ cp -r ../lucene_solr_5_3/solr/server/solr/configsets/basic_configs/conf .

Modify conf/schema.xml so that our Solr documents have fields corresponding to those in the CSV file:

<field name="id" type="string" indexed="true" /> 
<field name="name" type="text_en" indexed="true" />
<field name="description" type="text_en" indexed="true" />
<field name="manufacturer" type="string" indexed="true" />
<field name="price" type="float" indexed="true" />

Naturally, the product id will serve as the Solr unique key:

<uniqueKey>id</uniqueKey>

We can use the sample solrconfig.xml as is for now. Add a core called products using the Solr core admin UI (as you started a Solr server above, this should be available at http://localhost:8983/solr/#/~cores). The values for instanceDir and dataDir will both be the full path of the blog directory.
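If you'd rather script this step, the same thing can be done through Solr's CoreAdmin API instead of the admin UI. Here's a minimal sketch using the requests module – the /XXX/blog path is a placeholder for the full path to your blog directory:

import requests

# Create the 'products' core via the CoreAdmin API (equivalent to using the admin UI)
params = {
    'action': 'CREATE',
    'name': 'products',
    'instanceDir': '/XXX/blog',   # placeholder - substitute the full path to the blog directory
    'dataDir': '/XXX/blog',
}
r = requests.get('http://localhost:8983/solr/admin/cores', params=params)
r.raise_for_status()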

I’ll be using Python to index the product data. The code is written for Python 3, and won’t work in Python 2.x because of character encoding issues in the csv module, but you can fix it by using a UTF8Recoder as described in the module documentation. Here’s my indexing script (note that all the code written for this example is also available in the BioSolr GitHub repository):

import sys
import csv
import json
import requests

def value(k, v):
    return k, v.strip() if k != 'price' else float(v.split()[0])

def read(path):
    with open(path, encoding='iso-8859-1') as f:
        reader = csv.DictReader(f)
        for doc in reader:
            yield dict(value(k, v) for k, v in doc.items()
                       if len(v.strip()) > 0)

def index(url, docs):
    print("Sending {0} documents to {1}".format(len(docs), url))
    data = json.dumps(docs)
    headers = { 'content-type': 'application/json' }
    r = requests.post(url, data=data, headers=headers)
    if r.status_code != 200:
      raise IOError("Bad SOLR update")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {0} <Solr update URL> <CSV file>".format(sys.argv[0]))
        sys.exit(1)

    docs = list(read(sys.argv[2]))
    index(sys.argv[1], docs)

The script tidies up the prices because they aren’t consistently formatted, converting them to float values. Save the script in index.py and use it to index the Google product data into Solr (let’s force commits, just to be sure):

blog$ python3 index.py http://localhost:8983/solr/products/update?commit=true GoogleProducts.csv

And, lo and behold, we can see our data in Solr using cURL (I like to pipe the output through jq to get nicely formatted JSON):

curl 'localhost:8983/solr/products/select?wt=json&q=*' | jq .

So, using Solr we’ve now built a full text product search in only a few minutes, with potentially all the add-ons Solr provides out of the box. However, suppose there is supplementary information about the products, available from an external source (which might not be under our control).

I will now demonstrate how to configure Solr so that during a product search, the external source is also queried (either with the same user query or something different) and the resulting external data used to influence the result set. Each external result is ‘joined’ against a Solr document via a ‘join field’ or ‘join id’, which doesn’t have to be the Solr unique id (in the examples below I use the product id and manufacturer as the join fields). To get an ‘inner join’ I will use the XJoinQParserPlugin to turn the external ids into a filter query, but it’s also possible to build boost queries or use the XJoinValueSourceParser to use external values in a boost function. You can see all this implemented below.

Product discount offers example

In the first of my examples, I’ll set up filtering and score boosting based on discount offers, the external source for which is going to be a web service, which I’m going to make available locally at http://localhost:8000/ (with /products and /manufacturers endpoints). Again, I’ll implement this in Python, using the popular Flask micro-framework and the requests module. Install both of these using pip (I need sudo, but you might not):

blog$ sudo pip install flask requests

Creating the external source

Here’s my code for the product offers web API:

from flask import Flask
from index import read
import json
import random
import sys

app = Flask(__name__)

@app.route('/')
def main():
    return json.dumps({ 'info': 'product offers API' })

@app.route('/products')
def products():
    offer = lambda doc: {
                'id': doc['id'],
                'discountPct': random.randint(1, 80)
            }
    return json.dumps([offer(doc) for doc
                       in random.sample(app.docs, 64)])

@app.route('/manufacturers')
def manufacturer():
  manufacturers = set(doc['manufacturer'] for doc in app.docs
                      if 'manufacturer' in doc)
  deal = lambda m: {
             'manufacturer': m,
             'discountPct': random.randint(1, 10) * 5
         }
  return json.dumps([deal(m) for m
                     in random.sample(manufacturers, 3)])

if __name__ == "__main__":
  if len(sys.argv) < 2:
    print("Usage: {0} <CSV file>".format(sys.argv[0]))
    sys.exit(1)

  app.docs = list(read(sys.argv[1]))
  app.run(port=8000, debug=True)

The code generates discounts for a random selection of products and manufacturers. Save it to blog/offer.py and start the server, supplying the Google products CSV file on the command line:

blog$ python3 offer.py GoogleProducts.csv

Now, test it out using cURL (again, I like to pipe through jq to get nicely formatted JSON):

$ curl -s localhost:8000/products | jq .

You should see a list of objects, each with a product id and a discount percentage, something like:

[
  {
    "discountPct": 41,
    "id": "http://www.google.com/base/feeds/snippets/18100341066456401733"
  },
  {
    "discountPct": 63,
    "id": "http://www.google.com/base/feeds/snippets/16969493842479402672"
  },
  {
    "discountPct": 13,
    "id": "http://www.google.com/base/feeds/snippets/10357785197400989441"
  },
  {
    "discountPct": 35,
    "id": "http://www.google.com/base/feeds/snippets/2813321165033737171"
  },
  {
    "discountPct": 27,
    "id": "http://www.google.com/base/feeds/snippets/15203735208016659510"
  },
  ...
]

You get similar output if you use the /manufacturers endpoint:

$ curl -s localhost:8000/manufacturers | jq .

This time, we get a shorter list, of manufacturers each with a discount percentage, for example:

[
  {
    "discountPct": 15,
    "manufacturer": "freeverse software"
  },
  {
    "discountPct": 5,
    "manufacturer": "pinnacle systems"
  },
  {
    "discountPct": 50,
    "manufacturer": "destineer inc"
  }
]

Creating XJoin glue code

To bridge the gap between Solr and our external data source, XJoin requires some glue code, written in Java, to query the source and return the results. First, I’ll create a quick utility class to help with HTTP connections:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import javax.json.Json;
import javax.json.JsonReader;
import javax.json.JsonStructure;

public class HttpConnection implements AutoCloseable {
  private HttpURLConnection http;
  
  public HttpConnection(String url) throws IOException {
    http = (HttpURLConnection)new URL(url).openConnection();
  }
  
  public JsonStructure getJson() throws IOException {
    http.setRequestMethod("GET");
    http.setRequestProperty("Accept", "application/json");
    try (InputStream in = http.getInputStream();
         JsonReader reader = Json.createReader(in)) {
      return reader.read();
    }
  }
  
  @Override
  public void close() {
    http.disconnect();
  }
}

Save this as blog/src/java/uk/co/flax/examples/xjoin/HttpConnection.java. The glue code we need is fairly simple, and can be written as a single class, implementing the XJoinResultsFactory interface:

package uk.co.flax.examples.xjoin;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.json.JsonArray;
import javax.json.JsonObject;
import javax.json.JsonValue;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.search.xjoin.XJoinResults;
import org.apache.solr.search.xjoin.XJoinResultsFactory;

public class OfferXJoinResultsFactory
implements XJoinResultsFactory {
  private String url;
  private String field;
  private String discountField;
  
  @Override
  @SuppressWarnings("rawtypes")
  public void init(NamedList args) {
    url = (String)args.get("url");
    field = (String)args.get("field");
    discountField = (String)args.get("discountField");
  }

  /**
   * Use 'offers' REST API to fetch current offer data. 
   */
  @Override
  public XJoinResults getResults(SolrParams params)
  throws IOException {
    try (HttpConnection http = new HttpConnection(url)) {
      JsonArray offers = (JsonArray)http.getJson();
      return new OfferResults(offers);
    }
  }
   
  /**
   * Results of the external search - methods like getXXX() are used
   * to expose the property XXX in the SOLR results.
   */
  public class OfferResults implements XJoinResults {
    private JsonArray offers;
    
    public OfferResults(JsonArray offers) {
      this.offers = offers;
    }
    
    public int getCount() {
      return offers.size();
    }
    
    @Override
    public Iterable getJoinIds() {
      List ids = new ArrayList<>();
      for (JsonValue offer : offers) {
        ids.add(((JsonObject)offer).getString(field));
      }
      return ids;
    }

    @Override
    public Object getResult(String joinIdStr) {
      for (JsonValue offer : offers) {
        String id = ((JsonObject)offer).getString(field);
        if (id.equals(joinIdStr)) {
          return new Offer(offer);
        }
      }
      return null;
    }
  }
  
  /**
   * A discount offer - methods like getXXX() are used to expose
   * properties that can be joined with each Solr result via the join
   * id field.
   */
  public class Offer {
    private JsonValue offer;
    
    public Offer(JsonValue offer) {
      this.offer = offer;
    }
    
    public double getDiscount() {
      return ((JsonObject)offer).getInt(discountField) * 0.01d;
    }
  }
}

Here, the init() method initialises the URL for the external API and the names of the values we want to pick out from the external data. The getResults() method connects to the external API – since in this example the discounts do not depend on the user’s query, we don’t use the SolrParams argument at all. It returns an implementation of XJoinResults, which must be able to return a collection of join ids (so, the value of the join id field for each external result), and also be able to return an external result object given a join id. Together, the XJoinResults object and each external result object contain the results of the external search, exposed via getXXX() methods (which are mapped to properties called XXX) and (once everything is plumbed in) available to Solr for filtering, for affecting document scores, or for inclusion in the results set.

Save the above as blog/src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java. You’ll also need javax.json-1.0.4.jar, which you can download from here if you don’t already have it – place it in the blog directory. Compile the two Java source files, and create a JAR to contain the resulting .class files:

blog$ mkdir bin
blog$ javac -sourcepath src/java -d bin -cp javax.json-1.0.4.jar:../lucene_solr_5_3/solr/dist/solr-solrj-5.3.2-SNAPSHOT.jar:../lucene_solr_5_3/solr/dist/solr-xjoin-5.3.2-SNAPSHOT.jar src/java/uk/co/flax/examples/xjoin/OfferXJoinResultsFactory.java
blog$ jar cvf offer.jar -C bin .

Configuring XJoin

So now – at last! – I’ll configure a Solr query handler that uses the XJoin Solr plugin components to add filters and boost queries based on the external data.

I’ll be working with blog/conf/solrconfig.xml now. The first thing to do is include the contrib JARs for XJoin and our glue code JAR (offer.jar) in <lib> directives near the top of the config file. To do that, add in the following snippet just under the <dataDir> directive:

<lib dir="${solr.install.dir:../../../..}/contrib/xjoin/lib" regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-xjoin-\d.*\.jar" />
<lib path="/XXX/blog/javax.json-1.0.4.jar" />
<lib path="/XXX/blog/offer.jar" />

Here, you need to substitute /XXX with the full path to the parent of the blog directory.  (We need to include javax.json-1.0.4.jar because it’s a dependency of our offer.jar.) Now for the request handler config – I’ll include everything we’re going to need even though it won’t all be used straightaway:

<queryParser name="xjoin" class="org.apache.solr.search.xjoin.XJoinQParserPlugin" />

<valueSourceParser name="discount" class="org.apache.solr.search.xjoin.XJoinValueSourceParser">
  <str name="attribute">discount</str>
  <double name="defaultValue">0.0</double>
</valueSourceParser>

<searchComponent name="x_product_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">id</str>
  <lst name="external">
    <str name="url">http://localhost:8000/products</str>
    <str name="field">id</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<searchComponent name="x_manufacturer_offers" class="org.apache.solr.search.xjoin.XJoinSearchComponent">
  <str name="factoryClass">uk.co.flax.examples.xjoin.OfferXJoinResultsFactory</str>
  <str name="joinField">manufacturer</str>
  <lst name="external">
    <str name="url">http://localhost:8000/manufacturers</str>
    <str name="field">manufacturer</str>
    <str name="discountField">discountPct</str>
  </lst>
</searchComponent>

<requestHandler name="/xjoin" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="echoParams">all</str>
    <str name="defType">edismax</str>
    <str name="df">description</str>
    <str name="fl">*</str>

    <bool name="x_product_offers">false</bool>
    <str name="x_product_offers.results">count</str>
    <str name="x_product_offers.fl">*</str>

    <bool name="x_manufacturer_offers">false</bool>
    <str name="x_manufacturer_offers.results">count</str>
    <str name="x_manufacturer_offers.fl">*</str>
  </lst>
  <arr name="first-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
  <arr name="last-components">
    <str>x_product_offers</str>
    <str>x_manufacturer_offers</str>
  </arr>
</requestHandler>

Insert this request handler config somewhere near the bottom of solrconfig.xml.

Using XJoin in a query

Let’s quickly get a query working, then I’ll explain what all the components that I’ve included do. Try this (remembering to escape curly brackets on the command line):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&fq=\{!xjoin\}x_product_offers&fl=id,name&rows=4' | jq .

You should see output like this (I’ve edited responseHeader.params for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 22,
    "params": {
      "x_product_offers": "true", 
      "x_product_offers.results": "count",
      "x_product_offers.fl": "*", 
      "q": "*", 
      "fq": "{!xjoin}x_product_offers", 
      "fl": "id,name",
      "rows": "4"
    }
  },
  "response": {
    "numFound": 64,
    "start": 0,
    "docs": [
      {
        "name": "did0480p-m311 plasmon additional maintenance 24x7 - plasmon diamond technical support - consul",
        "id": "http://www.google.com/base/feeds/snippets/13522752516373728128"
      },
      {
        "name": "apple ilife '06 family pack",
        "id": "http://www.google.com/base/feeds/snippets/10939909441298262260"
      },
      {
        "name": "adobe cs3 web standard upsell",
        "id": "http://www.google.com/base/feeds/snippets/8042583218932085904"
      },
      {
        "name": "the richard friedman trio motown hits - *(for the tg-100)*",
        "id": "http://www.google.com/base/feeds/snippets/17853905518738313346"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13522752516373728128",
        "doc": {
          "discount": 0.11
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/10939909441298262260",
        "doc": {
          "discount": 0.76
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/8042583218932085904",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17853905518738313346",
        "doc": {
          "discount": 0.05
        }
      }
    ]
  }
}

Here you can see the usual Solr output with our product documents in the response.docs array. Notice the value of response.numFound is only 64 out of a possible 3226. Additionally, we have an extra section, response.x_product_offers, that gives us results from the external offers API – count tells us the total number of external results found, and there is an external result object with a join id matching each hit in the Solr results.

The query we made to get these results is a combination of the parameters in the request handler and those in the URL’s query string – I’ve left the pertinent ones in responseHeader.params. The first parameter, x_product_offers=true, turns on the XJoin component that talks to the offers API, so that at query time, it will make a connection and retrieve external results (note that in this case, no parameters are passed to the external API – the following blog post will demonstrate this). The next two parameters control which fields are output from the external results – the .results option is a field list which controls the fields returned from the OfferResults object (that’s our implementation of XJoinResults – see the code above – there is one OfferResults object per external request and it acts as a collection of the returned external results). The .fl option is another field list which controls the fields returned for each external result object – these values can be used for filtering, boosting, and so on (for more on which, see below).

The parameters q=*, fl=id,name and rows=4 have their usual effects. The really interesting parameter is the filter query:

fq={!xjoin}x_product_offers

This uses Solr local parameters “short-form” syntax to reference the XJoinQParserPlugin that was set up in solrconfig.xml (it doesn’t take any initialisation parameters). This component uses the join ids from the referenced XJoin component to create a query that ORs together terms like join_field:join_id (one for each external result). It is based on the Solr built-in TermsQParserPlugin and supports the same method parameter (but this can usually be omitted). So, here, it makes a filter based on the join ids returned by the offers API – thus, only the products which have a current offer are returned.
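To make the ‘inner join’ behaviour concrete, here’s a rough Python sketch (not part of XJoin itself, just an illustration) that reproduces the same filter by hand: it fetches the current offers from the external API, ORs the returned ids together into a filter query, and sends that to the plain select handler. The URLs and field names are the ones used elsewhere in this post.

import requests

# Fetch the current offers, as the XJoin component does at query time
offers = requests.get('http://localhost:8000/products').json()

# Build a filter query ORing together one term per join id
ids = ' OR '.join('"{0}"'.format(offer['id']) for offer in offers)
params = {
    'q': '*:*',
    'fq': 'id:({0})'.format(ids),
    'fl': 'id,name',
    'rows': 4,
    'wt': 'json',
}
r = requests.get('http://localhost:8983/solr/products/select', params=params)
print(r.json()['response']['numFound'])   # should print 64 - only products with a current offer

Of course, doing the join client-side like this means two round trips and a potentially enormous filter query; XJoin does the equivalent work inside Solr in a single request.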

Note that we could have used the same syntax in the q parameter to achieve the same effect, but it’s more usual for the user’s full text query to be specified in q and the ‘join’ applied as a filter query.

Using the XJoinValueSourceParser

The XJoinValueSourceParser component that we have configured in solrconfig.xml provides us with a function, discount, that we can use in a function query. I configured the component to extract the value of discount from external results, and we supply an XJoin component name as the argument – this is a reference to a set of external results.

This opens up lots of possibilities, for example, a search in which each product’s score is  boosted by a reciprocal function of the price including discount (so cheaper products, after discounting, are boosted higher):

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2&fl=id,price,score&rows=4' | jq .

which results in a response something like (again, with responseHeader.params edited for clarity):

{
  "responseHeader": {
    "status": 0,
    "QTime": 55,
    "params": {
       "x_product_offers": "true", 
       "x_product_offers.results": "count", 
       "x_product_offers.fl": "*", 
       "q": "*", 
       "bf": "recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2",
       "fl": "id,price,score",
       "rows": "4"
     }
   },
  "response": {
    "numFound": 3226,
    "start": 0,
    "maxScore": 1.3371909,
    "docs": [
      {
        "id": "http://www.google.com/base/feeds/snippets/549551716004314019",
        "price": 0.5,
        "score": 1.3371909
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "price": 8.49,
        "score": 1.325241
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "price": 9.9,
        "score": 1.3166784
      },
      {
        "id": "http://www.google.com/base/feeds/snippets/18427513736767114578",
        "price": 2.99,
        "score": 1.3156738
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/13704505045182265069",
        "doc": {
          "discount": 0.78
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/17894887781222328015",
        "doc": {
          "discount": 0.71
        }
      }
    ]
  }
}

This time, because we haven’t applied a filter based on the external join ids, we still have the full set of documents in the results set (3226 in total). Note that although there are 4 results in response.docs (as requested by rows=4), there are only 2 external results in x_product_offers.external – this is because only 2 of those 4 Solr documents have matching external results (in that they have the same value of join id in the join field, which in this case is the product id). In other words, only 2 out of the 4 products returned have discounts offered.

To achieve the price boost, instead of a filter query, we have a boost function:

bf=recip(product(price,sub(1,discount(x_product_offers))),1,100,100)^2

For each Solr document in the results set, the value of the expression discount(x_product_offers) is found by calling getDiscount() on the matching external result  in the x_product_offers XJoin search component. When there is no matching external result, the default value 0.0 is used, as configured for the value source parser in solrconfig.xml, which is equivalent to a 0% discount.
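To get a feel for what the boost function does, recall that Solr’s recip(x, m, a, b) evaluates to a / (m*x + b), and the ^2 applies a boost of 2 to the function query. Ignoring query normalisation, the additive contribution for each document is therefore roughly 2 * 100 / (discounted price + 100). Here’s a quick sketch (the prices and discounts are just illustrative, and the final Solr scores also include the match score, so the absolute values won’t line up exactly):

def price_boost(price, discount=0.0):
    # Mirrors recip(product(price, sub(1, discount)), 1, 100, 100) with a ^2 boost,
    # using recip(x, m, a, b) = a / (m*x + b)
    discounted = price * (1.0 - discount)
    return 2 * 100.0 / (discounted + 100.0)

print(price_boost(0.5))          # ~1.99 - very cheap product, no discount
print(price_boost(8.49, 0.78))   # ~1.96 - cheap product with a big discount
print(price_boost(500.0))        # ~0.33 - expensive product, no discount

So the boost tends towards 2.0 as the discounted price approaches zero and falls away smoothly for more expensive products – cheaper products after discounting float to the top, as intended.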

Of course, instead of the match-all q=* query, we can do an actual product search with our price boost, for example, q=apple. To be more sophisticated, we can also use the edismax parameter qf to query across both the name and description fields and weight them as we desire, for example, qf=name^4 description^2 or similar.

Joining on a field other than the unique id field

The join field does not have to correspond to the Solr unique id field. As seen above, the offers web API also returns discounts based on manufacturer (using the /manufacturers end-point). I configured another XJoin search component in solrconfig.xml called x_manufacturer_offers, the only differences from x_product_offers being the join field, which is now manufacturer, and the field which is taken from the external results to be the join value, which is of course the same, manufacturer.

So, now for example we can do a weighted query for “games software”, but restricting to products that have a manufacturer discount of at least 20%:

blog$ curl 'localhost:8983/solr/products/xjoin?q=software&qf=name^4+description^2&x_manufacturer_offers=true&fq=\{!frange+l=0.2\}discount(x_manufacturer_offers)&fl=*&rows=4' | jq .

See FunctionRangeQParserPlugin for details of the filter query used in this search. This gives something like (responseHeader.params omitted this time):

{
  "responseHeader": {
    "status": 0,
    "QTime": 4
  },
  "response": {
    "numFound": 25,
    "start": 0,
    "maxScore": 1.1224447,
    "docs": [
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/7436299398173390476",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 18.99,
        "name": "freeverse software 005 solace",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17001745805951209994",
        "description": "in the noble tradition of axis & alliestm freeverse software unleashes an epic strategy board game that's so addicting it will leave you sleep deprived and socially inept! in the noble tradition of axis & alliestm freeverse software unleashes an ...",
        "_version_": 1524074329499762700
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/10584509515076384561",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329559531500
      },
      {
        "price": 19.99,
        "name": "freeverse software 4001 northland",
        "manufacturer": "freeverse software",
        "id": "http://www.google.com/base/feeds/snippets/17283219592038470822",
        "description": "stand-alone real-time strategy game based on viking mythology description: stand-alone real-time strategy game based on viking mythology.game features:single player campaign with 8 missions including several sub missions. the exciting plots tells ...",
        "_version_": 1524074329681166300
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "freeverse software",
        "doc": {
          "discount": 0.2
        }
      }
    ]
  }
}

In this case, there was only one manufacturer represented in the requested top 4 rows of the Solr results set.

Using two XJoin components in the same query

It’s worth noting that you can use more than one XJoin component in the same query. You can come up with more complicated examples, but this one shows how to query for all products that have a manufacturer discount as well as a product discount:

blog$ curl 'localhost:8983/solr/products/xjoin?q=*&x_product_offers=true&x_manufacturer_offers=true&fq=\{!xjoin\}x_product_offers&fq=\{!xjoin\}x_manufacturer_offers&fl=id,name,manufacturer&rows=4&wt=json' | jq .

You might have to try again a few times before you get a non-empty result set – here’s one I got:

{
  "responseHeader": {
    "status": 0,
    "QTime": 7
  },
  "response": {
    "numFound": 2,
    "start": 0,
    "docs": [
      {
        "name": "apple software m8789z/a webobjects 5.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/4776201646741876078"
      },
      {
        "name": "apple software m9301z/b soundtrack v1.2",
        "manufacturer": "apple software",
        "id": "http://www.google.com/base/feeds/snippets/16537637847870148950"
      }
    ]
  },
  "x_product_offers": {
    "count": 64,
    "external": [
      {
        "joinId": "http://www.google.com/base/feeds/snippets/4776201646741876078",
        "doc": {
          "discount": 0.59
        }
      },
      {
        "joinId": "http://www.google.com/base/feeds/snippets/16537637847870148950",
        "doc": {
          "discount": 0.22
        }
      }
    ]
  },
  "x_manufacturer_offers": {
    "count": 3,
    "external": [
      {
        "joinId": "apple software",
        "doc": {
          "discount": 0.3
        }
      }
    ]
  }
}

So you can see that there are two external results sections, one for product offers and one for manufacturer offers, and how the offers are matched to the products by the join ids (which is either the product id, or the manufacturer).

Next time…

In my next blog post, I’ll dive into another demonstration of XJoin, in which I show how to use click-through data to influence the score of subsequent searches.

How we built a search engine for UK MP tweets with Solr, Python & StanfordNLP
6 February 2014 – http://www.flax.co.uk/blog/2014/02/06/how-we-built-a-search-engine-for-uk-mp-tweets-with-solr-python-stanfordnlp/

Matt Pearce writes:

We recently released UKMP, a search application built on work done on last year’s Enterprise Search hack day. This presents the tweets of UK Members of Parliament with search options including filtering by party, retweet and favourite count, and entities (people, locations and organisations) extracted from the tweet text. This is obviously its first incarnation, so there are still a number of features in development, but I thought I would comment on some of the decisions taken while developing the site.

I started off by deciding which bits of the hack day code would be most useful, from both the Solr set-up side and the web application we were hoping to build. During the hack day, the group had split into a number of smaller teams, with two of them working on a set of data downloaded from Twitter, containing the original set of UK MP tweets. I took the basic Solr setup and indexing code from one group, and the initial web application from the other.

Obviously we couldn’t work with a completely static data set, so I set about putting together a Python script to grab the tweets. This was where I met the first hurdle: I was trying to grab tweets from individual MPs’ feeds, but kept getting blocked by the Twitter API, even though I didn’t think I was over-stepping the limits set on the calls. With 200-plus MPs to track, a different approach was required to avoid being blocked. Eventually, I started using the lists compiled by Tweetminster, who track politicians’ tweets themselves. This worked much better, and I could soon start building a useful data set.

I chose the second group’s web application because it already used the Stanford NLP software to extract entities from the tweet text. The indexer script, also written in Python, calls the web app to extract the entities before indexing the tweets. We spent some time trying to incorporate the Stanford sentiment analysis as well, but found it wasn’t practical – the response time was too slow, and we didn’t have time to train a model on our data to provide a more useful analysis of the content (almost all tweets were rated as either “negative” or “neutral”, which didn’t accurately reflect the sentiments in the data).
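In outline, the indexing loop looks something like the sketch below. This is a simplification rather than the actual UKMP code – the entity-extraction endpoint, field names and Solr core name are all illustrative:

import json
import requests

NER_URL = 'http://localhost:8080/extract'                        # illustrative endpoint
SOLR_URL = 'http://localhost:8983/solr/ukmp/update?commit=true'  # illustrative core name

def index_tweets(tweets):
    docs = []
    for tweet in tweets:
        # Ask the web app (which wraps the Stanford NLP tools) for entities in the text
        entities = requests.post(NER_URL, data=tweet['text'].encode('utf-8')).json()
        doc = dict(tweet)
        doc.update(entities)   # e.g. adds 'people', 'locations', 'organisations' fields
        docs.append(doc)
    # Send the enriched documents to Solr in a single update
    headers = {'content-type': 'application/json'}
    requests.post(SOLR_URL, data=json.dumps(docs), headers=headers).raise_for_status()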

Since this was an entirely new project, and because it was being done outside the main client workflow, I took the opportunity to try out AngularJS, an MVC-oriented JavaScript front-end framework. This runs on top of, and calls back to, the DropWizard web application, which provides the Model part of the Model-View-Controller system. AngularJS itself provides the Controller, while the Views are all written in fairly standard HTML, with some AngularJS frosting to fill in the content.

AngularJS itself generally made development very easy and fast, and I was pleased by how little JavaScript I had to write to build a working application (there is also a Bootstrap crossover module, providing AngularJS directives to work with the UI layout tools Bootstrap provides). Since this is a small site, there are only two controllers in play: one for each page. AngularJS also makes it very easy to plug in other script modules, such as that used to generate the word cloud on the About page. However, I did come across a few sticking points as I built the app, as one might expect from a first-time user. The principal one was handling the search box at the top of the page, which had to be independent of the view while needing to modify it to display the search results. I am still not sure that I ended up with the best approach – the search form fires an event when submitted, which then percolates up the AngularJS control hierarchy until caught and dealt with: within the search page, the search is handled normally; from other pages, we redirect to the search page and pass in the term. It doesn’t feel as smooth as it should do, which is why I remain unconvinced this is the best solution.

All in all, this was an interesting sideline project, and provided a good excuse to try out some new technology. The code itself, along with some notes on how to get the system up and running, is in our github repository – feel free to try it out, and make suggestions for improvements or better ways to use the code.

Cambridge Search Meetup – a night of crawling and scraping
22 February 2013 – http://www.flax.co.uk/blog/2013/02/22/cambridge-search-meetup-a-night-of-crawling-and-scraping/

Last night was the busiest ever Cambridge Search Meetup, with two excellent talks and a lot of discussion and networking. First was Harry Waye of Arachnys, who provide access to data on emerging markets that no-one else has, using a variety of custom crawling technology and heavy use of tools such as Google Translate. If you want to trawl the Greek corporate registry or find out financial news from Kazakhstan, a standard Google search is little help: Harry talked about how Arachnys have experimented with Google Custom Search Engine and the ‘headless browser’ PhantomJS to crawl sites.

Our second talk was from Shane Evans, whom I first met when he led software development for our client Mydeco. It was there that he first worked on Scrapy, an open source Python crawling framework: Shane showed how easy it is to get a Scrapy web spider running in a few lines of code, and how extensible and customisable Scrapy is for a huge variety of crawling and scraping situations. There’s even a fully hosted version at Scrapinghub with graphical tools for setting up web crawling and page scraping. We’re big fans of Scrapy at Flax and we’ve used it in a number of projects, so it was good to see an overview of why Scrapy exists and how it can be used.
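To give a flavour of quite how little code is involved, here’s what a minimal spider looks like in a recent version of Scrapy – the start URL and selectors are placeholders rather than anything from Shane’s talk, and it can be run with scrapy runspider titles_spider.py -o titles.json:

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'titles'
    start_urls = ['http://example.com/']   # placeholder start page

    def parse(self, response):
        # Emit an item for the current page...
        yield {'url': response.url, 'title': response.css('title::text').get()}
        # ...then follow links and parse those pages too
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)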

Thanks to both our speakers, who travelled from out of town, as did several other attendees. We’re pleased to say this was our 15th Meetup and we now have 100 members – we’re already planning further events, one of which will be on the evening of the first day of the Enterprise Search Europe conference.

Open source search engines and programming languages
3 September 2010 – http://www.flax.co.uk/blog/2010/09/03/open-source-search-engines-and-programming-languages/

So you’re writing a search-related application in your favourite language, and you’ve decided to choose an open source search engine to power it. So far, so good – but how are the two going to communicate?

Let’s look at two engines, Xapian and Lucene, and compare how this might be done. Lucene is written in Java, Xapian in C/C++ – so if you’re using those languages respectively, everything should be relatively simple – just download the source code and get on with it. However if this isn’t the case, you’re going to have to work out how to interface to the engine.

The Lucene project has been rewritten in several other languages: for C/C++ there’s Lucy (which includes Perl and Ruby bindings), for Python there’s PyLucene, and there’s even a .Net version called, not surprisingly, Lucene.NET. Some of these ‘ports’ of Lucene are ‘looser’ than others (i.e. they may not share the same API or feature set), and they may not be updated as often as Lucene itself. There are also versions in Perl, Ruby, Delphi or even Lisp (scary!) – there’s a full list available. Not all are currently active projects.

Xapian takes a different approach, with only one core project, but a sheaf of bindings to other languages. Currently these bindings cover C#, Java, Perl, PHP, Python, Ruby and Tcl – but interestingly these are auto-generated using the Simplified Wrapper and Interface Generator or SWIG. This means that every time Xapian’s API changes, the bindings can easily be updated to reflect this (it’s actually not quite that simple, but SWIG copes with the vast majority of code that would otherwise have to be manually edited). SWIG actually supports other languages as well (according to the SWIG website, “Common Lisp (CLISP, Allegro CL, CFFI, UFFI), Lua, Modula-3, OCAML, Octave and R. Also several interpreted and compiled Scheme implementations (Guile, MzScheme, Chicken)”) so in theory bindings to these could also be built relatively easily.
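As a flavour of what the generated bindings look like from the caller’s side, here’s a minimal index-and-search sketch using the Python bindings (the database path and text are arbitrary):

import xapian

# Index a single document
db = xapian.WritableDatabase('example_db', xapian.DB_CREATE_OR_OPEN)
termgen = xapian.TermGenerator()
termgen.set_stemmer(xapian.Stem('en'))
doc = xapian.Document()
termgen.set_document(doc)
termgen.index_text('Open source search engines and programming languages')
doc.set_data('example document')
db.add_document(doc)
db.commit()

# Search it back
qp = xapian.QueryParser()
qp.set_stemmer(xapian.Stem('en'))
qp.set_database(db)
enquire = xapian.Enquire(db)
enquire.set_query(qp.parse_query('programming'))
for match in enquire.get_mset(0, 10):
    print(match.docid, match.document.get_data())

The equivalent Java, PHP or Ruby code is almost a line-for-line translation, which is exactly the point of the SWIG-generated bindings.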

There’s also another way to communicate with both engines, using a search server. SOLR is the search server for Lucene, whereas for Xapian there is Flax Search Service. In this case, any language that supports Web Services (you’d be hard pressed to find a modern language that doesn’t) can communicate with the engine, simply passing data over the HTTP protocol.
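For example, querying a Solr server over HTTP from Python needs nothing more than an HTTP client – the core name and query below are placeholders:

import requests

params = {'q': 'open source', 'wt': 'json', 'rows': 10}
r = requests.get('http://localhost:8983/solr/mycore/select', params=params)
print(r.json()['response']['numFound'])

The same few lines translate directly into any language with an HTTP library, which is the appeal of the search-server approach.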

flax.crawler arrives
2 August 2010 – http://www.flax.co.uk/blog/2010/08/02/flax-crawler-arrives/

We’ve recently uploaded a new crawler framework to the Flax code repository. This is designed for use from Python to build a web crawler for your project. It’s multithreaded and simple to use; here’s a minimal example:

import crawler

# Provide a 'content dumper' to store whatever the crawler finds and downloads
crawler.dump = MyContentDumperImplementation()

# Seed the URL pool with one or more start URLs, then kick off the multithreaded crawl
crawler.pool.add_url(StdURL("http://test/"))
crawler.pool.add_url(StdURL("http://anothertest/"))
crawler.start()

Note that you can provide your own implementation of various parts of the crawler – and you must at least provide a ‘content dumper’ to store whatever the crawler finds and downloads.

We’ve also included a reference implementation, a working crawler that stores URLs and downloaded content in a SQLite3 database.

flax.core 0.1 available
24 June 2010 – http://www.flax.co.uk/blog/2010/06/24/flax-core-0-1-available/

Charlie wrote previously that we try and work with flexible, lightweight frameworks: flax.core is a Python library for conveniently adding functionality to Xapian projects. The current (and first!) version is 0.1, which can be checked out from the flaxcode repository. This version supports named fields for indexing and search (no need to deal with prefixes or value numbers), facets, simplified query construction, and an optional action-oriented indexing framework.

Unlike Xappy, flax.core makes no attempt to abstract or hide the Xapian API, and is therefore aimed at a rather different audience. The reason is our observation that “interesting” search applications often require customisation at the Xapian API level, for example bespoke MatchDeciders, PostingSources or Sorters. Rather than having to dive in and modify the flax.core code, these application-specific modifications can happily co-exist with the unmodified flax.core (at least, this is the intention). It is also intended that flax.core remains minimal enough to easily port to other languages such as PHP or Java.

The primary flax.core class is Fieldmap, which associates a set of named fields with a Xapian database. As an example, the following code sets up a simple structure of one ‘freetext’ and one ‘filter’ field:

    import xapian
    import flax.core

    db = xapian.WritableDatabase('db', xapian.DB_CREATE)
    fm = flax.core.Fieldmap()
    fm.language = 'en'              # stem for English
    fm.setfield('mytext', False)      # freetext field
    fm.setfield('mydate', True)       # filter field

    fm.save(db)

and this code indexes some text and a datetime:

    doc = fm.document()
    doc.index('mytext', "I don't like spam.")
    doc.index('mydate', datetime(2010, 2, 3, 12, 0))
    fm.add_document(db, doc)
    db.flush()

Fields can be of type string, int, float or datetime. These are handled automatically, and are not tied to fieldnames (so it would be possible to have field instances of different types, not that this is a good idea).

Indexing can also be performed by the Action framework. In this case, a text file contains a list of:

  • external identifiers (such as XPaths, SQL column names, etc.)
  • flax fieldname
  • indexing actions

For example, an actions file for XML might look like this:

    .//metadata[@name='Author']/@value
        author: filter(facet)
        author2: index(default)

    .//metadata[@name='Year']/@value
        published: numeric

This means that ‘Author’ metadata elements are indexed as two flax fields: ‘author’ is a filter field which stores facet values, while ‘author2’ is a freetext field which is searchable by default. ‘Year’ metadata elements are indexed as the flax field ‘published’, which is numeric.

The flaxcode repository contains two example flax.core applications here:

    applications/flax_core_examples

One is an XML indexer implemented in less than 100 lines; the other is a minimal web search application of a similar length. Currently there is no documentation other than these examples and the docstrings in flax.core. If anyone needs some, I’ll put some together.

Packaged solutions and customisability, the Python way
14 June 2010 – http://www.flax.co.uk/blog/2010/06/14/packaged-solutions-and-customisability-the-python-way/

With any large-scale software installation, some customisation and tweaking is going to be necessary, and enterprise search systems are no exception. Whatever features are packaged with a system, some of the ones you need will be missing and some of those provided won’t be used at all. It’s rare to see a situation where the search engine can just be installed straight out of the box.

Our Flax system is based on the Xapian core, which has bindings to various languages including Perl, Python, PHP, Java, Ruby, C# and even Tcl, making integration relatively easy with systems where a particular language is preferred. However, for the Flax layer itself (comprising file filters, indexers, crawlers, front ends, administration tools etc. – the ‘toolkit’ for building a complete search system) we chose Python, for much the same reasons as the Ultraseek developers did back in 2003.

The flexibility of Python means we can add any missing features very fast, and create complete new systems in a matter of days – for example, often a complete indexer can be created in less than 50 lines of code, by re-using existing components and taking advantage of the many Python modules available (such as XML parsers). Our open source approach also means that solutions we create for one customer can often be repurposed and adapted for another – which again makes for very short development cycles. Python is also available on a wide variety of platforms.

We’re not alone in our preference for Python of course!

Python and Flax presentation
25 June 2009 – http://www.flax.co.uk/blog/2009/06/25/python-and-flax-presentation/

My colleague Richard Boulton will be presenting at Europython in Birmingham, U.K. next week, specifically at 15.30 on Tuesday 30th June – an abstract is available. He’ll be talking about Xapian, Xappy and Flax, and showing examples of these in action including one using a Django integration layer.
