Searching for IP addresses in text with Elasticsearch -

We recently implemented a search solution for a customer using Elasticsearch. Most of their requirements were fairly standard, however they also wanted to be able to search for IP addresses embedded in the document text, using a flexible and precise search syntax, e.g. given the following document fragment:

    ... the API can be accessed at 167.87.3.201 on port 8700 ...

the following searches should all find the document:

  167.87.3.201
  *.87.3.201
  *.87.*.201
  167.[80-100].3.*
  etc.

While it would have been possible to implement the multiple wildcard requirement with Elasticsearch/Lucene regular expression queries, there is no simple way to handle the numeric range requirement without constructing some fairly complex regexps. Furthermore, regular expression queries can be slow to run (depending on the complexity of the expression and the size of the index), and this application had a large index.

The obvious thing to do here is to parse the IP address into separate numbers and index it into numeric fields. e.g.:

  {
    "ip1": 167,
    "ip2": 87,
    "ip3": 3,
    "ip4": 201,
    "text": "the API can be ..."
  }

Then, user queries such as “167.[80-100].3.*” can be parsed into an Elasticsearch query:

  {
    "query": {
      "bool": {
        "must": [
          { "term": { "ip1": 167 }},
          { "range": { "ip2": { "from": 80, "to": 100 }}},
          { "term": { "ip3": 3 }}
        ]
      }}}

(please note that these queries are for illustrative purposes only, and are untested).

Unfortunately, this approach fails when there is more than one IP address per document (as there generally was in this case), since if multiple values exist for the ipN fields the relationship between each component is lost. For example, a document containing:

    ... servers at 167.133.88.1 and 176.90.3.10 are load balanced ...

would spuriously match the user query above, despite the fact that neither IP address matches the query exactly. One possibility would be to use dynamic fields to index each address to a different set of fields:

  {
    "ip1_1": 167,
    "ip2_1": 133,
    "ip3_1": 88,
    "ip4_1": 1,
    "ip1_2": 176,
    "ip2_2": 90,
    "ip3_2": 3,
    "ip4_2": 10,
  }

However, queries would have to cover all possible IP fields with repeated OR subqueries, which would quickly become ugly and unmanagable.

Luckily, Elasticsearch nested documents provide exactly the mechanism we need to preserve the IP address structure within the main document (Solr does too, though this post does not go into the details). This is most easily explained with a JSON example with two IP addresses:

  {
    "text": "Lorem ipsum dolor sit amet, ei impetus persecuti eam...",
    "ipaddr" : [
      {
        "ip1": 167,
        "ip2": 133,
        "ip3": 88,
        "ip4": 1
      },
      {
        "ip1": 176,
        "ip2": 90,
        "ip3": 3,
        "ip4": 10
      }
    ]
  }

This requires a declaration of the ipaddr type as “nested” in the index mapping:

  ...
  "mappings": {
    "document": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "standard"
        },
        "ipaddr" : {
          "type" : "nested"
        },
        ...
      }}}

The child documents are created by the indexer script, which uses a regular expression to find all IP addresses in the document content and parses them into separate numbers. IP addresses can then be searched for using the nested query type, e.g:

  {
    "nested" : {
      "path" : "ipaddr",
      "query" : {
        "bool": {
            "must": [
              { "term": { "ip1": 167 }},
              { "range": { "ip2": { "from": 80, "to": 100 }}},
              { "term": { "ip3": 3 }}
            ]}}}}

This query selects parent documents containing at least one ipaddr child document which matches the query. Internally, children are stored as separate documents from parents, but the join is done transparently and extremely fast.

Nested queries can, of course, be combined with text queries etc. The application we built for the client (in AngularJS and Python/Flask) parses user queries to extract IP query expressions and builds combined text, boolean and nested queries to implement the required search logic.

One slight problem with this approach is that IP addresses are not included in any highlighted summaries generated by Elasticsearch as part of search results. This is because the highlighter does not know where in the text the matching IP address is. There is no simple way around this, so to generate highlighted search summaries we used our own standalone highlighter component, extending it to ‘understand’ the IP query syntax. This code is Apache 2 licensed and is free to download and use.

To sum up, this post outlines how we used Elasticsearch’s nested document type to implement a flexible and fast IP address search syntax. Of course, the same approach could be used to search any other type of structured entity in document text, such as social security numbers, ISBNs etc.

Facebook

Google+

Twitter

Leave a Reply Cancel reply