Elasticsearch and Logstash notes
Table of Contents
- 1. Introduction
- 2. Design
- 3. Presentations
- 4. Elasticsearch Examples
- 4.1. Target where field values should be retrieved from
- 4.2. Create a new array field or append to it with a script
- 4.3. De-compound words to transform large conjunctions to multiple tokens
- 4.4. Excluding a field from _source is still searchable
- 4.5. Output format for Geo-point data types
- 4.6. Using an array of types in an ids query
- 4.7. Changing the default postings format for a field
- 4.8. top_hits aggregation with a Groovy script (_score)
- 4.9. Using the field_value_factor function in a function score query
- 4.10. Naming a query to return which part of the query matched
- 4.11. Dynamically change logging level of Elasticsearch servers
- 4.12. Blocking a cluster from reading or writing
- 4.13. Sorting with a script
- 4.14. Doc-values with arrays of object fields
- 4.15. Wikimedia's source_regex query equivalent
- 4.16. String interpolation in Groovy Scripts
- 4.17. Using BM25 or DFR instead of TF-IDF
- 4.18. Get the current time in a Groovy script
- 4.19. Formatting strings in Groovy scripts
- 4.20. Does highlighting work with ngrams?
- 4.21. Filter aggregations do not load field data
- 4.22. Determining why a shard will not be allocated
- 4.23. Returning the scores of matching documents in a scroll request
- 4.24. Inner hits example
- 4.25. Does setting an analyzer and not_analyzed make ES unhappy?
- 4.26. Removing norms from the _all field dynamically
- 4.27. Combining scores from BM25 and TF-IDF indices
- 4.28. Searching with a slop phrase has a higher score for adjacent terms
- 4.29. Circular parent-child references from Grandparent to Grandchild
- 4.30. Geo distance sorting
- 5. Logstash Examples
1 Introduction
This is a list of tests, examples, and scripts that I have created in order to either reproduce an issue, test a bugfix, or validate a behavior.
Most of these examples will either be in shell format, relying on the use of curl, or in es-mode format, which will also work in Sense. If you are reading this as an org-mode file, you can tangle blocks to generate scripts if so desired.
If you are an Emacs user and want the original, plain-text .org file, replace the .html for any page with .org to download the file.
This file was last exported: 2016-08-04 Thu 09:37
2 Design
I do a lot of design in org-mode also. My definition of "design" is really more note-taking or measurement-gathering, so some of these may be more like scratch pads and some will be more like concrete design docs.
As with any of this information, it could be out of date, or it could be entirely wrong where I tested against an older version of Elasticsearch.
3 Presentations
I've given a few Elasticsearch presentations; the ones that are publicly available are listed here:
4 Elasticsearch Examples
4.1 Target where field values should be retrieved from
Sometimes it can be useful to tell Elasticsearch where to retrieve a field's values from, because the data can come back in different formats depending on where it is read from.
4.1.1 Create the index
Create 3 different string fields, where:
- _source is stored (body1)
- the field is stored by Lucene (body2)
- the field is not stored at all (body3)
DELETE /4492
{}

POST /4492
{
  "mappings": {
    "doc": {
      "_source": {
        "enabled": true,
        "includes": ["body1"],
        "excludes": ["body2", "body3"]
      },
      "properties": {
        "body1": {"type": "string"},
        "body2": {"type": "string", "store": true},
        "body3": {"type": "string", "store": false},
        "when": {"type": "date", "store": false, "format": "basic_date_time"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.1.2 Index docs
Index two documents with the different storage options. The first document has its date indexed as a string, the other as an integer.
POST /4492/doc/1?refresh
{
  "body1": "foo",
  "body2": "foo",
  "body3": "foo",
  "when": "20140113T121628.345-0700"
}

POST /4492/doc/2?refresh
{
  "body1": "bar",
  "body2": "bar",
  "body3": "bar",
  "when": 1389636769
}
{"_index":"4492","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"4492","_type":"doc","_id":"2","_version":1,"created":true}
4.1.3 Old-style Query (no sources specified)
We can't retrieve the body3 and when fields here, because they are stored neither in _source nor as Lucene stored fields.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2", "body3", "when"]
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body1" : [ "foo" ], "body2" : [ "foo" ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body1" : [ "bar" ], "body2" : [ "bar" ] } } ] } }
4.1.4 New-style Query (fielddata_fields)
Retrieving the body3 and when fields from the field data cache. Notice that the when field is always returned as a number, even if it was sent to Elasticsearch as a string.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2"],
  "fielddata_fields": ["body3", "when"]
}
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body2" : [ "foo" ], "body3" : [ "foo" ], "body1" : [ "foo" ], "when" : [ 1389640588345 ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body2" : [ "bar" ], "body3" : [ "bar" ], "body1" : [ "bar" ], "when" : [ 1389636769 ] } } ] } }
4.1.5 Script fields query
Retrieving body3 and when as script fields also uses fielddata, but it is a bit slower because it goes through script execution; using fielddata_fields is a better way to do this.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2"],
  "script_fields": {
    "body3": {
      "script": "doc[\"body3\"].value"
    },
    "when": {
      "script": "doc[\"when\"].value"
    }
  }
}
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body2" : [ "foo" ], "body3" : [ "foo" ], "body1" : [ "foo" ], "when" : [ 1389640588345 ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body2" : [ "bar" ], "body3" : [ "bar" ], "body1" : [ "bar" ], "when" : [ 1389636769 ] } } ] } }
4.2 Create a new array field or append to it with a script
If the array doesn't already exist, it needs to be created; a sketch of a script that does this follows the setup below.
4.2.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "innerName": {
              "type": "string"
            },
            "value": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.2.2 Index doc
POST /test/doc/1?refresh
{"name": "Mike"}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
4.3 De-compound words to transform large conjunctions to multiple tokens
In this example, the text "catdogmouse" can be transformed into the separate tokens "cat", "dog", and "mouse" using a decompounding token filter.
4.3.1 Create the index
DELETE /decom
{}

POST /decom
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.3.2 Analyze some text
es-mode requires the body of the request be inside a "{}", which is a bug I need to fix…
POST /decom/_analyze?field=body&pretty
{racecatthings}

POST /decom/_analyze?field=body&pretty
{catdogmouse}
{ "tokens" : [ { "token" : "racecatthings", "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "cat", "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 } ] } { "tokens" : [ { "token" : "catdogmouse", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "cat", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "dog", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "mouse", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 } ] }
4.4 Excluding a field from _source is still searchable
Demonstrating that a field not contained in _source is still searchable.
4.4.1 Create the index
DELETE /exs-filter
{}

POST /exs-filter
{
  "mappings": {
    "doc": {
      "_source": {
        "excludes": ["ratings"]
      },
      "properties": {
        "body": {"type": "string"},
        "ratings": {"type": "string"}
      }
    }
  }
}
{"error":"IndexMissingException[[exs-filter] missing]","status":404} {"ok":true,"acknowledged":true}
4.4.2 Index some docs
POST /exs-filter/doc/1
{"body": "foo", "ratings": "bar"}

POST /exs-filter/_refresh
{}
{"ok":true,"_index":"exs-filter","_type":"doc","_id":"1","_version":1} {"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}
4.4.3 Perform the query
curl -XPOST 'localhost:9200/exs-filter/_search?pretty' -d'
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "ratings": "bar"
        }
      }
    }
  }
}'
{ "took" : 24, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "exs-filter", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source" : {"body":"foo"} } ] } }
4.5 Output format for Geo-point data types
Someone recently asked which format geo data is returned in: it is returned in the same format it was indexed in. This example demonstrates the different formats a geo-point can be indexed in.
4.5.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "geo_point"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.5.2 Index docs
POST /test/doc/1
{"body": { "lat": 41.12, "lon": -71.34 } }

POST /test/doc/2
{"body": "41.12,-71.34"}

POST /test/doc/3
{"body": "drm3btev3e86"}

POST /test/doc/4
{"body": [-71.34, 41.12]}

POST /test/_refresh
{}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"4","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.5.3 Query
POST /test/_search?pretty&fields=_source,body
{
  "query": {
    "match_all": {}
  }
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 4, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": { "lat": 41.12, "lon": -71.34 } } }, { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 1.0, "_source":{"body": "41.12,-71.34"}, "fields" : { "body" : [ "41.12,-71.34" ] } }, { "_index" : "test", "_type" : "doc", "_id" : "3", "_score" : 1.0, "_source":{"body": "drm3btev3e86"}, "fields" : { "body" : [ "drm3btev3e86" ] } }, { "_index" : "test", "_type" : "doc", "_id" : "4", "_score" : 1.0, "_source":{"body": [-71.34, 41.12]}, "fields" : { "body" : [ -71.34, 41.12 ] } } ] } }
4.6 Using an array of types in an ids query
Even though it's "type" and not "types", multiple types can be specified as an array.
4.6.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc1": {
      "properties": {
        "body": {"type": "string"}
      }
    },
    "doc2": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.6.2 Index docs
POST /test/doc1/1
{"body": "foo"}

POST /test/doc2/2
{"body": "foo"}

POST /test/_refresh
{}
{"_index":"test","_type":"doc1","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc2","_id":"2","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.6.3 Query
POST /test/_search?pretty
{
  "query": {
    "ids": {
      "type": ["doc1", "doc2"],
      "values": ["1", "2"]
    }
  }
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc1", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} }, { "_index" : "test", "_type" : "doc2", "_id" : "2", "_score" : 1.0, "_source":{"body": "foo"} } ] } }
4.7 Changing the default postings format for a field
This can be useful, for instance, to work around bloom filter generation, or if you want to live on the edge and use a non-supported format (don't do this). A rough sketch follows.
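As a rough sketch of what this looked like on older 1.x releases (if I remember right, the postings_format mapping option was deprecated in 1.4 and removed later, so treat this as historical; the pftest index and id field are made up for illustration):

POST /pftest
{
  "mappings": {
    "doc": {
      "properties": {
        // "bloom_default" wrapped the default postings format with a bloom
        // filter on the terms dictionary (name recalled from old 1.x docs)
        "id": {
          "type": "string",
          "index": "not_analyzed",
          "postings_format": "bloom_default"
        }
      }
    }
  }
}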
4.8 top_hits aggregation with a Groovy script (_score)
Scripting can be combined with the top_hits aggregation for custom scoring of the joined hits. Not saying you should do this, but you can if you need to…
4.8.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"},
        "domain": {"type": "integer"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.8.2 Index docs
POST /test/doc/1
{"body": "elections", "domain": 1}

POST /test/doc/2
{"body": "nope elections", "domain": 2}

POST /test/doc/3
{"body": "nope", "domain": 2}

POST /test/_refresh
{}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.8.3 Query
POST /test/_search?pretty
{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top-sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit": {
          "max": {
            "script": "_score",
            "lang": "groovy"
          }
        }
      }
    }
  }
}
{ "took" : 669, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "elections", "domain": 1} }, { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 0.625, "_source":{"body": "nope elections", "domain": 2} } ] }, "aggregations" : { "top-sites" : { "buckets" : [ { "key" : 2, "doc_count" : 1, "top_hit" : { "value" : 0.0 }, "top_tags_hits" : { "hits" : { "total" : 1, "max_score" : 0.625, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 0.625, "_source":{"body": "nope elections", "domain": 2} } ] } } }, { "key" : 1, "doc_count" : 1, "top_hit" : { "value" : 0.0 }, "top_tags_hits" : { "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "elections", "domain": 1} } ] } } } ] } } }
4.9 Using the field_value_factor function in a function score query
By far the most common use case I see for function_score is multiplying the score of a document by some field inside the document, whether it be star rating for hotels or popularity for foods. So instead of requiring the user to write a Groovy script, it would be nice if we could provide an easy way to do this.
Source here: https://github.com/dakrone/… Defunct: this has since been merged into Elasticsearch.
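As a sanity check on the scores in the query below (my arithmetic, from my reading of the modifier docs): the log2p modifier computes log10(2 + factor * value), and with boost_mode sum that value is added to the query score. For a document with popularity 5 and factor 3.5:

log10(2 + 3.5 * 5) = log10(19.5) ≈ 1.29

which lines up with doc 2's final score of ~2.11 once its query score (~0.82) is added.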
4.9.1 Create the index
DELETE /fvfs
{}

POST /fvfs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"},
        "popularity": {"type": "integer"}
      }
    }
  }
}
HTTP/1.1 404 Not Found Content-Type: application/json; charset=UTF-8 Content-Length: 62 {"error":"IndexMissingException[[fvfs] missing]","status":404} {"acknowledged":true}
4.9.2 Index docs
POST /fvfs/doc/1
{"body": "foo foo", "popularity": 7}

POST /fvfs/doc/2
{"body": "foo", "popularity": 5}

POST /fvfs/doc/3
{"body": "foo", "popularity": [2, 99]}

POST /fvfs/doc/4
{"body": "foo eggplant", "popularity": 0}

POST /fvfs/doc/5
{"body": "foo bar"}
{"_index":"fvfs","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"4","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"5","_version":1,"created":true}
4.9.3 Query
POST /fvfs/_search?pretty
{
  "query": {
    "function_score": {
      "query": {
        "simple_query_string": {
          "query": "foo",
          "fields": ["body"]
        }
      },
      "functions": [
        {
          "filter": {
            "range": {
              "popularity": {
                "lte": 100
              }
            }
          },
          "field_value_factor": {
            "field": "popularity",
            "factor": 3.5,
            "modifier": "log2p"
          }
        }
      ],
      "score_mode": "max",
      "boost_mode": "sum"
    }
  },
  "explain": false
}
{ "took" : 90, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 5, "max_score" : 2.1459785, "hits" : [ { "_index" : "fvfs", "_type" : "doc", "_id" : "1", "_score" : 2.1459785, "_source":{"body": "foo foo", "popularity": 7} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "2", "_score" : 2.107713, "_source":{"body": "foo", "popularity": 5} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "3", "_score" : 1.7719209, "_source":{"body": "foo", "popularity": [2, 99]} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "5", "_score" : 1.511049, "_source":{"body": "foo bar"} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "4", "_score" : 0.812079, "_source":{"body": "foo eggplant", "popularity": 0} } ] } }
4.10 Naming a query to return which part of the query matched
Sometimes people ask how they can tell which part of a query matched a particular document. All ES queries support the _name field, which is then returned in the hits to indicate which of the queries matched.
4.10.1 Create an index
DELETE /named
{}

POST /named
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"ok":true,"acknowledged":true} {"ok":true,"acknowledged":true}
POST /named/doc/1
{"body": "foo"}

POST /named/_refresh
{}
{"ok":true,"_index":"named","_type":"doc","_id":"1","_version":1} {"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}
POST /named/_search?pretty
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "body": {
              "_name": "blah",
              "boost": 1.1,
              "value": "foo"
            }
          }
        }
      ]
    }
  }
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.30685282, "hits" : [ { "_index" : "named", "_type" : "doc", "_id" : "1", "_score" : 0.30685282, "_source" : {"body": "foo"}, "matched_filters" : [ "blah" ] } ] } }
4.11 Dynamically change logging level of Elasticsearch servers
I always forget this, so here is how to do it dynamically with the cluster update settings API:
PUT /_cluster/settings
{
  "transient": {
    // change the root logging level
    "logger._root": "DEBUG",
    // set it for a regular namespace; the "org.elasticsearch" prefix is not required
    "logger.recovery": "TRACE"
  }
}
{"acknowledged":true,"persistent":{},"transient":{"logger":{"_root":"DEBUG","recovery":"TRACE"}}}
4.12 Blocking a cluster from reading or writing
Sometimes you don't want anyone reading from or writing to your cluster. You can do this with a cluster block, which is not well documented:
4.12.1 Create an index and index a document
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
{"acknowledged":true} {"acknowledged":true}
POST /test/doc/1
{"body": "foo"}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
4.12.2 Apply a cluster block
PUT /_cluster/settings
{
  "transient": {
    // the whole cluster is read-only now
    "cluster.blocks.read_only": true
  }
}
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"blocks":{"read_only":"true"}}}}
4.12.3 Try some operations that are forbidden now
POST /test/doc/2
{"body": "foo"}

POST /newindex
{}
HTTP/1.1 403 Forbidden Content-Type: application/json; charset=UTF-8 Content-Length: 98 {"error":"ClusterBlockException[blocked by: [FORBIDDEN/6/cluster read-only (api)];]","status":403} HTTP/1.1 403 Forbidden Content-Type: application/json; charset=UTF-8 Content-Length: 98 {"error":"ClusterBlockException[blocked by: [FORBIDDEN/6/cluster read-only (api)];]","status":403}
4.12.4 Query
Queries still work, because they are read operations:
POST /test/_search?pretty
{
  "query": {
    "match_all": {}
  }
}
{ "took" : 66, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} } ] } }
4.12.5 Undo the cluster block
Then index a new document to prove the block has been removed:
PUT /_cluster/settings
{
  "transient": {
    // the cluster can be written to again
    "cluster.blocks.read_only": false
  }
}

POST /test/doc/2
{"body": "foo"}
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"blocks":{"read_only":"false"}}}} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}
4.13 Sorting with a script
Sometimes you may want to transform a field for sorting. NOTE: a better way to do this would be to use function_score to score based on the values of the strings, but this demonstrates doing it with sorting.
4.13.1 Create an index
DELETE /script-sort
{}

POST /script-sort
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.13.2 Index docs
POST /script-sort/doc/1
{"body": "foo"}

POST /script-sort/doc/2
{"body": "bar"}

POST /script-sort/doc/3
{"body": "baz"}

POST /script-sort/_refresh
{}
{"_index":"script-sort","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"script-sort","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"script-sort","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.13.3 Query
POST /script-sort/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "script": "meanings.get(doc['body'].value)",
        "type": "number",
        "lang": "groovy",
        "params": {
          "meanings": {
            "foo": 2,
            "bar": 1,
            "baz": 3
          }
        },
        "order": "asc"
      }
    }
  ]
}
{ "took" : 667, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "script-sort", "_type" : "doc", "_id" : "2", "_score" : null, "_source":{"body": "bar"}, "sort" : [ 1.0 ] }, { "_index" : "script-sort", "_type" : "doc", "_id" : "1", "_score" : null, "_source":{"body": "foo"}, "sort" : [ 2.0 ] }, { "_index" : "script-sort", "_type" : "doc", "_id" : "3", "_score" : null, "_source":{"body": "baz"}, "sort" : [ 3.0 ] } ] } }
4.14 Doc-values with arrays of object fields
It should be possible to use doc_values for arrays of object fields, according to Adrien.
4.14.1 Create an index
Creating an index with doc_values used for each field of an object array
DELETE /dv
{}

POST /dv
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "person": {
          "type": "object",
          "properties": {
            "first": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            },
            "last": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.14.2 Index docs
POST /dv/doc/1
{
  "person": [
    { "first": "John", "last": "Smith" },
    { "first": "Sally", "last": "Bones" },
    { "first": "John", "last": "Carter" }
  ]
}

POST /dv/_refresh
{}
{"_index":"dv","_type":"doc","_id":"1","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.14.3 Query
POST /dv/_search?search_type=count&pretty
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "myfirstnames": {
      "terms": {
        "field": "person.first"
      }
    },
    "mylastnames": {
      "terms": {
        "field": "person.last"
      }
    }
  }
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "mylastnames" : { "buckets" : [ { "key" : "Bones", "doc_count" : 1 }, { "key" : "Carter", "doc_count" : 1 }, { "key" : "Smith", "doc_count" : 1 } ] }, "myfirstnames" : { "buckets" : [ { "key" : "John", "doc_count" : 1 }, { "key" : "Sally", "doc_count" : 1 } ] } } }
And to show there is no fielddata used:
GET /_nodes/stats/indices?fields=*&pretty
{}
{ "memory_size_in_bytes": 0, "evictions": 0, "fields": {} }
4.15 Wikimedia's source_regex query equivalent
I wanted to see if it was possible to create an equivalent to Wikimedia's
source_regex
query plugin, so this is me trying to do it.
It's basically trigrams with a regex query rescore.
4.15.1 Create an index
DELETE /wm
{}

POST /wm
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "trigram_t",
            "filter": ["lowercase"]
          }
        },
        "tokenizer": {
          "trigram_t": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "trigram"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.15.2 Index docs
POST /wm/doc/1
{"body": "I can has test"}

POST /wm/doc/2
{"body": "I can't has cheezburger"}

POST /wm/doc/3
{"body": "can I have some things?"}

POST /wm/doc/4
{"body": "who art thou, to has such things?"}

POST /wm/doc/5
{"body": "Can I have that?"}

POST /wm/_refresh
{}
{"_index":"wm","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"4","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"5","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.15.3 Query
I reuse "i ca..has" because I am simulating a client that uses the same text in both places, but the match and rescore queries could just as easily use different text.
POST /wm/_search?pretty
{
  "query": {
    "match": {
      "body": "i ca..has"
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "regexp": {
          "body": "i ca..has"
        }
      }
    }
  }
}
{ "took" : 8, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.112542905, "hits" : [ { "_index" : "wm", "_type" : "doc", "_id" : "1", "_score" : 0.112542905, "_source":{"body": "I can has test"} }, { "_index" : "wm", "_type" : "doc", "_id" : "2", "_score" : 0.08440718, "_source":{"body": "I can't has cheezburger"} }, { "_index" : "wm", "_type" : "doc", "_id" : "4", "_score" : 0.0057871966, "_source":{"body": "who art thou, to has such things?"} } ] } }
Drew asked how "i ca..has" would be analyzed, so:
curl -XPOST 'localhost:9200/wm/_analyze?analyzer=trigram&pretty' -d'i ca..has'
{ "tokens" : [ { "token" : "i c", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 }, { "token" : " ca", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 2 }, { "token" : "ca.", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 3 }, { "token" : "a..", "start_offset" : 3, "end_offset" : 6, "type" : "word", "position" : 4 }, { "token" : "..h", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 5 }, { "token" : ".ha", "start_offset" : 5, "end_offset" : 8, "type" : "word", "position" : 6 }, { "token" : "has", "start_offset" : 6, "end_offset" : 9, "type" : "word", "position" : 7 } ] }
4.16 String interpolation in Groovy Scripts
Since String.format() is not in the whitelist, sometimes it's nice to be able to use string interpolation in scripts. Groovy allows doing this with GString interpolation.
4.16.1 Create an index
DELETE /script
{}

POST /script
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.16.2 Index docs
POST /script/doc/1?refresh
{"body": "foo"}

POST /script/doc/1/_update
{
  "script": "ctx._source.body = ctx._source.body + \"${bar}\"",
  "params": {
    "bar": " hi"
  }
}

GET /script/doc/1?_source=body
{}
{"_index":"script","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"script","_type":"doc","_id":"1","_version":2} {"_index":"script","_type":"doc","_id":"1","_version":2,"found":true,"_source":{"body":"foo hi"}}
4.17 Using BM25 or DFR instead of TF-IDF
While TF-IDF does a great job, sometimes people may want to use BM25, which is another nice similarity algorithm. This is an example of setting it up per-field so you can compare the two algorithms.
I did this with a multi-field that indexes the body field with all the different similarities, just so I could compare them all at once. The interesting thing, from talking to Robert about this, is that nothing is actually being changed during indexing; the per-field setting is just there as a safety in case that ever needs to be the case.
I'd like to make the similarity configurable at query time; I think I have a branch for it somewhere…
4.17.1 Create an index
DELETE /sim
{}

POST /sim
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "similarity": {
      "custom_bm25": {
        "type": "BM25",
        // These are the default values
        "k1": 1.2,  // how important term frequency is
        "b": 0.75   // how normalized field length should be
      }
    },
    "analysis": {
      "analyzer": {
        "sim_analyzer": {
          "tokenizer": "standard",
          "filters": ["lowercase", "kstem", "stop"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "fields": {
            "tfidf": {
              "type": "string",
              "similarity": "tfidf"
            },
            "dfr": {
              "type": "string",
              "similarity": "dfr"
            },
            "bm25": {
              "type": "string",
              // "BM25" could be used here to use the default values
              "similarity": "custom_bm25"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.17.2 Index docs
POST /sim/doc/1
{"body": "A quick brown fox jumped over the lazy brown dog"}

POST /sim/doc/2
{"body": "Fast jumping brown spiders"}

POST /sim/doc/3
{"body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders"}

POST /sim/_refresh
{}
{"_index":"sim","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"sim","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"sim","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.17.3 Query
Here I funnel all of the output through jq ".hits.hits[]" to output only the documents that match, with their scores.
First with the traditional TF-IDF:
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.tfidf"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.391954, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.26056325, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.03540124, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
Then with BM25:
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.bm25"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.9845756, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.8407544, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.060791545, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
Finally with DFR (Divergence From Randomness):
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.dfr"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.391954, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.26056325, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.03540124, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
4.18 Get the current time in a Groovy script
In MVEL it used to be easy to call time() to get the current time in a script; however, the Groovy script engine removed this. There is another way to get the current time, though.
4.18.1 Create an index and index a test document
DELETE /groovy
{}

POST /groovy
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "num": {"type": "long"}
      }
    }
  }
}

POST /groovy/doc/1?refresh
{"num": 5}
{"acknowledged":true} {"acknowledged":true} {"_index":"groovy","_type":"doc","_id":"1","_version":1,"created":true}
4.18.2 Update using the DateTime object
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.num = DateTime.now().getMillis()"
  }
}

GET /groovy/doc/1?pretty
{}
{"_index":"groovy","_type":"doc","_id":"1","_version":2} { "_index" : "groovy", "_type" : "doc", "_id" : "1", "_version" : 2, "found" : true, "_source":{"num":1420625502499} }
4.19 Formatting strings in Groovy scripts
Sometimes it can be helpful to format strings using String.format inside of a Groovy script. In addition, Groovy has string interpolation of its own that you can use.
4.19.1 Create an index and index a test document
DELETE /groovy
{}

POST /groovy
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}

POST /groovy/doc/1?refresh
{"body": "foo"}
{"acknowledged":true} {"acknowledged":true} {"_index":"groovy","_type":"doc","_id":"1","_version":1,"created":true}
4.19.2 Update the document using string formatting
This currently doesn't work, because String.format is not in the whitelist. I should probably add it, see <github issue here>.
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.body = String.format(\"%s: %d\", a, b)",
    "params": {
      "a": "bar",
      "b": 5
    }
  }
}

GET /groovy/doc/1
{}

// Not as powerful, because you can't specify things like decimal format, but
// still usable
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.body = \"${a}: ${b}\"",
    "params": {
      "a": "bar",
      "b": 5
    }
  }
}

GET /groovy/doc/1
{}
HTTP/1.1 400 Bad Request Content-Type: application/json; charset=UTF-8 Content-Length: 3079 {"error":"ElasticsearchIllegalArgumentException[failed to execute script]; nested: GroovyScriptCompilationException[MultipleCompilationErrorsException[startup failed:\nGeneral error during canonicalization: Method calls not allowed on [java.lang.String]\n\njava.lang.SecurityException: Method calls not allowed on [java.lang.String]\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitMethodCallExpression(SecureASTCustomizer.java:855)\n\tat org.codehaus.groovy.ast.expr.MethodCallExpression.visit(MethodCallExpression.java:64)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitBinaryExpression(SecureASTCustomizer.java:897)\n\tat org.codehaus.groovy.ast.expr.BinaryExpression.visit(BinaryExpression.java:49)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitExpressionStatement(SecureASTCustomizer.java:777)\n\tat org.codehaus.groovy.ast.stmt.ExpressionStatement.visit(ExpressionStatement.java:40)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitBlockStatement(SecureASTCustomizer.java:737)\n\tat org.codehaus.groovy.ast.stmt.BlockStatement.visit(BlockStatement.java:69)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer.call(SecureASTCustomizer.java:552)\n\tat org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1047)\n\tat org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:583)\n\tat org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:561)\n\tat org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:538)\n\tat groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:286)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:259)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:245)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:203)\n\tat org.elasticsearch.script.groovy.GroovyScriptEngineService.compile(GroovyScriptEngineService.java:119)\n\tat org.elasticsearch.script.ScriptService.getCompiledScript(ScriptService.java:353)\n\tat org.elasticsearch.script.ScriptService.compile(ScriptService.java:339)\n\tat org.elasticsearch.script.ScriptService.executable(ScriptService.java:463)\n\tat org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:183)\n\tat org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:176)\n\tat org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:170)\n\tat org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$1.run(TransportInstanceSingleOperationAction.java:187)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\n\n1 error\n]]; ","status":400} {"_index":"groovy","_type":"doc","_id":"1","_version":2,"found":true,"_source":{"body":"bar: 5"}} {"_index":"groovy","_type":"doc","_id":"1","_version":3} {"_index":"groovy","_type":"doc","_id":"1","_version":3,"found":true,"_source":{"body":"bar: 5"}}
4.20 Does highlighting work with ngrams?
So, talking to Ryan about removing support for the _analyzer field in the mapping: ngrams are the viable alternative, but do they work with highlighting? Short answer: yes, they do. Don't use _analyzer, as we're probably going to remove it soon.
4.20.1 Create an index
DELETE /ngrams
{}

POST /ngrams
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_ngram": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "ngram_tf"]
        }
      },
      "filter": {
        "ngram_tf": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_analyzer": {
        "path": "custom_analyzer"
      },
      "properties": {
        "body": {
          "type": "string",
          "fields": {
            "ngram": {
              "type": "string",
              "analyzer": "my_ngram"
            },
            "ngram_postings": {
              "type": "string",
              "analyzer": "my_ngram",
              "index_options": "offsets"
            },
            "ngram_fvh": {
              "type": "string",
              "analyzer": "my_ngram",
              "term_vector": "with_positions_offsets"
            },
            "french_postings": {
              "type": "string",
              "index_options": "offsets"
            },
            "french_fvh": {
              "type": "string",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.20.2 Index docs
POST /ngrams/doc/1?refresh
{
  "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi",
  "custom_analyzer": "french"
}
{"_index":"ngrams","_type":"doc","_id":"1","_version":1,"created":true}
4.20.3 Query with highlighting
POST /ngrams/_search?pretty
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "musée Louvre",
        "analyzer": "french",
        "slop": 1
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {},
      "body.french_postings": {},
      "body.french_fvh": {},
      "body.ngram": {},
      "body.ngram_postings": {},
      "body.ngram_fvh": {}
    }
  }
}

POST /ngrams/_search?pretty
{
  "query": {
    "match_phrase": {
      "body.ngram": {
        "query": "musée Louvre",
        "analyzer": "my_ngram",
        "slop": 1
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {},
      "body.french_postings": {},
      "body.french_fvh": {},
      "body.ngram": {},
      "body.ngram_postings": {},
      "body.ngram_fvh": {}
    }
  }
}
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.16273327, "hits" : [ { "_index" : "ngrams", "_type" : "doc", "_id" : "1", "_score" : 0.16273327, "_source":{ "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi", "custom_analyzer": "french" }, "highlight" : { "body.french_fvh" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body.french_postings" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body.ngram" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ] } } ] } } { "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.9730363, "hits" : [ { "_index" : "ngrams", "_type" : "doc", "_id" : "1", "_score" : 1.9730363, "_source":{ "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi", "custom_analyzer": "french" }, "highlight" : { "body.ngram_fvh" : [ "Le <em>musée</em> du <em>Louvre</em> est <em>ouvert</em> <em>tous</em> les <em>jours</em> sauf le mardi" ], "body.ngram_postings" : [ "Le <em>musée</em> du <em>Louvre</em> est <em>ouvert</em> <em>tous</em> les <em>jours</em> sauf le mardi" ] } } ] } }
4.21 Filter aggregations do not load field data
Using a field in a filter in aggregations does not load field data:
4.21.1 Create an index
DELETE /filteragg
{}

POST /filteragg
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "field1": {"type": "string", "index": "not_analyzed"},
        "field2": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.21.2 Index docs
POST /filteragg/doc/1
{"field1": "foo bar baz", "field2": "foo bar baz"}

POST /filteragg/doc/2
{"field1": "foo eggplant potato", "field2": "foo eggplant potato"}

POST /filteragg/_refresh
{}
{"_index":"filteragg","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"filteragg","_type":"doc","_id":"2","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.21.3 Query
POST /filteragg/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "one": {
      "aggs": {
        "myterms": {
          "terms": {
            "field": "field1"
          }
        }
      },
      "filter": {
        "query": {
          "query_string": {
            "query": "field2:foo"
          }
        }
      }
    }
  }
}
{ "doc_count": 2, "myterms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "foo bar baz", "doc_count": 1 }, { "key": "foo eggplant potato", "doc_count": 1 } ] } }
4.22 Determining why a shard will not be allocated
Suppose you create an index but can't figure out why its shards won't allocate. There are a couple of ways to diagnose this, like turning up the logging level, but the reroute API can also give a nice explanation:
4.22.1 Create an index that cannot be allocated
Because the index has 1 replica and this is a single-node cluster, the replica will not be able to be allocated:
DELETE /disktest
{}

POST /disktest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
{"acknowledged":true} {"acknowledged":true}
4.22.2 Explaining why a shard cannot be allocated
We can see why shards would not be allocated, which is because a replica cannot be allocated on the same node that the primary is allocated on:
POST /_cluster/reroute?dry_run&explain&pretty
{
  "commands": [
    {
      "allocate": {
        "index": "disktest",
        "shard": 0,
        "node": "NiifNsi5QNObqfgG4i2PCA"
      }
    }
  ]
}
{ "acknowledged" : true, "state" : { "version" : 30, "master_node" : "NiifNsi5QNObqfgG4i2PCA", "blocks" : { }, "nodes" : { "NiifNsi5QNObqfgG4i2PCA" : { "name" : "Bobster", "transport_address" : "inet[/192.168.0.4:9300]", "attributes" : { } } }, "routing_table" : { "indices" : { "disktest" : { "shards" : { "0" : [ { "state" : "STARTED", "primary" : true, "node" : "NiifNsi5QNObqfgG4i2PCA", "relocating_node" : null, "shard" : 0, "index" : "disktest" }, { "state" : "UNASSIGNED", "primary" : false, "node" : null, "relocating_node" : null, "shard" : 0, "index" : "disktest" } ] } } } }, "routing_nodes" : { "unassigned" : [ { "state" : "UNASSIGNED", "primary" : false, "node" : null, "relocating_node" : null, "shard" : 0, "index" : "disktest" } ], "nodes" : { "NiifNsi5QNObqfgG4i2PCA" : [ { "state" : "STARTED", "primary" : true, "node" : "NiifNsi5QNObqfgG4i2PCA", "relocating_node" : null, "shard" : 0, "index" : "disktest" } ] } }, "allocations" : [ ] }, "explanations" : [ { "command" : "allocate", "parameters" : { "index" : "disktest", "shard" : 0, "node" : "NiifNsi5QNObqfgG4i2PCA", "allow_primary" : false }, "decisions" : [ { "decider" : "same_shard", "decision" : "NO", "explanation" : "shard cannot be allocated on same node [NiifNsi5QNObqfgG4i2PCA] it already exists on" }, { "decider" : "filter", "decision" : "YES", "explanation" : "node passes include/exclude/require filters" }, { "decider" : "replica_after_primary_active", "decision" : "YES", "explanation" : "primary is already active" }, { "decider" : "throttling", "decision" : "YES", "explanation" : "below shard recovery limit of [2]" }, { "decider" : "enable", "decision" : "YES", "explanation" : "allocation disabling is ignored" }, { "decider" : "disable", "decision" : "YES", "explanation" : "allocation disabling is ignored" }, { "decider" : "awareness", "decision" : "YES", "explanation" : "no allocation awareness enabled" }, { "decider" : "shards_limit", "decision" : "YES", "explanation" : "total shard limit disabled: [-1] <= 0" }, { "decider" : "node_version", "decision" : "YES", "explanation" : "target node version [1.4.3] is same or newer than source node version [1.4.3]" }, { "decider" : "disk_threshold", "decision" : "YES", "explanation" : "only a single node is present" }, { "decider" : "snapshot_in_progress", "decision" : "YES", "explanation" : "shard not primary or relocation disabled" } ] } ] }
4.23 Returning the scores of matching documents in a scroll request
Sometimes you may want to issue a scroll but still return the actual score for each document. The Scroll API provides a way to do that using track_scores.
4.23.1 Create an index
DELETE /sctest
{}

POST /sctest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.23.2 Index some documents
POST /sctest/doc/1
{"body": "foo"}

POST /sctest/doc/2
{"body": "foo bar foo baz"}

POST /sctest/doc/3
{"body": "fooaloo"}

POST /sctest/doc/4?refresh
{"body": "foo foo foo foo foo"}
{"_index":"sctest","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"4","_version":1,"created":true}
4.23.3 Query
Instead of a regular query, we will perform a scan/scroll query over all of the results. I use track_scores: true here because without it Elasticsearch will not compute the score of each result.
POST /sctest/_search?scroll=1m&search_type=scan&pretty
{
  "query": {
    "match": {
      "body": "foo"
    }
  },
  "track_scores": true
}
{ "_scroll_id" : "c2NhbjsxOzM2OmxQNEU4Mi1JVGphME1vbGR1SkJSMGc7MTt0b3RhbF9oaXRzOjM7", "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.0, "hits" : [ ] } }
Then, using the scroll id from the previous response, we can see the documents and their scores (which would be 0.0 if track_scores were not set):
curl -XGET 'localhost:9200/_search/scroll?scroll=1m&pretty' -d'c2NhbjsxOzM2OmxQNEU4Mi1JVGphME1vbGR1SkJSMGc7MTt0b3RhbF9oaXRzOjM7'
{ "_scroll_id" : "c2NhbjswOzE7dG90YWxfaGl0czozOw==", "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.0, "hits" : [ { "_index" : "sctest", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} }, { "_index" : "sctest", "_type" : "doc", "_id" : "2", "_score" : 0.70710677, "_source":{"body": "foo bar foo baz"} }, { "_index" : "sctest", "_type" : "doc", "_id" : "4", "_score" : 0.97827977, "_source":{"body": "foo foo foo foo foo"} } ] } }
4.24 Inner hits example
Here's an example of a nested doc type with inner hits, used to retrieve the inner document that matched the nested query instead of the entire surrounding document.
4.24.1 Create an index
DELETE /inner
{}

POST /inner
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "task": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.24.2 Index docs
POST /inner/doc/1
{
  "task": [
    { "name": "foo" },
    { "name": "bar" }
  ]
}

POST /inner/_refresh
{}
{"_index":"inner","_type":"doc","_id":"1","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.24.3 Query
POST /inner/_search?pretty
{
  "query": {
    "nested": {
      "path": "task",
      "query": {
        "match": {
          "name": "foo"
        }
      },
      "inner_hits": {}
    }
  }
}
{ "took" : 7, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.4054651, "hits" : [ { "_index" : "inner", "_type" : "doc", "_id" : "1", "_score" : 1.4054651, "_source":{ "task": [ { "name": "foo" }, { "name": "bar" } ] }, "inner_hits" : { "task" : { "hits" : { "total" : 1, "max_score" : 1.4054651, "hits" : [ { "_index" : "inner", "_type" : "doc", "_id" : "1", "_nested" : { "field" : "task", "offset" : 0 }, "_score" : 1.4054651, "_source":{"name":"foo"} } ] } } } } ] } }
4.25 Does setting an analyzer and not_analyzed make ES unhappy?
ES doesn't care that you said both "not_analyzed" and "standard analyzer":
DELETE /analyzer-test
{}

POST /analyzer-test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "analyzer": "standard"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.26 Removing norms from the _all field dynamically
So, technically you should be able to remove norms from the _all field dynamically.
4.26.1 Create an index
DELETE /ntest
{}

POST /ntest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "_all": {
        "enabled": true
      },
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.26.2 Index docs
POST /ntest/doc/1?refresh
{"body": "foo bar baz"}

POST /ntest/doc/2?refresh
{"body": "bar baz eggplant"}

POST /ntest/doc/3?refresh
{"body": "baz"}

POST /ntest/doc/4?refresh
{"body": "eggplant"}
{"_index":"ntest","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"4","_version":1,"created":true}
Check the segments:
GET /ntest/_segments?pretty
{}
{ "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "indices" : { "ntest" : { "shards" : { "0" : [ { "routing" : { "state" : "STARTED", "primary" : true, "node" : "LiJ8aXvGSRuNhwcT5Pflgg" }, "num_committed_segments" : 0, "num_search_segments" : 4, "segments" : { "_0" : { "generation" : 0, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2337, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_1" : { "generation" : 1, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2362, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_2" : { "generation" : 2, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2289, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_3" : { "generation" : 3, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2324, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true } } } ] } } } }
4.26.3 Query
GET /ntest/_search?pretty
{
  "query": {
    "match": {
      "_all": "eggplant"
    }
  },
  "explain": true
}
[ { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 1 } ], "description": "fieldWeight in 0, product of:", "value": 1.287682 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 1.287682 }, "_source": { "body": "eggplant" }, "_score": 1.287682, "_id": "4", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 }, { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 0.5 } ], "description": "fieldWeight in 0, product of:", "value": 0.643841 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 0.643841 }, "_source": { "body": "bar baz eggplant" }, "_score": 0.643841, "_id": "2", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 } ]
4.26.4 Update the norms mapping
PUT /ntest/_mapping/doc
{
  "_all": {
    "enabled": true,
    "norms": {
      "enabled": false
    }
  }
}

GET /ntest/_mapping?pretty
{}
{"acknowledged":true} { "ntest" : { "mappings" : { "doc" : { "_all" : { "enabled" : true, "omit_norms" : true }, "properties" : { "body" : { "type" : "string" } } } } } }
Then force merge:
POST /ntest/_optimize?max_num_segments=1
{}
{"_shards":{"total":1,"successful":1,"failed":0}}
4.26.5 Search again
GET /ntest/_search?pretty
{
  "query": {
    "match": {
      "_all": "eggplant"
    }
  },
  "explain": true
}
[ { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=2)", "value": 1 } ], "description": "fieldWeight in 2, product of:", "value": 1.287682 } ], "description": "weight(_all:eggplant in 2) [PerFieldSimilarity], result of:", "value": 1.287682 }, "_source": { "body": "eggplant" }, "_score": 1.287682, "_id": "4", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 }, { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 0.5 } ], "description": "fieldWeight in 0, product of:", "value": 0.643841 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 0.643841 }, "_source": { "body": "bar baz eggplant" }, "_score": 0.643841, "_id": "2", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 } ]
4.27 Combining scores from BM25 and TF-IDF indices
With Lucene 6.0, BM25 will be the default similarity, so I'm curious how the scores will combine between older (TF-IDF) indices versus newer BM25 indices.
4.27.1 Create a couple of indices
DELETE /tfidf,bm25
{}

POST /tfidf
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string"
        }
      }
    }
  }
}

POST /bm25
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true} {"acknowledged":true}
4.27.2 Index the same documents into each index
POST /tfidf/doc/1
{"body": "foo"}

POST /tfidf/doc/2
{"body": "foo bar"}

POST /tfidf/doc/3
{"body": "foo bar baz"}

POST /bm25/doc/1
{"body": "foo"}

POST /bm25/doc/2
{"body": "foo bar"}

POST /bm25/doc/3
{"body": "foo bar baz"}

POST /tfidf,bm25/_refresh
{}
{"_index":"tfidf","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"tfidf","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"tfidf","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":2,"successful":2,"failed":0}}
4.27.3 Perform the query
Scores are NOT normalized between the BM25 and TF-IDF indices; each index scores hits with its own similarity, and the results are merged as-is.
POST /tfidf,bm25/_search?pretty
{
  "query": {
    "match": {
      "body": "foo"
    }
  }
  // , "explain": true
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 0.71231794, "hits" : [ { "_index" : "tfidf", "_type" : "doc", "_id" : "1", "_score" : 0.71231794, "_source":{"body": "foo"} }, { "_index" : "tfidf", "_type" : "doc", "_id" : "2", "_score" : 0.4451987, "_source":{"body": "foo bar"} }, { "_index" : "tfidf", "_type" : "doc", "_id" : "3", "_score" : 0.35615897, "_source":{"body": "foo bar baz"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "1", "_score" : 0.16786803, "_source":{"body": "foo"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "2", "_score" : 0.11980793, "_source":{"body": "foo bar"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "3", "_score" : 0.09476421, "_source":{"body": "foo bar baz"} } ] } }
4.28 Searching with a slop phrase has a higher score for adjacent terms
Basically, if two documents both match a sloppy phrase query, the one whose terms are adjacent (needing no slop) should score higher than the one that only matches because of the slop.
4.28.1 Create an index
DELETE /sloptest
{}

POST /sloptest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.28.2 Index docs
// "foo" and "baz" need a slop of 1 to match
POST /sloptest/doc/1
{"body": "foo bar baz"}

// "foo" and "baz" are adjacent, a slop of 0 matches
POST /sloptest/doc/2
{"body": "foo baz bar"}

POST /sloptest/_refresh
{}
{"_index":"sloptest","_type":"doc","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"sloptest","_type":"doc","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.28.3 Query
Notice the score difference:
POST /sloptest/_search?pretty
{
  "query": {
    "match": {
      "body": {
        "type": "phrase",
        "query": "foo baz",
        "slop": 1
      }
    }
  }
}
[ { "_source": { "body": "foo baz bar" }, "_score": 0.5945348, "_id": "2", "_type": "doc", "_index": "sloptest" }, { "_source": { "body": "foo bar baz" }, "_score": 0.4203996, "_id": "1", "_type": "doc", "_index": "sloptest" } ]
4.29 Circular parent-child references from Grandparent to Grandchild
Someone asked at a training whether it was possible to have circular parent-child relationships. With a single parent/child relationship this is expressly disabled, but if you go three levels deep…
4.29.1 Create an index with a grandparent, parent, and child
DELETE /test
{}

POST /test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "foo": {
      "_parent": {
        "type": "baz"
      },
      "properties": {
        "body": {"type": "string"}
      }
    },
    "bar": {
      "_parent": {
        "type": "foo"
      },
      "properties": {
        "body": {"type": "string"}
      }
    },
    "baz": {
      "_parent": {
        "type": "bar"
      },
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.29.2 Index docs
Index two documents per type, each pointing at its parent. Note that every document in a chain shares the same id ("1" or "2"), so parent-based routing happens to land a whole chain on the same shard (and this test index only has one shard anyway); with realistic ids, multi-generation relationships need explicit routing to keep a family together.
POST /test/foo/1?parent=1
{"body": "cat"}
POST /test/foo/2?parent=2
{"body": "dog"}

POST /test/bar/1?parent=1
{"body": "pig"}
POST /test/bar/2?parent=2
{"body": "llama"}

POST /test/baz/1?parent=1
{"body": "duck"}
POST /test/baz/2?parent=2
{"body": "emu"}

POST /test/_refresh
{}
{"_index":"test","_type":"foo","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"foo","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"bar","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"bar","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"baz","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"baz","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.29.3 Do some really complicated parent/child querying
So the circular parent/child reference actually works. I don't recommend doing this in practice, though.
GET /test/foo/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        {"match": {"body": "cat"}},
        {
          "has_child": {
            "type": "bar",
            "query": {
              "bool": {
                "must": [
                  {"match": {"body": "pig"}},
                  {
                    "has_child": {
                      "type": "baz",
                      "query": {
                        "bool": {
                          "must": [
                            {"match": {"body": "duck"}},
                            {
                              "has_child": {
                                "type": "foo",
                                "query": {
                                  "match": {"body": "cat"}
                                }
                              }
                            }
                          ]
                        }
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 2.3246877, "hits" : [ { "_index" : "test", "_type" : "foo", "_id" : "1", "_score" : 2.3246877, "_routing" : "1", "_parent" : "1", "_source":{"body": "cat"} } ] } }
4.30 Geo distance sorting
4.30.1 Create an index with a mapping that uses a geo_point field
DELETE /myindex
{}

POST /myindex
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.30.2 Index a few documents with geo_point locations
POST /myindex/doc/1
{
  "body": "mexican food",
  "location": {
    "lat": 41.12,
    "lon": -71.34
  }
}

POST /myindex/doc/2
{
  "body": "chinese food",
  "location": {
    "lat": 39.01,
    "lon": -75.00
  }
}

POST /myindex/doc/3
{
  "body": "dutch food",
  "location": {
    "lat": 25.12,
    "lon": -31.00
  }
}
{"_index":"myindex","_type":"doc","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"myindex","_type":"doc","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"myindex","_type":"doc","_id":"3","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
4.30.3 Perform the query with geo_distance sorting
POST /myindex/_search?pretty
{
  "query": {
    "match": {
      "body": "food"
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat": 40,
          "lon": -70
        },
        "order": "asc",
        "unit": "km"
      }
    },
    "_score"
  ]
}
{ "took" : 27, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "myindex", "_type" : "doc", "_id" : "1", "_score" : 0.4451987, "_source" : { "body" : "mexican food", "location" : { "lat" : 41.12, "lon" : -71.34 } }, "sort" : [ 168.24429169579741, 0.4451987 ] }, { "_index" : "myindex", "_type" : "doc", "_id" : "2", "_score" : 0.4451987, "_source" : { "body" : "chinese food", "location" : { "lat" : 39.01, "lon" : -75.0 } }, "sort" : [ 442.7024334265092, 0.4451987 ] }, { "_index" : "myindex", "_type" : "doc", "_id" : "3", "_score" : 0.4451987, "_source" : { "body" : "dutch food", "location" : { "lat" : 25.12, "lon" : -31.0 } }, "sort" : [ 3972.3297497833664, 0.4451987 ] } ] } }
5 Logstash Examples
A miscellaneous gathering of Logstash configurations that do various things.
5.1 Split log files into separate files by time or node
When a customer hands over 10 different Elasticsearch log files, it can be useful to merge them into combined multi-node logs separated by hour, or to take one huge log and split it into node-specific files.
This Logstash config does that; I find it quite useful for correlating the timing of events across multiple nodes.
input {
  stdin {}
}

filter {
  multiline {
    # a message starting with [ must be the start of the next event; works until 2099
    pattern => "^\[20"
    negate => "true"
    what => "previous"
  }

  grok {
    # also do inline trimming
    match => [ "message", "\[\s*%{DATA:date}\s*\]\[\s*%{DATA:loglevel}\s*\]\[%{DATA:class}\s*\] %{GREEDYDATA:logline}" ]
  }

  date {
    match => [ "date", "YYYY-MM-dd HH:mm:ss,SSS" ]
    timezone => "UTC"
  }

  # a logline starting with "[" might be a node name
  if [logline] =~ /^\[/ {
    grok {
      match => [ "logline", "\[%{DATA:node}\] %{GREEDYDATA}" ]
    }
  }
}

output {
  stdout { codec => dots }

  # Output events as hourly log files
  file {
    path => "es-%{+YYYY-MM-dd.HH}:00.log"
    message_format => "%{message}"
  }

  # Split a log by nodes
  # if [node] {
  #   file {
  #     path => "es-%{node}-%{+YYYY-MM-dd}.log"
  #     message_format => "%{message}"
  #   }
  # } else {
  #   file {
  #     path => "es-NONE-%{+YYYY-MM-dd}.log"
  #     message_format => "%{message}"
  #   }
  # }
}
You can switch between dumping everything into hourly files and splitting by node by commenting and uncommenting the different outputs; the node-split version is shown uncommented just below.
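For reference, this is the node-splitting output from the comments above, uncommented; it is the same stanza, nothing new:

output {
  stdout { codec => dots }

  # Split a log into per-node daily files
  if [node] {
    file {
      path => "es-%{node}-%{+YYYY-MM-dd}.log"
      message_format => "%{message}"
    }
  } else {
    # events where no node name could be extracted
    file {
      path => "es-NONE-%{+YYYY-MM-dd}.log"
      message_format => "%{message}"
    }
  }
}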
It can then be used like the following:
$ ls *.log
log1.log log2.log log3.log
$ cat *.log | bin/logstash -f split.conf
Where split.conf is the configuration file above. It will produce new log files.
5.2 Capture edits of Wikipedia pages from IRC
Inputs don't always have to be files.
input {
  irc {
    type => 'wikipedia'
    host => 'irc.wikimedia.org'
    nick => 'logstash-wikipedia'
    # change this to whatever you want... de.wikipedia for the German wikipedia, etc.
    channels => ['#en.wikipedia']
  }
}

filter {
  # remove some weird color-control encoding stuff from IRC
  mutate {
    gsub => [
      "message", "\u000302", "",
      "message", "\u000303", "",
      "message", "\u000307", "",
      "message", "\u000310", "",
      "message", "\u000314", "",
      "message", "\u00034", "",
      "message", "\u00035", "",
      "message", "\u0003", ""
    ]
  }

  # extract the page and the user
  grok {
    match => [ "message", "\[\[%{GREEDYDATA:page}\]\]%{GREEDYDATA} \* %{GREEDYDATA:user} \* %{GREEDYDATA}" ]
  }
}

output {
  stdout { codec => 'rubydebug' }

  elasticsearch {
    protocol => 'http'
    host => 'localhost'
    index => 'wikipedia-edits'
  }
}
Run with bin/logstash -f wiki-edits.conf and watch the edits roll in!
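Once a few edits have been indexed, a quick sanity check is to search the wikipedia-edits index; the page value here is just an example, use whatever pages are actually being edited:

GET /wikipedia-edits/_search?pretty
{
  "query": {
    "match": {
      "page": "Elasticsearch"
    }
  }
}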