Elasticsearch and Logstash notes
Table of Contents
- 1. Introduction
- 2. Design
- 3. Presentations
- 4. Elasticsearch Examples
- 4.1. Target where field values should be retrieved from
- 4.2. Create a new array field or append to it with a script
- 4.3. De-compound words to transform large conjunctions to multiple tokens
- 4.4. Excluding a field from _source is still searchable
- 4.5. Output format for Geo-point data types
- 4.6. Using an array of types in an ids query
- 4.7. Changing the default postings format for a field
- 4.8. top_hits aggregation with a Groovy script (_score)
- 4.9. Using the field_value_factor function in a function score query
- 4.10. Naming a query to return which part of the query matched
- 4.11. Dynamically change logging level of Elasticsearch servers
- 4.12. Blocking a cluster from reading or writing
- 4.13. Sorting with a script
- 4.14. Doc-values with arrays of object fields
- 4.15. Wikimedia's source_regex query equivalent
- 4.16. String interpolation in Groovy Scripts
- 4.17. Using BM25 or DFR instead of TF-IDF
- 4.18. Get the current time in a Groovy script
- 4.19. Formatting strings in Groovy scripts
- 4.20. Does highlighting work with ngrams?
- 4.21. Filter aggregations do not load field data
- 4.22. Determining why a shard will not be allocated
- 4.23. Returning the scores of matching documents in a scroll request
- 4.24. Inner hits example
- 4.25. Does setting an analyzer and not_analyzed make ES unhappy?
- 4.26. Removing norms from the _all field dynamically
- 4.27. Combining scores from BM25 and TF-IDF indices
- 4.28. Searching with a slop phrase has a higher score for adjacent terms
- 4.29. Circular parent-child references from Grandparent to Grandchild
- 4.30. Geo distance sorting
- 5. Logstash Examples
1 Introduction
This is a list of tests, examples, and scripts that I have created in order to either reproduce an issue, test a bugfix, or validate a behavior.
Most of these examples will either be in shell format, relying on the use of curl, or in es-mode format, which will also work in Sense. If you are reading this as an org-mode file, you can tangle blocks to generate scripts if so desired.
If you are an Emacs user and want the original, plain-text .org file, replace the .html for any page with .org to download the file.
This file was last exported: 2016-08-04 Thu 09:37
2 Design
I do a lot of design in org-mode also. My definition of "design" is really more note-taking or measurement-gathering, so some of these may be more like scratch pads and some will be more like concrete design docs.
As with any of this information, it could be out of date, or it could be entirely wrong where I tested against an older version of Elasticsearch.
3 Presentations
I've given a few Elasticsearch presentations; the ones that are publicly available are listed here:
4 Elasticsearch Examples
4.1 Target where field values should be retrieved from
Sometimes it can be useful to tell Elasticsearch where to retrieve a field's values from, because the data can come back in different formats depending on where it is read from.
4.1.1 Create the index
Create 3 different string fields, where:
- _source is stored (body1)
- the field is stored by Lucene (body2)
- the field is not stored at all (body3)
DELETE /4492
{}

POST /4492
{
  "mappings": {
    "doc": {
      "_source": {
        "enabled": true,
        "includes": ["body1"],
        "excludes": ["body2", "body3"]
      },
      "properties": {
        "body1": {"type": "string"},
        "body2": {"type": "string", "store": true},
        "body3": {"type": "string", "store": false},
        "when": {"type": "date", "store": false, "format": "basic_date_time"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.1.2 Index docs
Index two documents with the different storage options. The first document has its date indexed as a string, the other as an integer.
POST /4492/doc/1?refresh
{
  "body1": "foo",
  "body2": "foo",
  "body3": "foo",
  "when": "20140113T121628.345-0700"
}

POST /4492/doc/2?refresh
{
  "body1": "bar",
  "body2": "bar",
  "body3": "bar",
  "when": 1389636769
}
{"_index":"4492","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"4492","_type":"doc","_id":"2","_version":1,"created":true}
4.1.3 Old-style Query (no sources specified)
We can't retrieve the body3 and when fields here, because they are stored neither in _source nor as Lucene stored fields.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2", "body3", "when"]
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body1" : [ "foo" ], "body2" : [ "foo" ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body1" : [ "bar" ], "body2" : [ "bar" ] } } ] } }
4.1.4 New-style Query (fielddata_fields)
Retrieving the body3 and when fields from the field data cache. Notice that the when field is always returned as a number, even if it was sent to Elasticsearch as a string.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2"],
  "fielddata_fields": ["body3", "when"]
}
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body2" : [ "foo" ], "body3" : [ "foo" ], "body1" : [ "foo" ], "when" : [ 1389640588345 ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body2" : [ "bar" ], "body3" : [ "bar" ], "body1" : [ "bar" ], "when" : [ 1389636769 ] } } ] } }
4.1.5 Script fields query
Retrieving body3 and when as script fields also uses fielddata, but it is a bit slower because it goes through script execution; using fielddata_fields is a better way to do this.
POST /4492/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "fields": ["body1", "body2"],
  "script_fields": {
    "body3": {
      "script": "doc[\"body3\"].value"
    },
    "when": {
      "script": "doc[\"when\"].value"
    }
  }
}
{ "took" : 4, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "4492", "_type" : "doc", "_id" : "1", "_score" : 1.0, "fields" : { "body2" : [ "foo" ], "body3" : [ "foo" ], "body1" : [ "foo" ], "when" : [ 1389640588345 ] } }, { "_index" : "4492", "_type" : "doc", "_id" : "2", "_score" : 1.0, "fields" : { "body2" : [ "bar" ], "body3" : [ "bar" ], "body1" : [ "bar" ], "when" : [ 1389636769 ] } } ] } }
4.2 Create a new array field or append to it with a script
If the array doesn't already exist, it needs to be created; a sketch of a script that does this follows the setup below.
4.2.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string"
        },
        "tags": {
          "type": "nested",
          "properties": {
            "innerName": {
              "type": "string"
            },
            "value": {
              "type": "long"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.2.2 Index doc
POST /test/doc/1?refresh
{"name": "Mike"}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
4.3 De-compound words to transform large conjunctions to multiple tokens
In this example, the text "catdogmouse" can be transformed into the separate tokens "cat", "dog", and "mouse" using a decompounding token filter.
4.3.1 Create the index
DELETE /decom
{}

POST /decom
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "decom_analyzer": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["decom_filter"]
          }
        },
        "filter": {
          "decom_filter": {
            "type": "dictionary_decompounder",
            "word_list": ["cat", "dog", "mouse"]
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "decom_analyzer"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.3.2 Analyze some text
es-mode requires the body of the request be inside a "{}", which is a bug I need to fix…
POST /decom/_analyze?field=body&pretty
{racecatthings}

POST /decom/_analyze?field=body&pretty
{catdogmouse}
{ "tokens" : [ { "token" : "racecatthings", "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "cat", "start_offset" : 1, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 1 } ] } { "tokens" : [ { "token" : "catdogmouse", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "cat", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "dog", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "mouse", "start_offset" : 1, "end_offset" : 12, "type" : "<ALPHANUM>", "position" : 1 } ] }
4.4 Excluding a field from _source is still searchable
Demonstrating that a field not contained in _source is still searchable.
4.4.1 Create the index
DELETE /exs-filter
{}

POST /exs-filter
{
  "mappings": {
    "doc": {
      "_source": {
        "excludes": ["ratings"]
      },
      "properties": {
        "body": {"type": "string"},
        "ratings": {"type": "string"}
      }
    }
  }
}
{"error":"IndexMissingException[[exs-filter] missing]","status":404} {"ok":true,"acknowledged":true}
4.4.2 Index some docs
POST /exs-filter/doc/1
{"body": "foo", "ratings": "bar"}

POST /exs-filter/_refresh
{}
{"ok":true,"_index":"exs-filter","_type":"doc","_id":"1","_version":1} {"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}
4.4.3 Perform the query
curl -XPOST 'localhost:9200/exs-filter/_search?pretty' -d'
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "ratings": "bar"
        }
      }
    }
  }
}'
{ "took" : 24, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "exs-filter", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source" : {"body":"foo"} } ] } }
4.5 Output format for Geo-point data types
Someone recently asked which format geo data is returned in: it is returned in the same format it was indexed in. This example demonstrates the different formats a geo-point can be indexed in.
4.5.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "geo_point"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.5.2 Index docs
POST /test/doc/1
{"body": { "lat": 41.12, "lon": -71.34 } }

POST /test/doc/2
{"body": "41.12,-71.34"}

POST /test/doc/3
{"body": "drm3btev3e86"}

POST /test/doc/4
{"body": [-71.34, 41.12]}

POST /test/_refresh
{}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"4","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.5.3 Query
POST /test/_search?pretty&fields=_source,body
{
  "query": {
    "match_all": {}
  }
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 4, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": { "lat": 41.12, "lon": -71.34 } } }, { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 1.0, "_source":{"body": "41.12,-71.34"}, "fields" : { "body" : [ "41.12,-71.34" ] } }, { "_index" : "test", "_type" : "doc", "_id" : "3", "_score" : 1.0, "_source":{"body": "drm3btev3e86"}, "fields" : { "body" : [ "drm3btev3e86" ] } }, { "_index" : "test", "_type" : "doc", "_id" : "4", "_score" : 1.0, "_source":{"body": [-71.34, 41.12]}, "fields" : { "body" : [ -71.34, 41.12 ] } } ] } }
4.6 Using an array of types in an ids query
Even though it's "type" and not "types", multiple types can be specified as an array.
4.6.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc1": {
      "properties": {
        "body": {"type": "string"}
      }
    },
    "doc2": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.6.2 Index docs
POST /test/doc1/1
{"body": "foo"}

POST /test/doc2/2
{"body": "foo"}

POST /test/_refresh
{}
{"_index":"test","_type":"doc1","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc2","_id":"2","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.6.3 Query
POST /test/_search?pretty
{
  "query": {
    "ids": {
      "type": ["doc1", "doc2"],
      "values": ["1", "2"]
    }
  }
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc1", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} }, { "_index" : "test", "_type" : "doc2", "_id" : "2", "_score" : 1.0, "_source":{"body": "foo"} } ] } }
4.7 Changing the default postings format for a field
This can be useful, for instance, to work around bloom filter generation, or if you want to live on the edge and use a non-supported format (don't do this). A rough sketch follows.
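As a rough sketch of what this looked like on older 1.x releases (if I remember right, the postings_format mapping option was deprecated in 1.4 and removed later, so treat this as historical; the pftest index and id field are made up for illustration):

POST /pftest
{
  "mappings": {
    "doc": {
      "properties": {
        // "bloom_default" wrapped the default postings format with a bloom
        // filter on the terms dictionary (name recalled from old 1.x docs)
        "id": {
          "type": "string",
          "index": "not_analyzed",
          "postings_format": "bloom_default"
        }
      }
    }
  }
}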
4.8 top_hits aggregation with a Groovy script (_score)
Scripting can be combined with the top_hits aggregation for custom scoring of the joined hits. Not saying you should do this, but you can if you need to…
4.8.1 Create the index
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"},
        "domain": {"type": "integer"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.8.2 Index docs
POST /test/doc/1
{"body": "elections", "domain": 1}

POST /test/doc/2
{"body": "nope elections", "domain": 2}

POST /test/doc/3
{"body": "nope", "domain": 2}

POST /test/_refresh
{}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"test","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.8.3 Query
POST /test/_search?pretty
{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top-sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit": {
          "max": {
            "script": "_score",
            "lang": "groovy"
          }
        }
      }
    }
  }
}
{ "took" : 669, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "elections", "domain": 1} }, { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 0.625, "_source":{"body": "nope elections", "domain": 2} } ] }, "aggregations" : { "top-sites" : { "buckets" : [ { "key" : 2, "doc_count" : 1, "top_hit" : { "value" : 0.0 }, "top_tags_hits" : { "hits" : { "total" : 1, "max_score" : 0.625, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "2", "_score" : 0.625, "_source":{"body": "nope elections", "domain": 2} } ] } } }, { "key" : 1, "doc_count" : 1, "top_hit" : { "value" : 0.0 }, "top_tags_hits" : { "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "elections", "domain": 1} } ] } } } ] } } }
4.9 Using the field_value_factor function in a function score query
By far the most common use case I see for function_score is multiplying the score of a document by some field inside the document, whether it be star rating for hotels or popularity for foods. So instead of requiring the user to write a Groovy script, it would be nice if we could provide an easy way to do this.
Source here: https://github.com/dakrone/… Defunct: this has since been merged into Elasticsearch.
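As a sanity check on the scores in the query below (my arithmetic, from my reading of the modifier docs): the log2p modifier computes log10(2 + factor * value), and with boost_mode sum that value is added to the query score. For a document with popularity 5 and factor 3.5:

log10(2 + 3.5 * 5) = log10(19.5) ≈ 1.29

which lines up with doc 2's final score of ~2.11 once its query score (~0.82) is added.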
4.9.1 Create the index
DELETE /fvfs
{}

POST /fvfs
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"},
        "popularity": {"type": "integer"}
      }
    }
  }
}
HTTP/1.1 404 Not Found Content-Type: application/json; charset=UTF-8 Content-Length: 62 {"error":"IndexMissingException[[fvfs] missing]","status":404} {"acknowledged":true}
4.9.2 Index docs
POST /fvfs/doc/1
{"body": "foo foo", "popularity": 7}

POST /fvfs/doc/2
{"body": "foo", "popularity": 5}

POST /fvfs/doc/3
{"body": "foo", "popularity": [2, 99]}

POST /fvfs/doc/4
{"body": "foo eggplant", "popularity": 0}

POST /fvfs/doc/5
{"body": "foo bar"}
{"_index":"fvfs","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"4","_version":1,"created":true} {"_index":"fvfs","_type":"doc","_id":"5","_version":1,"created":true}
4.9.3 Query
POST /fvfs/_search?pretty
{
  "query": {
    "function_score": {
      "query": {
        "simple_query_string": {
          "query": "foo",
          "fields": ["body"]
        }
      },
      "functions": [
        {
          "filter": {
            "range": {
              "popularity": {
                "lte": 100
              }
            }
          },
          "field_value_factor": {
            "field": "popularity",
            "factor": 3.5,
            "modifier": "log2p"
          }
        }
      ],
      "score_mode": "max",
      "boost_mode": "sum"
    }
  },
  "explain": false
}
{ "took" : 90, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 5, "max_score" : 2.1459785, "hits" : [ { "_index" : "fvfs", "_type" : "doc", "_id" : "1", "_score" : 2.1459785, "_source":{"body": "foo foo", "popularity": 7} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "2", "_score" : 2.107713, "_source":{"body": "foo", "popularity": 5} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "3", "_score" : 1.7719209, "_source":{"body": "foo", "popularity": [2, 99]} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "5", "_score" : 1.511049, "_source":{"body": "foo bar"} }, { "_index" : "fvfs", "_type" : "doc", "_id" : "4", "_score" : 0.812079, "_source":{"body": "foo eggplant", "popularity": 0} } ] } }
4.10 Naming a query to return which part of the query matched
Sometimes people ask how they can tell which part of a query matched a particular document. All ES queries support the _name field, which is then returned in the hits to indicate which of the queries matched.
4.10.1 Create an index
DELETE /named
{}

POST /named
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"ok":true,"acknowledged":true} {"ok":true,"acknowledged":true}
POST /named/doc/1
{"body": "foo"}

POST /named/_refresh
{}
{"ok":true,"_index":"named","_type":"doc","_id":"1","_version":1} {"ok":true,"_shards":{"total":10,"successful":5,"failed":0}}
POST /named/_search?pretty
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "body": {
              "_name": "blah",
              "boost": 1.1,
              "value": "foo"
            }
          }
        }
      ]
    }
  }
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.30685282, "hits" : [ { "_index" : "named", "_type" : "doc", "_id" : "1", "_score" : 0.30685282, "_source" : {"body": "foo"}, "matched_filters" : [ "blah" ] } ] } }
4.11 Dynamically change logging level of Elasticsearch servers
I always forget this, so here is how to do it dynamically with the cluster update settings API:
PUT /_cluster/settings
{
  "transient": {
    // change the root logging level
    "logger._root": "DEBUG",
    // set it for a regular namespace; the "org.elasticsearch" prefix is not required
    "logger.recovery": "TRACE"
  }
}
{"acknowledged":true,"persistent":{},"transient":{"logger":{"_root":"DEBUG","recovery":"TRACE"}}}
4.12 Blocking a cluster from reading or writing
Sometimes you don't want anyone reading from or writing to your cluster. You can do this with a cluster block, which is not well documented:
4.12.1 Create an index and index a document
DELETE /test
{}

POST /test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}
{"acknowledged":true} {"acknowledged":true}
POST /test/doc/1
{"body": "foo"}
{"_index":"test","_type":"doc","_id":"1","_version":1,"created":true}
4.12.2 Apply a cluster block
PUT /_cluster/settings
{
  "transient": {
    // the whole cluster is read-only now
    "cluster.blocks.read_only": true
  }
}
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"blocks":{"read_only":"true"}}}}
4.12.3 Try some operations that are forbidden now
POST /test/doc/2
{"body": "foo"}

POST /newindex
{}
HTTP/1.1 403 Forbidden Content-Type: application/json; charset=UTF-8 Content-Length: 98 {"error":"ClusterBlockException[blocked by: [FORBIDDEN/6/cluster read-only (api)];]","status":403} HTTP/1.1 403 Forbidden Content-Type: application/json; charset=UTF-8 Content-Length: 98 {"error":"ClusterBlockException[blocked by: [FORBIDDEN/6/cluster read-only (api)];]","status":403}
4.12.4 Query
Queries still work, because they are read operations:
POST /test/_search?pretty
{
  "query": {
    "match_all": {}
  }
}
{ "took" : 66, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} } ] } }
4.12.5 Undo the cluster block
Then index a new document to prove the block has been removed:
PUT /_cluster/settings
{
  "transient": {
    // the cluster can be written to again
    "cluster.blocks.read_only": false
  }
}

POST /test/doc/2
{"body": "foo"}
{"acknowledged":true,"persistent":{},"transient":{"cluster":{"blocks":{"read_only":"false"}}}} {"_index":"test","_type":"doc","_id":"2","_version":1,"created":true}
4.13 Sorting with a script
Sometimes you may want to transform a field for sorting. NOTE: a better way to do this would be to use function_score to score based on the values of the strings, but this demonstrates doing it with sorting.
4.13.1 Create an index
DELETE /script-sort
{}

POST /script-sort
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.13.2 Index docs
POST /script-sort/doc/1
{"body": "foo"}

POST /script-sort/doc/2
{"body": "bar"}

POST /script-sort/doc/3
{"body": "baz"}

POST /script-sort/_refresh
{}
{"_index":"script-sort","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"script-sort","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"script-sort","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.13.3 Query
POST /script-sort/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "script": "meanings.get(doc['body'].value)",
        "type": "number",
        "lang": "groovy",
        "params": {
          "meanings": {
            "foo": 2,
            "bar": 1,
            "baz": 3
          }
        },
        "order": "asc"
      }
    }
  ]
}
{ "took" : 667, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "script-sort", "_type" : "doc", "_id" : "2", "_score" : null, "_source":{"body": "bar"}, "sort" : [ 1.0 ] }, { "_index" : "script-sort", "_type" : "doc", "_id" : "1", "_score" : null, "_source":{"body": "foo"}, "sort" : [ 2.0 ] }, { "_index" : "script-sort", "_type" : "doc", "_id" : "3", "_score" : null, "_source":{"body": "baz"}, "sort" : [ 3.0 ] } ] } }
4.14 Doc-values with arrays of object fields
It should be possible to use doc_values for arrays of object fields, according to Adrien.
4.14.1 Create an index
Creating an index with doc_values used for each field of an object array
DELETE /dv
{}

POST /dv
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "person": {
          "type": "object",
          "properties": {
            "first": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            },
            "last": {
              "type": "string",
              "index": "not_analyzed",
              "doc_values": true
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.14.2 Index docs
POST /dv/doc/1
{
  "person": [
    { "first": "John", "last": "Smith" },
    { "first": "Sally", "last": "Bones" },
    { "first": "John", "last": "Carter" }
  ]
}

POST /dv/_refresh
{}
{"_index":"dv","_type":"doc","_id":"1","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.14.3 Query
POST /dv/_search?search_type=count&pretty
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "myfirstnames": {
      "terms": {
        "field": "person.first"
      }
    },
    "mylastnames": {
      "terms": {
        "field": "person.last"
      }
    }
  }
}
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "mylastnames" : { "buckets" : [ { "key" : "Bones", "doc_count" : 1 }, { "key" : "Carter", "doc_count" : 1 }, { "key" : "Smith", "doc_count" : 1 } ] }, "myfirstnames" : { "buckets" : [ { "key" : "John", "doc_count" : 1 }, { "key" : "Sally", "doc_count" : 1 } ] } } }
And to show there is no fielddata used:
GET /_nodes/stats/indices?fields=*&pretty
{}
{ "memory_size_in_bytes": 0, "evictions": 0, "fields": {} }
4.15 Wikimedia's source_regex query equivalent
I wanted to see if it was possible to create an equivalent to Wikimedia's
source_regex
query plugin, so this is me trying to do it.
It's basically trigrams with a regex query rescore.
4.15.1 Create an index
DELETE /wm
{}

POST /wm
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "index": {
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "trigram_t",
            "filter": ["lowercase"]
          }
        },
        "tokenizer": {
          "trigram_t": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3
          }
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "trigram"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.15.2 Index docs
POST /wm/doc/1
{"body": "I can has test"}

POST /wm/doc/2
{"body": "I can't has cheezburger"}

POST /wm/doc/3
{"body": "can I have some things?"}

POST /wm/doc/4
{"body": "who art thou, to has such things?"}

POST /wm/doc/5
{"body": "Can I have that?"}

POST /wm/_refresh
{}
{"_index":"wm","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"4","_version":1,"created":true} {"_index":"wm","_type":"doc","_id":"5","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.15.3 Query
I reuse "i ca..has" because I am simulating a client that uses the same text in both places, but the match and rescore queries could just as easily use different text.
POST /wm/_search?pretty
{
  "query": {
    "match": {
      "body": "i ca..has"
    }
  },
  "rescore": {
    "window_size": 10,
    "query": {
      "rescore_query": {
        "regexp": {
          "body": "i ca..has"
        }
      }
    }
  }
}
{ "took" : 8, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.112542905, "hits" : [ { "_index" : "wm", "_type" : "doc", "_id" : "1", "_score" : 0.112542905, "_source":{"body": "I can has test"} }, { "_index" : "wm", "_type" : "doc", "_id" : "2", "_score" : 0.08440718, "_source":{"body": "I can't has cheezburger"} }, { "_index" : "wm", "_type" : "doc", "_id" : "4", "_score" : 0.0057871966, "_source":{"body": "who art thou, to has such things?"} } ] } }
Drew asked how "i ca..has" would be analyzed, so:
curl -XPOST 'localhost:9200/wm/_analyze?analyzer=trigram&pretty' -d'i ca..has'
{ "tokens" : [ { "token" : "i c", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 }, { "token" : " ca", "start_offset" : 1, "end_offset" : 4, "type" : "word", "position" : 2 }, { "token" : "ca.", "start_offset" : 2, "end_offset" : 5, "type" : "word", "position" : 3 }, { "token" : "a..", "start_offset" : 3, "end_offset" : 6, "type" : "word", "position" : 4 }, { "token" : "..h", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 5 }, { "token" : ".ha", "start_offset" : 5, "end_offset" : 8, "type" : "word", "position" : 6 }, { "token" : "has", "start_offset" : 6, "end_offset" : 9, "type" : "word", "position" : 7 } ] }
4.16 String interpolation in Groovy Scripts
Since String.format() is not in the whitelist, sometimes it's nice to be able to use string interpolation in scripts. Groovy allows doing this with GString interpolation.
4.16.1 Create an index
DELETE /script
{}

POST /script
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.16.2 Index docs
POST /script/doc/1?refresh
{"body": "foo"}

POST /script/doc/1/_update
{
  "script": "ctx._source.body = ctx._source.body + \"${bar}\"",
  "params": {
    "bar": " hi"
  }
}

GET /script/doc/1?_source=body
{}
{"_index":"script","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"script","_type":"doc","_id":"1","_version":2} {"_index":"script","_type":"doc","_id":"1","_version":2,"found":true,"_source":{"body":"foo hi"}}
4.17 Using BM25 or DFR instead of TF-IDF
While TF-IDF does a great job, sometimes people may want to use BM25, which is another nice similarity algorithm. This is an example of setting it up per-field so you can compare the two algorithms.
I did this with a multi-field that indexes the body field with all the different similarities, just so I could compare them all at once. The interesting thing, from talking to Robert about this, is that nothing is actually being changed during indexing; the per-field setting is just there as a safety in case that ever needs to be the case.
I'd like to make the similarity configurable at query time; I think I have a branch for it somewhere…
4.17.1 Create an index
DELETE /sim
{}

POST /sim
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "similarity": {
      "custom_bm25": {
        "type": "BM25",
        // These are the default values
        "k1": 1.2,  // how important term frequency is
        "b": 0.75   // how normalized field length should be
      }
    },
    "analysis": {
      "analyzer": {
        "sim_analyzer": {
          "tokenizer": "standard",
          "filters": ["lowercase", "kstem", "stop"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "fields": {
            "tfidf": {
              "type": "string",
              "similarity": "tfidf"
            },
            "dfr": {
              "type": "string",
              "similarity": "dfr"
            },
            "bm25": {
              "type": "string",
              // "BM25" could be used here to use the default values
              "similarity": "custom_bm25"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.17.2 Index docs
POST /sim/doc/1
{"body": "A quick brown fox jumped over the lazy brown dog"}

POST /sim/doc/2
{"body": "Fast jumping brown spiders"}

POST /sim/doc/3
{"body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders"}

POST /sim/_refresh
{}
{"_index":"sim","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"sim","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"sim","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.17.3 Query
Here I funnel all of the output through jq ".hits.hits[]" to output only the documents that match, with their scores.
First with the traditional TF-IDF:
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.tfidf"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.391954, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.26056325, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.03540124, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
Then with BM25:
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.bm25"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.9845756, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.8407544, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.060791545, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
Finally with DFR (Divergence From Randomness):
POST /sim/_search
{
  "query": {
    "multi_match": {
      "query": "jumping brown dogs",
      "minimum_should_match": "30%",
      "fields": ["body.dfr"]
    }
  }
}
{ "_index": "sim", "_type": "doc", "_id": "2", "_score": 0.391954, "_source": { "body": "Fast jumping brown spiders" } } { "_index": "sim", "_type": "doc", "_id": "3", "_score": 0.26056325, "_source": { "body": "brown dogs jump over lazy spiders that are fast and sneaky. Those silly dogs and spiders" } } { "_index": "sim", "_type": "doc", "_id": "1", "_score": 0.03540124, "_source": { "body": "A quick brown fox jumped over the lazy brown dog" } }
4.18 Get the current time in a Groovy script
In MVEL it used to be easy to call time() to get the current time in a script; however, the Groovy script engine removed this. There is another way to get the current time, though.
4.18.1 Create an index and index a test document
DELETE /groovy
{}

POST /groovy
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "num": {"type": "long"}
      }
    }
  }
}

POST /groovy/doc/1?refresh
{"num": 5}
{"acknowledged":true} {"acknowledged":true} {"_index":"groovy","_type":"doc","_id":"1","_version":1,"created":true}
4.18.2 Update using the DateTime object
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.num = DateTime.now().getMillis()"
  }
}

GET /groovy/doc/1?pretty
{}
{"_index":"groovy","_type":"doc","_id":"1","_version":2} { "_index" : "groovy", "_type" : "doc", "_id" : "1", "_version" : 2, "found" : true, "_source":{"num":1420625502499} }
4.19 Formatting strings in Groovy scripts
Sometimes it can be helpful to format strings using String.format inside of a Groovy script. In addition, Groovy has string interpolation of its own that you can use.
4.19.1 Create an index and index a test document
DELETE /groovy
{}

POST /groovy
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}

POST /groovy/doc/1?refresh
{"body": "foo"}
{"acknowledged":true} {"acknowledged":true} {"_index":"groovy","_type":"doc","_id":"1","_version":1,"created":true}
4.19.2 Update the document using string formatting
This currently doesn't work, because String.format is not in the whitelist. I should probably add it, see <github issue here>.
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.body = String.format(\"%s: %d\", a, b)",
    "params": {
      "a": "bar",
      "b": 5
    }
  }
}

GET /groovy/doc/1
{}

// Not as powerful, because you can't specify things like decimal format, but
// still usable
POST /groovy/doc/1/_update
{
  "script": {
    "script": "ctx._source.body = \"${a}: ${b}\"",
    "params": {
      "a": "bar",
      "b": 5
    }
  }
}

GET /groovy/doc/1
{}
HTTP/1.1 400 Bad Request Content-Type: application/json; charset=UTF-8 Content-Length: 3079 {"error":"ElasticsearchIllegalArgumentException[failed to execute script]; nested: GroovyScriptCompilationException[MultipleCompilationErrorsException[startup failed:\nGeneral error during canonicalization: Method calls not allowed on [java.lang.String]\n\njava.lang.SecurityException: Method calls not allowed on [java.lang.String]\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitMethodCallExpression(SecureASTCustomizer.java:855)\n\tat org.codehaus.groovy.ast.expr.MethodCallExpression.visit(MethodCallExpression.java:64)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitBinaryExpression(SecureASTCustomizer.java:897)\n\tat org.codehaus.groovy.ast.expr.BinaryExpression.visit(BinaryExpression.java:49)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitExpressionStatement(SecureASTCustomizer.java:777)\n\tat org.codehaus.groovy.ast.stmt.ExpressionStatement.visit(ExpressionStatement.java:40)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer$SecuringCodeVisitor.visitBlockStatement(SecureASTCustomizer.java:737)\n\tat org.codehaus.groovy.ast.stmt.BlockStatement.visit(BlockStatement.java:69)\n\tat org.codehaus.groovy.control.customizers.SecureASTCustomizer.call(SecureASTCustomizer.java:552)\n\tat org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1047)\n\tat org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:583)\n\tat org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:561)\n\tat org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:538)\n\tat groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:286)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:259)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:245)\n\tat groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:203)\n\tat org.elasticsearch.script.groovy.GroovyScriptEngineService.compile(GroovyScriptEngineService.java:119)\n\tat org.elasticsearch.script.ScriptService.getCompiledScript(ScriptService.java:353)\n\tat org.elasticsearch.script.ScriptService.compile(ScriptService.java:339)\n\tat org.elasticsearch.script.ScriptService.executable(ScriptService.java:463)\n\tat org.elasticsearch.action.update.UpdateHelper.prepare(UpdateHelper.java:183)\n\tat org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:176)\n\tat org.elasticsearch.action.update.TransportUpdateAction.shardOperation(TransportUpdateAction.java:170)\n\tat org.elasticsearch.action.support.single.instance.TransportInstanceSingleOperationAction$AsyncSingleAction$1.run(TransportInstanceSingleOperationAction.java:187)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat java.lang.Thread.run(Thread.java:745)\n\n1 error\n]]; ","status":400} {"_index":"groovy","_type":"doc","_id":"1","_version":2,"found":true,"_source":{"body":"bar: 5"}} {"_index":"groovy","_type":"doc","_id":"1","_version":3} {"_index":"groovy","_type":"doc","_id":"1","_version":3,"found":true,"_source":{"body":"bar: 5"}}
4.20 Does highlighting work with ngrams?
So, talking to Ryan about removing support for the _analyzer field in the mapping: ngrams are the viable alternative, but do they work with highlighting? Short answer: yes, they do. Don't use _analyzer, as we're probably going to remove it soon.
4.20.1 Create an index
DELETE /ngrams
{}

POST /ngrams
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "my_ngram": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "ngram_tf"]
        }
      },
      "filter": {
        "ngram_tf": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 4
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "_analyzer": {
        "path": "custom_analyzer"
      },
      "properties": {
        "body": {
          "type": "string",
          "fields": {
            "ngram": {
              "type": "string",
              "analyzer": "my_ngram"
            },
            "ngram_postings": {
              "type": "string",
              "analyzer": "my_ngram",
              "index_options": "offsets"
            },
            "ngram_fvh": {
              "type": "string",
              "analyzer": "my_ngram",
              "term_vector": "with_positions_offsets"
            },
            "french_postings": {
              "type": "string",
              "index_options": "offsets"
            },
            "french_fvh": {
              "type": "string",
              "term_vector": "with_positions_offsets"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.20.2 Index docs
POST /ngrams/doc/1?refresh
{
  "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi",
  "custom_analyzer": "french"
}
{"_index":"ngrams","_type":"doc","_id":"1","_version":1,"created":true}
4.20.3 Query with highlighting
POST /ngrams/_search?pretty
{
  "query": {
    "match_phrase": {
      "body": {
        "query": "musée Louvre",
        "analyzer": "french",
        "slop": 1
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {},
      "body.french_postings": {},
      "body.french_fvh": {},
      "body.ngram": {},
      "body.ngram_postings": {},
      "body.ngram_fvh": {}
    }
  }
}

POST /ngrams/_search?pretty
{
  "query": {
    "match_phrase": {
      "body.ngram": {
        "query": "musée Louvre",
        "analyzer": "my_ngram",
        "slop": 1
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {},
      "body.french_postings": {},
      "body.french_fvh": {},
      "body.ngram": {},
      "body.ngram_postings": {},
      "body.ngram_fvh": {}
    }
  }
}
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.16273327, "hits" : [ { "_index" : "ngrams", "_type" : "doc", "_id" : "1", "_score" : 0.16273327, "_source":{ "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi", "custom_analyzer": "french" }, "highlight" : { "body.french_fvh" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body.french_postings" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body.ngram" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ], "body" : [ "Le <em>musée</em> du <em>Louvre</em> est ouvert tous les jours sauf le mardi" ] } } ] } } { "took" : 5, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.9730363, "hits" : [ { "_index" : "ngrams", "_type" : "doc", "_id" : "1", "_score" : 1.9730363, "_source":{ "body": "Le musée du Louvre est ouvert tous les jours sauf le mardi", "custom_analyzer": "french" }, "highlight" : { "body.ngram_fvh" : [ "Le <em>musée</em> du <em>Louvre</em> est <em>ouvert</em> <em>tous</em> les <em>jours</em> sauf le mardi" ], "body.ngram_postings" : [ "Le <em>musée</em> du <em>Louvre</em> est <em>ouvert</em> <em>tous</em> les <em>jours</em> sauf le mardi" ] } } ] } }
4.21 Filter aggregations do not load field data
Using a field in a filter in aggregations does not load field data:
4.21.1 Create an index
DELETE /filteragg
{}

POST /filteragg
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "field1": {"type": "string", "index": "not_analyzed"},
        "field2": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.21.2 Index docs
POST /filteragg/doc/1
{"field1": "foo bar baz", "field2": "foo bar baz"}

POST /filteragg/doc/2
{"field1": "foo eggplant potato", "field2": "foo eggplant potato"}

POST /filteragg/_refresh
{}
{"_index":"filteragg","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"filteragg","_type":"doc","_id":"2","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.21.3 Query
POST /filteragg/_search?pretty
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "one": {
      "aggs": {
        "myterms": {
          "terms": {
            "field": "field1"
          }
        }
      },
      "filter": {
        "query": {
          "query_string": {
            "query": "field2:foo"
          }
        }
      }
    }
  }
}
{ "doc_count": 2, "myterms": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "foo bar baz", "doc_count": 1 }, { "key": "foo eggplant potato", "doc_count": 1 } ] } }
4.22 Determining why a shard will not be allocated
Suppose you create an index but can't figure out why its shards won't allocate. There are a couple of ways to diagnose this, like turning up the logging level, but the reroute API can also give a nice explanation:
4.22.1 Create an index that cannot be allocated
Because the index has 1 replica and this is a single-node cluster, the replica will not be able to be allocated:
DELETE /disktest
{}

POST /disktest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}
{"acknowledged":true} {"acknowledged":true}
4.22.2 Explaining why a shard cannot be allocated
We can see why shards would not be allocated, which is because a replica cannot be allocated on the same node that the primary is allocated on:
POST /_cluster/reroute?dry_run&explain&pretty
{
  "commands": [
    {
      "allocate": {
        "index": "disktest",
        "shard": 0,
        "node": "NiifNsi5QNObqfgG4i2PCA"
      }
    }
  ]
}
{ "acknowledged" : true, "state" : { "version" : 30, "master_node" : "NiifNsi5QNObqfgG4i2PCA", "blocks" : { }, "nodes" : { "NiifNsi5QNObqfgG4i2PCA" : { "name" : "Bobster", "transport_address" : "inet[/192.168.0.4:9300]", "attributes" : { } } }, "routing_table" : { "indices" : { "disktest" : { "shards" : { "0" : [ { "state" : "STARTED", "primary" : true, "node" : "NiifNsi5QNObqfgG4i2PCA", "relocating_node" : null, "shard" : 0, "index" : "disktest" }, { "state" : "UNASSIGNED", "primary" : false, "node" : null, "relocating_node" : null, "shard" : 0, "index" : "disktest" } ] } } } }, "routing_nodes" : { "unassigned" : [ { "state" : "UNASSIGNED", "primary" : false, "node" : null, "relocating_node" : null, "shard" : 0, "index" : "disktest" } ], "nodes" : { "NiifNsi5QNObqfgG4i2PCA" : [ { "state" : "STARTED", "primary" : true, "node" : "NiifNsi5QNObqfgG4i2PCA", "relocating_node" : null, "shard" : 0, "index" : "disktest" } ] } }, "allocations" : [ ] }, "explanations" : [ { "command" : "allocate", "parameters" : { "index" : "disktest", "shard" : 0, "node" : "NiifNsi5QNObqfgG4i2PCA", "allow_primary" : false }, "decisions" : [ { "decider" : "same_shard", "decision" : "NO", "explanation" : "shard cannot be allocated on same node [NiifNsi5QNObqfgG4i2PCA] it already exists on" }, { "decider" : "filter", "decision" : "YES", "explanation" : "node passes include/exclude/require filters" }, { "decider" : "replica_after_primary_active", "decision" : "YES", "explanation" : "primary is already active" }, { "decider" : "throttling", "decision" : "YES", "explanation" : "below shard recovery limit of [2]" }, { "decider" : "enable", "decision" : "YES", "explanation" : "allocation disabling is ignored" }, { "decider" : "disable", "decision" : "YES", "explanation" : "allocation disabling is ignored" }, { "decider" : "awareness", "decision" : "YES", "explanation" : "no allocation awareness enabled" }, { "decider" : "shards_limit", "decision" : "YES", "explanation" : "total shard limit disabled: [-1] <= 0" }, { "decider" : "node_version", "decision" : "YES", "explanation" : "target node version [1.4.3] is same or newer than source node version [1.4.3]" }, { "decider" : "disk_threshold", "decision" : "YES", "explanation" : "only a single node is present" }, { "decider" : "snapshot_in_progress", "decision" : "YES", "explanation" : "shard not primary or relocation disabled" } ] } ] }
4.23 Returning the scores of matching documents in a scroll request
Sometimes you may want to issue a scroll but still return the actual score for each document. The Scroll API provides a way to do that using track_scores.
4.23.1 Create an index
DELETE /sctest
{}

POST /sctest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.23.2 Index some documents
POST /sctest/doc/1
{"body": "foo"}

POST /sctest/doc/2
{"body": "foo bar foo baz"}

POST /sctest/doc/3
{"body": "fooaloo"}

POST /sctest/doc/4?refresh
{"body": "foo foo foo foo foo"}
{"_index":"sctest","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"sctest","_type":"doc","_id":"4","_version":1,"created":true}
4.23.3 Query
Instead of a regular query, we will perform a scan/scroll query over all of the results. I use track_scores: true here because without it Elasticsearch will not compute the score of each result.
POST /sctest/_search?scroll=1m&search_type=scan&pretty
{
  "query": {
    "match": {
      "body": "foo"
    }
  },
  "track_scores": true
}
{ "_scroll_id" : "c2NhbjsxOzM2OmxQNEU4Mi1JVGphME1vbGR1SkJSMGc7MTt0b3RhbF9oaXRzOjM7", "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.0, "hits" : [ ] } }
Then, using the scroll id from the previous response, we can see the documents and their scores (which would be 0.0 if track_scores were not set):
curl -XGET 'localhost:9200/_search/scroll?scroll=1m&pretty' -d'c2NhbjsxOzM2OmxQNEU4Mi1JVGphME1vbGR1SkJSMGc7MTt0b3RhbF9oaXRzOjM7'
{ "_scroll_id" : "c2NhbjswOzE7dG90YWxfaGl0czozOw==", "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : 0.0, "hits" : [ { "_index" : "sctest", "_type" : "doc", "_id" : "1", "_score" : 1.0, "_source":{"body": "foo"} }, { "_index" : "sctest", "_type" : "doc", "_id" : "2", "_score" : 0.70710677, "_source":{"body": "foo bar foo baz"} }, { "_index" : "sctest", "_type" : "doc", "_id" : "4", "_score" : 0.97827977, "_source":{"body": "foo foo foo foo foo"} } ] } }
4.24 Inner hits example
Here's an example of a nested doc type with inner hits, used to retrieve the inner document that matched the nested query instead of the entire surrounding document.
4.24.1 Create an index
DELETE /inner
{}

POST /inner
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "task": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "string"
            }
          }
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.24.2 Index docs
POST /inner/doc/1
{
  "task": [
    { "name": "foo" },
    { "name": "bar" }
  ]
}

POST /inner/_refresh
{}
{"_index":"inner","_type":"doc","_id":"1","_version":1,"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.24.3 Query
POST /inner/_search?pretty
{
  "query": {
    "nested": {
      "path": "task",
      "query": {
        "match": {
          "name": "foo"
        }
      },
      "inner_hits": {}
    }
  }
}
{ "took" : 7, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.4054651, "hits" : [ { "_index" : "inner", "_type" : "doc", "_id" : "1", "_score" : 1.4054651, "_source":{ "task": [ { "name": "foo" }, { "name": "bar" } ] }, "inner_hits" : { "task" : { "hits" : { "total" : 1, "max_score" : 1.4054651, "hits" : [ { "_index" : "inner", "_type" : "doc", "_id" : "1", "_nested" : { "field" : "task", "offset" : 0 }, "_score" : 1.4054651, "_source":{"name":"foo"} } ] } } } } ] } }
4.25 Does setting an analyzer and not_analyzed make ES unhappy?
ES doesn't care that you said both "not_analyzed" and "standard analyzer":
DELETE /analyzer-test
{}

POST /analyzer-test
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "analyzer": "standard"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.26 Removing norms from the _all field dynamically
So, technically you should be able to remove norms from the _all field dynamically.
4.26.1 Create an index
DELETE /ntest
{}

POST /ntest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "_all": {
        "enabled": true
      },
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.26.2 Index docs
POST /ntest/doc/1?refresh
{"body": "foo bar baz"}

POST /ntest/doc/2?refresh
{"body": "bar baz eggplant"}

POST /ntest/doc/3?refresh
{"body": "baz"}

POST /ntest/doc/4?refresh
{"body": "eggplant"}
{"_index":"ntest","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"ntest","_type":"doc","_id":"4","_version":1,"created":true}
Check the segments:
GET /ntest/_segments?pretty
{}
{ "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "indices" : { "ntest" : { "shards" : { "0" : [ { "routing" : { "state" : "STARTED", "primary" : true, "node" : "LiJ8aXvGSRuNhwcT5Pflgg" }, "num_committed_segments" : 0, "num_search_segments" : 4, "segments" : { "_0" : { "generation" : 0, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2337, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_1" : { "generation" : 1, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2362, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_2" : { "generation" : 2, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2289, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true }, "_3" : { "generation" : 3, "num_docs" : 1, "deleted_docs" : 0, "size_in_bytes" : 2324, "memory_in_bytes" : 3298, "committed" : false, "search" : true, "version" : "4.10.4", "compound" : true } } } ] } } } }
4.26.3 Query
GET /ntest/_search?pretty
{
  "query": {
    "match": {
      "_all": "eggplant"
    }
  },
  "explain": true
}
[ { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 1 } ], "description": "fieldWeight in 0, product of:", "value": 1.287682 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 1.287682 }, "_source": { "body": "eggplant" }, "_score": 1.287682, "_id": "4", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 }, { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 0.5 } ], "description": "fieldWeight in 0, product of:", "value": 0.643841 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 0.643841 }, "_source": { "body": "bar baz eggplant" }, "_score": 0.643841, "_id": "2", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 } ]
4.26.4 Update the norms mapping
PUT /ntest/_mapping/doc
{
  "_all": {
    "enabled": true,
    "norms": {
      "enabled": false
    }
  }
}

GET /ntest/_mapping?pretty
{}
{"acknowledged":true} { "ntest" : { "mappings" : { "doc" : { "_all" : { "enabled" : true, "omit_norms" : true }, "properties" : { "body" : { "type" : "string" } } } } } }
Then force merge:
POST /ntest/_optimize?max_num_segments=1
{}
{"_shards":{"total":1,"successful":1,"failed":0}}
4.26.5 Search again
GET /ntest/_search?pretty
{
  "query": {
    "match": {
      "_all": "eggplant"
    }
  },
  "explain": true
}
[ { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=2)", "value": 1 } ], "description": "fieldWeight in 2, product of:", "value": 1.287682 } ], "description": "weight(_all:eggplant in 2) [PerFieldSimilarity], result of:", "value": 1.287682 }, "_source": { "body": "eggplant" }, "_score": 1.287682, "_id": "4", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 }, { "_explanation": { "details": [ { "details": [ { "details": [ { "description": "termFreq=1.0", "value": 1 } ], "description": "tf(freq=1.0), with freq of:", "value": 1 }, { "description": "idf(docFreq=2, maxDocs=4)", "value": 1.287682 }, { "description": "fieldNorm(doc=0)", "value": 0.5 } ], "description": "fieldWeight in 0, product of:", "value": 0.643841 } ], "description": "weight(_all:eggplant in 0) [PerFieldSimilarity], result of:", "value": 0.643841 }, "_source": { "body": "bar baz eggplant" }, "_score": 0.643841, "_id": "2", "_type": "doc", "_index": "ntest", "_node": "LiJ8aXvGSRuNhwcT5Pflgg", "_shard": 0 } ]
4.27 Combining scores from BM25 and TF-IDF indices
With Lucene 6.0, BM25 will be the default similarity, so I'm curious how the scores will combine between older (TF-IDF) indices versus newer BM25 indices.
4.27.1 Create a couple of indices
DELETE /tfidf,bm25
{}

POST /tfidf
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string"
        }
      }
    }
  }
}

POST /bm25
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "similarity": "BM25"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true} {"acknowledged":true}
4.27.2 Index the same documents into each index
POST /tfidf/doc/1
{"body": "foo"}

POST /tfidf/doc/2
{"body": "foo bar"}

POST /tfidf/doc/3
{"body": "foo bar baz"}

POST /bm25/doc/1
{"body": "foo"}

POST /bm25/doc/2
{"body": "foo bar"}

POST /bm25/doc/3
{"body": "foo bar baz"}

POST /tfidf,bm25/_refresh
{}
{"_index":"tfidf","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"tfidf","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"tfidf","_type":"doc","_id":"3","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"1","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"2","_version":1,"created":true} {"_index":"bm25","_type":"doc","_id":"3","_version":1,"created":true} {"_shards":{"total":2,"successful":2,"failed":0}}
4.27.3 Perform the query
Scores are NOT normalized between the BM25 and TF-IDF indices; each index scores hits with its own similarity, and the results are merged as-is.
POST /tfidf,bm25/_search?pretty
{
  "query": {
    "match": {
      "body": "foo"
    }
  }
  // , "explain": true
}
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "hits" : { "total" : 6, "max_score" : 0.71231794, "hits" : [ { "_index" : "tfidf", "_type" : "doc", "_id" : "1", "_score" : 0.71231794, "_source":{"body": "foo"} }, { "_index" : "tfidf", "_type" : "doc", "_id" : "2", "_score" : 0.4451987, "_source":{"body": "foo bar"} }, { "_index" : "tfidf", "_type" : "doc", "_id" : "3", "_score" : 0.35615897, "_source":{"body": "foo bar baz"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "1", "_score" : 0.16786803, "_source":{"body": "foo"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "2", "_score" : 0.11980793, "_source":{"body": "foo bar"} }, { "_index" : "bm25", "_type" : "doc", "_id" : "3", "_score" : 0.09476421, "_source":{"body": "foo bar baz"} } ] } }
4.28 Searching with a slop phrase has a higher score for adjacent terms
Basically, if two documents both match a sloppy phrase query, the one whose terms are adjacent (needing no slop) should score higher than the one that only matches because of the slop.
4.28.1 Create an index
DELETE /sloptest
{}

POST /sloptest
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.28.2 Index docs
// "foo" and "baz" need a slop of 1 to match
POST /sloptest/doc/1
{"body": "foo bar baz"}

// "foo" and "baz" are adjacent, a slop of 0 matches
POST /sloptest/doc/2
{"body": "foo baz bar"}

POST /sloptest/_refresh
{}
{"_index":"sloptest","_type":"doc","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"sloptest","_type":"doc","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.28.3 Query
Notice the score difference:
POST /sloptest/_search?pretty
{
  "query": {
    "match": {
      "body": {
        "type": "phrase",
        "query": "foo baz",
        "slop": 1
      }
    }
  }
}
[ { "_source": { "body": "foo baz bar" }, "_score": 0.5945348, "_id": "2", "_type": "doc", "_index": "sloptest" }, { "_source": { "body": "foo bar baz" }, "_score": 0.4203996, "_id": "1", "_type": "doc", "_index": "sloptest" } ]
4.29 Circular parent-child references from Grandparent to Grandchild
Someone asked at a training whether it was possible to have circular parent-child relationships. With a single parent/child relationship this is expressly disabled, but if you go three levels deep…
4.29.1 Create an index with a grandparent, parent, and child
DELETE /test
{}

POST /test
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "foo": {
      "_parent": {
        "type": "baz"
      },
      "properties": {
        "body": {"type": "string"}
      }
    },
    "bar": {
      "_parent": {
        "type": "foo"
      },
      "properties": {
        "body": {"type": "string"}
      }
    },
    "baz": {
      "_parent": {
        "type": "bar"
      },
      "properties": {
        "body": {"type": "string"}
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.29.2 Index docs
Index two documents per type, each pointing at its parent. Note that every document in a chain shares the same id ("1" or "2"), so parent-based routing happens to land a whole chain on the same shard (and this test index only has one shard anyway); with realistic ids, multi-generation relationships need explicit routing to keep a family together.
POST /test/foo/1?parent=1
{"body": "cat"}
POST /test/foo/2?parent=2
{"body": "dog"}

POST /test/bar/1?parent=1
{"body": "pig"}
POST /test/bar/2?parent=2
{"body": "llama"}

POST /test/baz/1?parent=1
{"body": "duck"}
POST /test/baz/2?parent=2
{"body": "emu"}

POST /test/_refresh
{}
{"_index":"test","_type":"foo","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"foo","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"bar","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"bar","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"baz","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"test","_type":"baz","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_shards":{"total":1,"successful":1,"failed":0}}
4.29.3 Do some really complicated parent/child querying
So the circular parent/child reference actually works. I don't recommend doing this in practice, though.
GET /test/foo/_search?pretty
{
  "query": {
    "bool": {
      "must": [
        {"match": {"body": "cat"}},
        {
          "has_child": {
            "type": "bar",
            "query": {
              "bool": {
                "must": [
                  {"match": {"body": "pig"}},
                  {
                    "has_child": {
                      "type": "baz",
                      "query": {
                        "bool": {
                          "must": [
                            {"match": {"body": "duck"}},
                            {
                              "has_child": {
                                "type": "foo",
                                "query": {
                                  "match": {"body": "cat"}
                                }
                              }
                            }
                          ]
                        }
                      }
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 2.3246877, "hits" : [ { "_index" : "test", "_type" : "foo", "_id" : "1", "_score" : 2.3246877, "_routing" : "1", "_parent" : "1", "_source":{"body": "cat"} } ] } }
4.30 Geo distance sorting
4.30.1 Create an index with a mapping that uses a geo_point field
DELETE /myindex
{}

POST /myindex
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}
{"acknowledged":true} {"acknowledged":true}
4.30.2 Index a few documents with geo_point locations
POST /myindex/doc/1
{
  "body": "mexican food",
  "location": {
    "lat": 41.12,
    "lon": -71.34
  }
}

POST /myindex/doc/2
{
  "body": "chinese food",
  "location": {
    "lat": 39.01,
    "lon": -75.00
  }
}

POST /myindex/doc/3
{
  "body": "dutch food",
  "location": {
    "lat": 25.12,
    "lon": -31.00
  }
}
{"_index":"myindex","_type":"doc","_id":"1","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"myindex","_type":"doc","_id":"2","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true} {"_index":"myindex","_type":"doc","_id":"3","_version":1,"_shards":{"total":1,"successful":1,"failed":0},"created":true}
4.30.3 Perform the query with geo_distance sorting
POST /myindex/_search?pretty
{
  "query": {
    "match": {
      "body": "food"
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": {
          "lat": 40,
          "lon": -70
        },
        "order": "asc",
        "unit": "km"
      }
    },
    "_score"
  ]
}
{ "took" : 27, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 3, "max_score" : null, "hits" : [ { "_index" : "myindex", "_type" : "doc", "_id" : "1", "_score" : 0.4451987, "_source" : { "body" : "mexican food", "location" : { "lat" : 41.12, "lon" : -71.34 } }, "sort" : [ 168.24429169579741, 0.4451987 ] }, { "_index" : "myindex", "_type" : "doc", "_id" : "2", "_score" : 0.4451987, "_source" : { "body" : "chinese food", "location" : { "lat" : 39.01, "lon" : -75.0 } }, "sort" : [ 442.7024334265092, 0.4451987 ] }, { "_index" : "myindex", "_type" : "doc", "_id" : "3", "_score" : 0.4451987, "_source" : { "body" : "dutch food", "location" : { "lat" : 25.12, "lon" : -31.0 } }, "sort" : [ 3972.3297497833664, 0.4451987 ] } ] } }
5 Logstash Examples
A miscellaneous gathering of Logstash configurations that do various things.
5.1 Split log files into separate files by time or node
When a customer hands over 10 different Elasticsearch log files, it can be useful to merge them into combined multi-node logs separated by hour, or to take one huge log and split it into node-specific files.
This Logstash config does that; I find it quite useful for correlating the timing of events across multiple nodes.
input {
  stdin {}
}

filter {
  multiline {
    # a message starting with [ must be the start of the next event; works until 2099
    pattern => "^\[20"
    negate => "true"
    what => "previous"
  }

  grok {
    # also do inline trimming
    match => [ "message", "\[\s*%{DATA:date}\s*\]\[\s*%{DATA:loglevel}\s*\]\[%{DATA:class}\s*\] %{GREEDYDATA:logline}" ]
  }

  date {
    match => [ "date", "YYYY-MM-dd HH:mm:ss,SSS" ]
    timezone => "UTC"
  }

  # a logline starting with "[" might be a node name
  if [logline] =~ /^\[/ {
    grok {
      match => [ "logline", "\[%{DATA:node}\] %{GREEDYDATA}" ]
    }
  }
}

output {
  stdout { codec => dots }

  # Output events as hourly log files
  file {
    path => "es-%{+YYYY-MM-dd.HH}:00.log"
    message_format => "%{message}"
  }

  # Split a log by nodes
  # if [node] {
  #   file {
  #     path => "es-%{node}-%{+YYYY-MM-dd}.log"
  #     message_format => "%{message}"
  #   }
  # } else {
  #   file {
  #     path => "es-NONE-%{+YYYY-MM-dd}.log"
  #     message_format => "%{message}"
  #   }
  # }
}
You can switch between dumping everything into hourly files and splitting by node by commenting and uncommenting the different outputs; the node-split version is shown uncommented just below.
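For reference, this is the node-splitting output from the comments above, uncommented; it is the same stanza, nothing new:

output {
  stdout { codec => dots }

  # Split a log into per-node daily files
  if [node] {
    file {
      path => "es-%{node}-%{+YYYY-MM-dd}.log"
      message_format => "%{message}"
    }
  } else {
    # events where no node name could be extracted
    file {
      path => "es-NONE-%{+YYYY-MM-dd}.log"
      message_format => "%{message}"
    }
  }
}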
It can then be used like the following:
$ ls *.log
log1.log log2.log log3.log
$ cat *.log | bin/logstash -f split.conf
Where split.conf is the configuration file above. It will produce new log files.
5.2 Capture edits of Wikipedia pages from IRC
Inputs don't always have to be files.
input {
  irc {
    type => 'wikipedia'
    host => 'irc.wikimedia.org'
    nick => 'logstash-wikipedia'
    # change this to whatever you want... de.wikipedia for the German wikipedia, etc.
    channels => ['#en.wikipedia']
  }
}

filter {
  # remove some weird color-control encoding stuff from IRC
  mutate {
    gsub => [
      "message", "\u000302", "",
      "message", "\u000303", "",
      "message", "\u000307", "",
      "message", "\u000310", "",
      "message", "\u000314", "",
      "message", "\u00034", "",
      "message", "\u00035", "",
      "message", "\u0003", ""
    ]
  }

  # extract the page and the user
  grok {
    match => [ "message", "\[\[%{GREEDYDATA:page}\]\]%{GREEDYDATA} \* %{GREEDYDATA:user} \* %{GREEDYDATA}" ]
  }
}

output {
  stdout { codec => 'rubydebug' }

  elasticsearch {
    protocol => 'http'
    host => 'localhost'
    index => 'wikipedia-edits'
  }
}
Run with bin/logstash -f wiki-edits.conf and watch the edits roll in!
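Once a few edits have been indexed, a quick sanity check is to search the wikipedia-edits index; the page value here is just an example, use whatever pages are actually being edited:

GET /wikipedia-edits/_search?pretty
{
  "query": {
    "match": {
      "page": "Elasticsearch"
    }
  }
}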