Shadow Replica Demo

Shadow replica demo

To follow along, you need the following settings in your elasticsearch.yml. Then start a 3-node cluster listening on localhost:9200.

# don't append 0/1/2 to the data path for ES instances on the same machine
node.add_id_to_custom_path: false
# enable use of the `index.data_path` setting
node.enable_custom_paths: true

Don't forget to start with a clean slate: clean up the regular data directory as well as the /tmp/foo directory, since that is where the new index will be stored.

rm -rf ~/src/elasticsearch/data /tmp/foo
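
With those settings in place and the directories cleaned, starting the three local nodes could look something like this (a sketch assuming a local Elasticsearch 1.x tarball install; paths and flags may differ in your setup):

# start three Elasticsearch nodes from the same install; each one picks the
# next free HTTP port (9200, 9201, 9202)
bin/elasticsearch -d
bin/elasticsearch -d
bin/elasticsearch -d

# check that all three nodes have joined the cluster
curl -s 'localhost:9200/_cat/nodes?v'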

Index creation

First, create an index with a custom data path and shadow_replicas: true:

// Make sure we have 3 nodes
GET /_cluster/health?wait_for_nodes=3
{}

POST /myindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 2,
    "data_path": "/tmp/foo",
    "shadow_replicas": true
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string"
        }
      }
    }
  }
}

GET /_cluster/health?wait_for_status=green
{}
{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}
{"acknowledged":true}
{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":3,"active_primary_shards":1,"active_shards":3,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0}

Shards show up just like regular shards:

GET /_cat/shards
{}
index   shard prirep state   docs store ip            node      
myindex 0     p      STARTED    0   85b 192.168.43.91 Lynx      
myindex 0     r      STARTED    0   85b 192.168.43.91 Warwolves 
myindex 0     r      STARTED    0   85b 192.168.43.91 Mop Man

Let's index a few documents and see how they show up on the various primary and replica shards.

Indexing some documents

Index two documents, flush, then index two more documents:

POST /myindex/doc
{"body": "This is document 1"}

POST /myindex/doc
{"body": "I'm document number 2"}

// Flush after indexing the first two documents
POST /myindex/_flush
{}

POST /myindex/doc
{"body": "Document 3 here"}

POST /myindex/doc
{"body": "Finally document 4"}

// Force a refresh
POST /myindex/_refresh
{}

// show the status of the shards
GET /_cat/shards
{}
{"_index":"myindex","_type":"doc","_id":"AUudq2kUeqHjI4Syvcj_","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUudq2lyeqHjI4SyvckA","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_shards":{"total":3,"successful":3,"failed":0}}
{"_index":"myindex","_type":"doc","_id":"AUudq2n2eqHjI4SyvckB","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUudq2oZeqHjI4SyvckC","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_shards":{"total":3,"successful":3,"failed":0}}
index   shard prirep state   docs store ip            node      
myindex 0     p      STARTED    4 5.8kb 192.168.43.91 Lynx      
myindex 0     r      STARTED    2 5.8kb 192.168.43.91 Warwolves 
myindex 0     r      STARTED    2 5.8kb 192.168.43.91 Mop Man

Looking at segments

Notice how the replicas only have the first two documents, because the segments created when indexing the 3rd and 4th documents have not been fsynced and committed yet. We can confirm this by checking the /myindex/_segments info:

GET /myindex/_segments?pretty
{}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  },
  "_1": {
    "generation": 1,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2910,
    "memory_in_bytes": 2388,
    "committed": false,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}

The shard on one node (the primary) has both the _0 and _1 segments, of which only _0 is committed, whereas the two nodes with the replicas show only the committed _0 segment.
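
Another way to see the same discrepancy (not part of the original demo) is to query the primary and a replica directly with the search preference parameter; the hit totals should differ until the next flush:

# hit count from the primary copy (should report 4 documents)
curl -s 'localhost:9200/myindex/_search?preference=_primary&size=0&pretty'
# hit count from a replica copy (should still report 2; _replica is a valid
# preference value in the 1.x versions this demo targets)
curl -s 'localhost:9200/myindex/_search?preference=_replica&size=0&pretty'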

Flushing to make documents visible

Now we can flush again to make the other two documents visible. Note that I refresh manually here because the replica might finish flushing and refreshing before the primary does, in which case the new segments wouldn't yet be visible.

POST /myindex/_flush
{}

POST /myindex/_refresh
{}

GET /_cat/shards
{}
{"_shards":{"total":3,"successful":3,"failed":0}}
{"_shards":{"total":3,"successful":3,"failed":0}}
index   shard prirep state   docs store ip            node      
myindex 0     p      STARTED    4 5.9kb 192.168.43.91 Lynx      
myindex 0     r      STARTED    4 5.9kb 192.168.43.91 Warwolves 
myindex 0     r      STARTED    4 5.9kb 192.168.43.91 Mop Man

Now we see all the segments show up on all nodes:

GET /myindex/_segments?pretty
{}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  },
  "_1": {
    "generation": 1,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2910,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  },
  "_1": {
    "generation": 1,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2910,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}
{
  "_0": {
    "generation": 0,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2949,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  },
  "_1": {
    "generation": 1,
    "num_docs": 2,
    "deleted_docs": 0,
    "size_in_bytes": 2910,
    "memory_in_bytes": 2388,
    "committed": true,
    "search": true,
    "version": "5.1.0",
    "compound": true
  }
}

Failover when a node dies

Before stopping any node, let's index a couple more documents that will exist only in the translog:

POST /myindex/doc
{"body": "document number 5 in the translog"}

POST /myindex/doc
{"body": "another document (number 6) in the translog"}

POST /myindex/_refresh
{}

GET /_cat/shards
{}
{"_index":"myindex","_type":"doc","_id":"AUudq9PVeqHjI4SyvckD","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUudq9P2eqHjI4SyvckE","_version":1,"_shards":{"total":3,"successful":3,"failed":0},"created":true}
{"_shards":{"total":3,"successful":3,"failed":0}}
index   shard prirep state   docs store ip            node      
myindex 0     p      STARTED    6 8.8kb 192.168.43.91 Lynx      
myindex 0     r      STARTED    4 8.8kb 192.168.43.91 Warwolves 
myindex 0     r      STARTED    4 8.8kb 192.168.43.91 Mop Man

Note that there is no flush after indexing these documents.
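
If you want to verify that those two documents are only sitting in the translog at this point (again, not part of the original demo), the translog section of the index stats shows the pending operations:

# number of operations in the translog that have not yet been committed to a Lucene segment
curl -s 'localhost:9200/myindex/_stats/translog?pretty'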

<< kill an ES node >>

With the node that held the primary gone, one of the shadow replicas has been promoted to primary and has replayed the translog from the shared filesystem:

GET /_cat/shards
{}
index   shard prirep state      docs store ip            node      
myindex 0     r      STARTED       6 8.9kb 192.168.43.91 Warwolves 
myindex 0     p      STARTED       6 8.9kb 192.168.43.91 Mop Man   
myindex 0     r      UNASSIGNED
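
You can also peek at how the replica was promoted and the translog replayed (not part of the original demo) via the recovery information; the exact columns vary by version:

# per-shard recovery details for the index, including translog replay where the version reports it
curl -s 'localhost:9200/_cat/recovery/myindex?v'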

And all of the documents are there:

GET /myindex/_search?pretty
{
  "query": {
    "match_all": {}
  }
}
"This is document 1"
"I'm document number 2"
"Document 3 here"
"Finally document 4"
"document number 5 in the translog"
"another document (number 6) in the translog"

Files in the data directories

Finally, we can see where the data for the index is stored:

tree ~/src/elasticsearch/data

echo "--------------"

tree /tmp/foo
/Users/hinmanm/src/elasticsearch/data
└── elasticsearch
    └── nodes
        ├── 0
        │   ├── _state
        │   │   └── global-1.st
        │   ├── indices
        │   │   └── myindex
        │   │       ├── 0
        │   │       │   └── _state
        │   │       │       └── state-11.st
        │   │       └── _state
        │   │           └── state-1.st
        │   └── node.lock
        ├── 1
        │   ├── _state
        │   │   └── global-1.st
        │   ├── indices
        │   │   └── myindex
        │   │       ├── 0
        │   │       │   └── _state
        │   │       │       └── state-11.st
        │   │       └── _state
        │   │           └── state-1.st
        │   └── node.lock
        └── 2
            ├── _state
            │   └── global-1.st
            ├── indices
            │   └── myindex
            │       ├── 0
            │       │   └── _state
            │       │       └── state-4.st
            │       └── _state
            │           └── state-1.st
            └── node.lock

23 directories, 12 files
--------------
/tmp/foo
└── myindex
    └── 0
        ├── index
        │   ├── _0.cfe
        │   ├── _0.cfs
        │   ├── _0.si
        │   ├── _1.cfe
        │   ├── _1.cfs
        │   ├── _1.si
        │   ├── _2.cfe
        │   ├── _2.cfs
        │   ├── _2.si
        │   ├── segments_4
        │   └── write.lock
        └── translog

4 directories, 11 files

Author: Lee Hinman

Created: 2015-02-18 Wed 10:11

Emacs 24.4.1 (Org mode 8.2.10)
