Segment-based replication (aka Shadow Replicas)

Table 1: Clock summary at [2015-02-18 Wed 10:39], for February 2015.
Headline Time    
Total time 3d 8:23    
NEEDSREVIEW [#A] Shadow Engine [30/30] [100%] 3d 8:23    
  Tasks   3:07  
   DONE don't send docs to the replica when using shared FS     2:38
   DONE make ShadowEngine throw exceptions instead of no-ops     0:29

Overview

Taken from: https://github.com/elasticsearch/elasticsearch/issues/8976

Today, Elasticsearch replicates changes at the document level. When a document is indexed/updated on the primary shard, the new version of the document is sent to the replica to be indexed there as well. Document-level replication is what provides distributed near real-time search. Not all customers need (or want) this, as the process of indexing every document twice (primary & replica) has a cost.

Instead, we would like to support segment-based replication as an alternative. With segment-based replication, indexing happens only on the primary shard. Instead of the replica shards that we have today, we can have shadow_replicas. When the primary shard flushes (or merges), it would push the new segments and commit point out to the shadow replicas. Shadow replicas would then reopen the searcher to expose the new segments to search. The primary would flush more often than we do today to reduce the lag in replication.
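
At the Lucene level, the core of this is just: commit on the primary, then reopen the reader on the shadow replica to pick up the committed segments. A minimal sketch of that flow (the shared directory and path are only illustrative; this is not Elasticsearch code):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SegmentReplicationSketch {
    public static void main(String[] args) throws Exception {
        // "Primary": the only shard that actually indexes documents.
        Directory dir = FSDirectory.open(Paths.get("/tmp/shared-shard"));
        IndexWriter primary = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        primary.commit(); // make sure a commit point exists before the replica opens a reader

        // "Shadow replica": a read-only view over the committed segments
        // (a shared directory stands in for copying segments over the network).
        DirectoryReader replica = DirectoryReader.open(dir);

        Document doc = new Document();
        doc.add(new TextField("body", "foo bar", Field.Store.YES));
        primary.addDocument(doc);
        primary.commit(); // the "flush" that publishes new segments

        // The replica only sees the document once it reopens on the new commit point.
        DirectoryReader reopened = DirectoryReader.openIfChanged(replica);
        if (reopened != null) {
            replica.close();
            replica = reopened;
        }
        System.out.println("docs visible on replica: " + replica.numDocs()); // 1

        replica.close();
        primary.close();
        dir.close();
    }
}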

Long-term, it would be nice to support both regular replicas (for failover) and shadow replicas (for search-only)

Advantages

  • Copying segments across is lower cost than indexing each document twice
  • Recovery is faster. Because the segments on the primary and replica don't diverge the way they do with document-based replication, only new segments need to be copied over.

Disadvantages

  • Search and document retrieval are no longer real time. Changes are only exposed to shadow_replicas when the primary flushes, instead of on every refresh.
  • If the primary fails, documents in the transaction log will be lost unless using shared storage
  • Any file corruption might be copied over to the shadow_replica (or if we fail the primary because of corruption, there is no alternative source of data)
  • Increased network I/O, as we would need to copy the same data over multiple times when small segments are merged into bigger segments. Copying 5GB segments across the network could introduce a significant amount of lag in the replicas.
  • Potential disadvantage - More frequent flushing on the primary could hurt indexing speed

Development Phases

Phase 1 - Add support for shadow_replicas without normal replicas

(Limited to one node per server/VM)

The first development phase will only add support for indices whose replicas are either all regular or all shadow_replicas, meaning all or none of an index's replicas will be shadow_replicas. We will likely have to introduce the notion of a shadow_replica in the cluster's internal data structures, i.e. a flag on the ShardRouting, so that we won't have to change the wire protocol again in the next iteration. At a very high level, this first iteration will be implemented as a read-only engine that periodically, or on flush, refreshes its local Lucene index.
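
Reduced to the Lucene pieces, that read-only engine has roughly this shape: no IndexWriter at all, a SearcherManager over the shard directory, and a "refresh" that is nothing more than a reader reopen. This is only a sketch of the shape, not the actual ShadowEngine:

import java.io.Closeable;
import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;

// Sketch of a read-only "shadow" engine: it can only search and reopen.
public class ReadOnlyEngineSketch implements Closeable {

    private final SearcherManager searcherManager;

    public ReadOnlyEngineSketch(Directory shardDirectory) throws IOException {
        // no IndexWriter is ever opened here; this requires that at least
        // one commit point already exists in the directory
        this.searcherManager = new SearcherManager(shardDirectory, null);
    }

    // called periodically, or when the primary signals that it flushed
    public void refresh() throws IOException {
        searcherManager.maybeRefresh(); // reopens the reader on the latest commit
    }

    public IndexSearcher acquireSearcher() throws IOException {
        return searcherManager.acquire();
    }

    public void releaseSearcher(IndexSearcher searcher) throws IOException {
        searcherManager.release(searcher);
    }

    // write operations are simply not supported on a shadow replica
    public void index(Object operation) {
        throw new UnsupportedOperationException("shadow replicas are read-only");
    }

    @Override
    public void close() throws IOException {
        searcherManager.close();
    }
}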

On the indexing side we "simply" don't forward the documents to the replicas and return immediately once the primary has successfully indexed them.

To catch up with the primary, a shadow replica has to somehow go back to the current primary and sync up its local Lucene segments. Since we already have this logic implemented and used during recovery, large parts of it should be reusable. (See RecoveryTarget / RecoverySource)

The biggest downside of this approach is the potential data loss if we lose the primary, since no documents have been replicated since the last flush.

Potential features / pitfalls

  • Something we might need to make this useful is an allocation decider that allows only shadow_replicas to be allocated on certain nodes (see the sketch after this list)
  • Mappings might be slightly behind if we don't index on certain nodes
  • Testing might be tricky here - we might be able to randomly swap in stuff like this and trigger flushes instead of refreshes transparently. I'd love to implement this in a way that is entirely transparent, i.e. a refresh on a shadow_replica index means a full flush + sync, even if we have to make this optional and only enabled in tests.
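
For the allocation decider idea above, a rough sketch built on the canAllocate / Decision hooks that AllocationDecider already exposes. The shadow_only node attribute is made up for the example, and a real version would also consult the new shadow flag on ShardRouting once it exists:

import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.settings.Settings;

// Sketch only: keep primaries off nodes that are tagged as search-only shadow nodes,
// so those nodes end up holding nothing but (shadow) replicas.
public class ShadowReplicasAllocationDecider extends AllocationDecider {

    @Inject
    public ShadowReplicasAllocationDecider(Settings settings) {
        super(settings);
    }

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        // "shadow_only" is a hypothetical node attribute (e.g. node.shadow_only: true)
        boolean shadowOnlyNode = "true".equals(node.node().attributes().get("shadow_only"));
        if (shadowOnlyNode && shardRouting.primary()) {
            return Decision.NO; // never put a primary on a search-only node
        }
        return Decision.YES;
    }
}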

CANCELLED Phase 2 - Add support for shadow_replicas WITH real replicas

  • State "CANCELLED" from "" [2015-02-02 Mon 15:47]
    Not actually planning on doing this

This second phase aims to support full-fledged replicas together with shadow_replicas mixed in the same index. This would help users prevent data loss if primaries die, while still having read-only nodes in the cluster that are not used for indexing.

Development

This is my section to write down all the rabbit-holes I look through trying to figure out where to implement this.

Questions

  • Can flush time be configured on a per-index basis? YES, it can be a global setting or per-index
  • How to kick off recovery manually?

Might be able to do this with RecoveryTarget.startRecovery()

  • How to inhibit sending indexing requests to the replica shards?

There is an ignoreReplicas method in TransportShardReplicationOperationAction; while it is hard-coded to false right now, it could be changed to be a per-index setting.
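
If we go that route, the check itself is tiny; a sketch, where index.shadow_replicas is just the obvious candidate name for the per-index setting:

import org.elasticsearch.common.settings.Settings;

// Sketch: derive the currently hard-coded ignoreReplicas() answer from a
// per-index setting instead of always returning false.
public class IgnoreReplicasSetting {
    public static boolean ignoreReplicas(Settings indexSettings) {
        return indexSettings.getAsBoolean("index.shadow_replicas", false);
    }
}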

DONE Per-index paths

Igor had the great idea of continuing to store metadata in a regular index directory (in path.data) and using the settings loaded from that directory to determine which distributor to use for the index; the per-path settings could then be implemented as a distributor.

The other option I could think of was custom metadata in the cluster state that told Elasticsearch where to go looking for other indices when doing the initial import of data.

Not sure which one to go with…

We need to be able to pass a new directory to FsDirectoryService.newFSDirectory, which is called by FsDirectoryService.build, which in turn calls FsIndexStore.shardIndexLocations; that last method uses the NodeEnvironment to determine where to put the data.

It looks like FsIndexStore has access to the index settings; maybe it can maintain a map there and use that to determine where to look for the data?
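
Purely for illustration (none of this is an Elasticsearch API), the lookup could be as small as a map from index name to its custom data path, consulted when resolving a shard's index and translog directories:

import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Toy sketch: remember a custom data_path per index and use it when deciding
// where a shard's Lucene index and translog should live.
public class PerIndexPathResolver {

    private final Map<String, Path> customDataPaths = new HashMap<>();
    private final Path defaultDataPath;

    public PerIndexPathResolver(Path defaultDataPath) {
        this.defaultDataPath = defaultDataPath;
    }

    public void registerCustomPath(String index, String dataPath) {
        customDataPaths.put(index, Paths.get(dataPath)); // e.g. "/tmp/myindex"
    }

    public Path shardIndexPath(String index, int shardId) {
        return shardPath(index, shardId).resolve("index");
    }

    public Path shardTranslogPath(String index, int shardId) {
        return shardPath(index, shardId).resolve("translog");
    }

    private Path shardPath(String index, int shardId) {
        Path base = customDataPaths.getOrDefault(index, defaultDataPath.resolve(index));
        return base.resolve(Integer.toString(shardId));
    }
}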

If you specify a custom data location, how do you handle the situation when data already exists for that index, and you need to tell Elasticsearch about it?

Maybe you can create the index with the path, close it, then re-open it? (DOESN'T WORK)

Gotta have something in the cluster state that tells ES where to look for custom index metadata?

Also, need to figure out how the translog is being created and use the custom location for that.

Create index

PUT /_cluster/settings
{
  "transient": {
    "logger.env": "TRACE"
  }
}

POST /myindex
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    //"data_template": "i_{{index_name}}/{{node_id}}/shard_{{shard_num}}",
    "data_path": "/tmp/myindex"
  }
}

POST /myindex/doc
{"body": "foo"}

POST /myindex/doc
{"body": "bar"}

POST /myindex/doc
{"body": "baz"}

POST /myindex/doc
{"body": "eggplant"}

POST /myindex/_refresh
{}

GET /_cat/count/myindex
{}
{"acknowledged":true,"persistent":{},"transient":{"logger":{"env":"TRACE"}}}
HTTP/1.1 400 Bad Request
Content-Type: application/json; charset=UTF-8
Content-Length: 213

{"error":"IndexCreationException[[myindex] failed to create index]; nested: ElasticsearchIllegalArgumentException[custom data_path is disabled unless all nodes are at least version 1.5.0-SNAPSHOT]; ","status":400}
{"_index":"myindex","_type":"doc","_id":"AUqf-WSCUlSGRj_zSu7l","_version":1,"created":true}
{"_index":"myindex","_type":"doc","_id":"AUqf-WXiUlSGRj_zSu7m","_version":1,"created":true}
{"_index":"myindex","_type":"doc","_id":"AUqf-WZhUlSGRj_zSu7n","_version":1,"created":true}
{"_index":"myindex","_type":"doc","_id":"AUqf-We-UlSGRj_zSu7o","_version":1,"created":true}
{"_shards":{"total":10,"successful":8,"failed":0}}
1420023130 11:52:10 4

Close index

POST /myindex/_close
{}
{"acknowledged":true}

Move data on disk

mv /tmp/myindex /tmp/foo

Update index settings

PUT /myindex/_settings?ignore_unavailable=true
{
  "index": {
    "data_path": "/tmp/foo"
  }
}

GET /myindex/_settings?pretty
{}
{"acknowledged":true}
{
  "myindex" : {
    "settings" : {
      "index" : {
        "number_of_shards" : "2",
        "ignore_unavailable" : "true",
        "creation_date" : "1418997720075",
        "number_of_replicas" : "1",
        "version" : {
          "created" : "2000099"
        },
        "uuid" : "5uwUpVXDTVGsfhuZBMMoSg",
        "data_path" : "/tmp/foo"
      }
    }
  }
}

Re-open and check index

POST /myindex/_open
{}

POST /myindex/_refresh
{}

GET /_cat/count/myindex
{}
{"acknowledged":true}
{"_shards":{"total":4,"successful":2,"failed":0}}
1419931043 10:17:23 4

[file trees]

tree /tmp/foo
tree ~/src/elasticsearch/data
/tmp/foo
├── 0
│   └── myindex
│       ├── 0
│       │   ├── index
│       │   │   ├── _0.cfe
│       │   │   ├── _0.cfs
│       │   │   ├── _0.si
│       │   │   ├── segments_2
│       │   │   └── write.lock
│       │   └── translog
│       │       └── translog-2
│       └── 1
│           ├── index
│           │   ├── _0.cfe
│           │   ├── _0.cfs
│           │   ├── _0.si
│           │   ├── segments_2
│           │   └── write.lock
│           └── translog
│               └── translog-2
└── 1
    └── myindex
        ├── 0
        │   ├── index
        │   │   ├── _0.cfe
        │   │   ├── _0.cfs
        │   │   ├── _0.si
        │   │   ├── segments_2
        │   │   └── write.lock
        │   └── translog
        │       └── translog-3
        └── 1
            ├── index
            │   ├── _0.cfe
            │   ├── _0.cfs
            │   ├── _0.si
            │   ├── segments_2
            │   └── write.lock
            └── translog
                └── translog-3

16 directories, 24 files
/Users/hinmanm/src/elasticsearch/data
└── elasticsearch
    └── nodes
        ├── 0
        │   ├── _state
        │   │   └── global-1.st
        │   ├── indices
        │   │   └── myindex
        │   │       ├── 0
        │   │       │   └── _state
        │   │       │       ├── state-3.st
        │   │       │       └── state-6.st
        │   │       ├── 1
        │   │       │   └── _state
        │   │       │       ├── state-3.st
        │   │       │       └── state-6.st
        │   │       └── _state
        │   │           └── state-5.st
        │   └── node.lock
        └── 1
            ├── _state
            │   └── global-1.st
            ├── indices
            │   └── myindex
            │       ├── 0
            │       │   └── _state
            │       │       └── state-6.st
            │       ├── 1
            │       │   └── _state
            │       │       └── state-6.st
            │       └── _state
            │           └── state-5.st
            └── node.lock

20 directories, 12 files


NEEDSREVIEW Shadow Engine [30/30] [100%]

A shadow replica is an instance of an IndexShard; since the target always starts the recovery, we can encapsulate the entire logic for this in a special IndexShard.

Need a dedicated Engine and a dedicated IndexShard for the shadow replicas.

TO START Index-level setting: shadow_replicas: true, start with this only and work from there

Reproduction for shared filesystem

This is for testing shadow replicas on a shared filesystem

This requires the following in elasticsearch.yml:

node.add_id_to_custom_path: false
node.enable_custom_paths: true

Create index with shadow_replicas: true

trash -a ~/src/elasticsearch/data
PUT /_cluster/settings
{
  "transient": {
    "logger._root": "DEBUG",
    "logger.env": "TRACE",
    "logger.index": "DEBUG",
    "logger.index.engine.internal": "TRACE",
    "logger.index.gateway": "TRACE",
    "logger.indices.cluster": "TRACE",
    "logger.cluster.action.shard": "TRACE",
    "logger.action.support.replication": "TRACE"
  }
}

POST /myindex
{
  "index": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "data_path": "/tmp/foo",
    "shadow_replicas": true
  }
}
{"acknowledged":true,"persistent":{},"transient":{"logger":{"cluster":{"action":{"shard":"TRACE"}},"_root":"DEBUG","indices":{"cluster":"TRACE"},"engine":{"internal":"TRACE"},"index":"DEBUG","action":{"support":{"replication":"TRACE"}},"env":"TRACE","index.gateway":"TRACE"}}}
{"acknowledged":true}

Adding documents

If implemented correctly, they should only be added to the primary shard and not to any of the replicas.

GET /_cat/shards
{}
index   shard prirep state   docs  store ip        node    
myindex 0     p      STARTED    4 11.1kb 10.0.0.16 Piranha 
myindex 0     r      STARTED    4 11.1kb 10.0.0.16 Buzz
GET /_cat/nodes?h=name,id&full_id
{}
name           id                     
Blonde Phantom sQ2_ZdEjQMaPF4GDAmLJog 
Tempo          pl8yj9dFTZmpgDihYe3uJg
POST /myindex/doc
{"body": "foo bar"}

POST /myindex/doc?refresh
{"body": "foo bar"}

POST /myindex/_flush
{}

POST /myindex/doc
{"body": "foo bar"}

POST /myindex/doc?refresh
{"body": "foo bar"}

GET /_cat/shards
{}
{"_index":"myindex","_type":"doc","_id":"AUtVqJYGgeBguaeEl0KR","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUtVqJhZgeBguaeEl0KS","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_shards":{"total":2,"successful":2,"failed":0}}
{"_index":"myindex","_type":"doc","_id":"AUtVqJjvgeBguaeEl0KT","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUtVqJkPgeBguaeEl0KU","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
index   shard prirep state   docs store ip        node    
myindex 0     p      STARTED    4 8.3kb 10.0.0.16 Box     
myindex 0     r      STARTED    2 8.3kb 10.0.0.16 Piranha

Relocate a shard

I am using this to test what will become shard relocation and recovery

POST /_cluster/reroute
{
  "commands" : [
    {
      "move" :
      {
        "index" : "myindex", "shard" : 0,
        "from_node" : "2lk0Ff_WTfeVNOQzlf5yGA", "to_node" : "RRw_HPuETNOs2gDnr4D4Hg"
      }
    }
  ]
}

Reproduction for regular filesystem

In this case, we need to move segments around whenever the primary flushes
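
Conceptually the copy step is: take the latest commit point on the primary, ship every file it references to the replica's directory, and let the replica reopen on it. A toy sketch at the Lucene/NIO level; the real recovery code would of course only copy files the replica is missing:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.store.FSDirectory;

public class CopyLatestCommitSketch {

    // Copy all files referenced by the primary's latest commit into the replica's path.
    public static void copyLatestCommit(Path primaryPath, Path replicaPath) throws Exception {
        Files.createDirectories(replicaPath);
        try (FSDirectory primaryDir = FSDirectory.open(primaryPath)) {
            List<IndexCommit> commits = DirectoryReader.listCommits(primaryDir);
            IndexCommit latest = commits.get(commits.size() - 1);
            // getFileNames() includes the segments_N file itself
            for (String file : latest.getFileNames()) {
                Files.copy(primaryPath.resolve(file), replicaPath.resolve(file),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}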

Requires this in elasticsearch.yml

node.enable_custom_paths: true

Create index with shadow_replicas: true

trash -a ~/src/elasticsearch/data
PUT /_cluster/settings
{
  "transient": {
    "logger._root": "DEBUG",
    "logger.env": "TRACE",
    "logger.index": "DEBUG",
    "logger.index.engine.internal": "TRACE",
    "logger.index.gateway": "TRACE",
    "logger.indices.cluster": "TRACE",
    "logger.cluster.action.shard": "TRACE",
    "logger.action.support.replication": "TRACE"
  }
}

POST /myindex
{
  "index": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "shadow_replicas": true,
    "shared_filesystem": false
  }
}
{"acknowledged":true,"persistent":{},"transient":{"logger":{"cluster":{"action":{"shard":"TRACE"}},"_root":"DEBUG","indices":{"cluster":"TRACE"},"engine":{"internal":"TRACE"},"index":"DEBUG","action":{"support":{"replication":"TRACE"}},"env":"TRACE","index.gateway":"TRACE"}}}
{"acknowledged":true}

Adding documents

If implemented correctly, they should only be added to the primary shard and not to any of the replicas.

GET /_cat/shards
{}
index   shard prirep state   docs store ip           node           
myindex 0     p      STARTED    2 5.7kb 192.168.5.65 Blonde Phantom 
myindex 0     r      STARTED    1 2.8kb 192.168.5.65 Mandrill       
myindex 1     r      STARTED    1 2.8kb 192.168.5.65 Blonde Phantom 
myindex 1     p      STARTED    2 5.5kb 192.168.5.65 Tempo
GET /_cat/nodes?h=name,id&full_id
{}
name           id                     
Blonde Phantom sQ2_ZdEjQMaPF4GDAmLJog 
Tempo          pl8yj9dFTZmpgDihYe3uJg
POST /myindex/doc
{"body": "foo bar"}

POST /myindex/doc?refresh
{"body": "foo bar"}

POST /myindex/_flush
{}

POST /myindex/doc
{"body": "foo bar"}

POST /myindex/doc?refresh
{"body": "foo bar"}

GET /_cat/shards
{}
{"_index":"myindex","_type":"doc","_id":"AUuAI-hVJM_bJBgIb2_d","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUuAI-l2JM_bJBgIb2_e","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_shards":{"total":4,"successful":4,"failed":0}}
{"_index":"myindex","_type":"doc","_id":"AUuAI-qhJM_bJBgIb2_f","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
{"_index":"myindex","_type":"doc","_id":"AUuAI-rDJM_bJBgIb2_g","_version":1,"_shards":{"total":2,"successful":2,"failed":0},"created":true}
index   shard prirep state   docs store ip           node           
myindex 0     p      STARTED    1 2.8kb 192.168.5.65 Blonde Phantom 
myindex 0     r      STARTED    1 2.8kb 192.168.5.65 Tempo          
myindex 1     r      STARTED    1 2.8kb 192.168.5.65 Blonde Phantom 
myindex 1     p      STARTED    2 5.5kb 192.168.5.65 Tempo

Tasks

DONE resolve the rest of the nocommits in shadow-replicas branch

DONE relocating a replica doesn't work

Right now relocating a shard doesn't work because flushing throws a missing file error for some reason :(

I need to track down where the flush is occurring and why it is happening even though both engines should be ignoring the flush command.

Everything looks peachy on the old replica side when relocating:

[DEBUG][discovery.zen.publish    ] [Flambe] received cluster state version 10
[DEBUG][cluster.service          ] [Flambe] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: execute
[DEBUG][cluster.service          ] [Flambe] cluster state updated, version [10], source [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]
[DEBUG][cluster.service          ] [Flambe] set local cluster state to version 10
[DEBUG][cluster.service          ] [Flambe] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: done applying updated cluster_state (version: 10)
[DEBUG][discovery.zen.publish    ] [Flambe] received cluster state version 11
[DEBUG][cluster.service          ] [Flambe] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: execute
[DEBUG][cluster.service          ] [Flambe] cluster state updated, version [11], source [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]
[DEBUG][cluster.service          ] [Flambe] set local cluster state to version 11
[DEBUG][indices.cluster          ] [Flambe] [myindex][0] removing shard (not allocated)
[DEBUG][index                    ] [Flambe] [myindex] [0] closing... (reason: [removing shard (not allocated)])
[DEBUG][index.shard              ] [Flambe] [myindex][0] state: [STARTED]->[CLOSED], reason [removing shard (not allocated)]
[DEBUG][index.engine.internal    ] [Flambe] [myindex][0] close now acquire writeLock
[DEBUG][index.engine.internal    ] [Flambe] [myindex][0] close acquired writeLock
[DEBUG][index.engine.internal    ] [Flambe] [myindex][0] close searcherManager
[TRACE][env                      ] [Flambe] shard lock wait count for [[myindex][0]] is now [0]
[TRACE][env                      ] [Flambe] last shard lock wait decremented, removing lock for [[myindex][0]]
[TRACE][env                      ] [Flambe] released shard lock for [[myindex][0]]
[DEBUG][index.store              ] [Flambe] [myindex][0] store reference count on close: 0
[DEBUG][index                    ] [Flambe] [myindex] [0] closed (reason: [removing shard (not allocated)])
[DEBUG][indices.cluster          ] [Flambe] [myindex] cleaning index (no shards allocated)
[DEBUG][indices                  ] [Flambe] [myindex] closing ... (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing index service (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing index cache (reason [removing index (no shards allocated)])
[DEBUG][index.cache.filter.weighted] [Flambe] [myindex] full cache clear, reason [close]
[DEBUG][index.cache.bitset       ] [Flambe] [myindex] clearing all bitsets because [close]
[DEBUG][indices                  ] [Flambe] [myindex] clearing index field data (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing analysis service (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing mapper service (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing index query parser service (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closing index service (reason [removing index (no shards allocated)])
[DEBUG][indices                  ] [Flambe] [myindex] closed... (reason [removing index (no shards allocated)])
[DEBUG][cluster.service          ] [Flambe] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: done applying updated cluster_state (version: 11)
[DEBUG][cluster.service          ] [Flambe] processing [indices_store]: execute
[DEBUG][indices.store            ] [Flambe] [myindex][0] deleting shard that is no longer used
[TRACE][env                      ] [Flambe] deleting shard [myindex][0] directory, paths: [[/Users/hinmanm/src/elasticsearch/data/elasticsearch/nodes/2/indices/myindex/0]]
[TRACE][env                      ] [Flambe] acquiring node shardlock on [[myindex][0]], timeout [0]
[TRACE][env                      ] [Flambe] successfully acquired shardlock for [[myindex][0]]
[TRACE][env                      ] [Flambe] skipping shard deletion because [myindex][0] uses shadow replicas
[TRACE][env                      ] [Flambe] shard lock wait count for [[myindex][0]] is now [0]
[TRACE][env                      ] [Flambe] last shard lock wait decremented, removing lock for [[myindex][0]]
[TRACE][env                      ] [Flambe] released shard lock for [[myindex][0]]
[DEBUG][cluster.service          ] [Flambe] processing [indices_store]: no change in cluster_state

As well as on the new (receiving) replica side:

[DEBUG][discovery.zen.publish    ] [Jane Kincaid] received cluster state version 10
[DEBUG][cluster.service          ] [Jane Kincaid] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: execute
[DEBUG][cluster.service          ] [Jane Kincaid] cluster state updated, version [10], source [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]
[DEBUG][cluster.service          ] [Jane Kincaid] set local cluster state to version 10
[DEBUG][indices.cluster          ] [Jane Kincaid] [myindex] creating index
[DEBUG][indices                  ] [Jane Kincaid] creating Index [myindex], shards [1]/[1]
[DEBUG][index.mapper             ] [Jane Kincaid] [myindex] using dynamic[true], default mapping: default_mapping_location[null], loaded_from[file:/Users/hinmanm/src/elasticsearch/target/classes/org/elasticsearch/index/mapper/default-mapping.json], default percolator mapping: location[null], loaded_from[null]
[DEBUG][index.cache.query.parser.resident] [Jane Kincaid] [myindex] using [resident] query cache with max_size [100], expire [null]
[DEBUG][index.store.fs           ] [Jane Kincaid] [myindex] using index.store.throttle.type [none], with index.store.throttle.max_bytes_per_sec [0b]
[DEBUG][indices.cluster          ] [Jane Kincaid] [myindex] adding mapping [doc], source [{"doc":{"properties":{"body":{"type":"string"}}}}]
[DEBUG][indices.cluster          ] [Jane Kincaid] [myindex][0] creating shard
[TRACE][env                      ] [Jane Kincaid] acquiring node shardlock on [[myindex][0]], timeout [5000]
[TRACE][env                      ] [Jane Kincaid] successfully acquired shardlock for [[myindex][0]]
[DEBUG][index                    ] [Jane Kincaid] [myindex] creating shard_id [myindex][0]
[DEBUG][index.store.fs           ] [Jane Kincaid] [myindex] using [/tmp/foo/myindex/0/index] as shard's index location
[DEBUG][index.merge.scheduler    ] [Jane Kincaid] [myindex][0] using [concurrent] merge scheduler with max_thread_count[2], max_merge_count[7], auto_throttle[true]
[DEBUG][index.store.fs           ] [Jane Kincaid] [myindex] using [/tmp/foo/myindex/0/translog] as shard's translog location
[DEBUG][index.deletionpolicy     ] [Jane Kincaid] [myindex][0] Using [keep_only_last] deletion policy
[DEBUG][index.merge.policy       ] [Jane Kincaid] [myindex][0] using [tiered] merge mergePolicy with expunge_deletes_allowed[10.0], floor_segment[2mb], max_merge_at_once[10], max_merge_at_once_explicit[30], max_merged_segment[5gb], segments_per_tier[10.0], reclaim_deletes_weight[2.0]
[DEBUG][index.shard              ] [Jane Kincaid] [myindex][0] state: [CREATED]
[DEBUG][index.translog           ] [Jane Kincaid] [myindex][0] interval [5s], flush_threshold_ops [2147483647], flush_threshold_size [512mb], flush_threshold_period [30m]
[DEBUG][index.shard              ] [Jane Kincaid] [myindex][0] state: [CREATED]->[RECOVERING], reason [from [Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]
[DEBUG][cluster.service          ] [Jane Kincaid] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: done applying updated cluster_state (version: 10)
[INFO ][index.engine.internal    ] [Jane Kincaid] [myindex][0] cowardly refusing to CREATE
[INFO ][index.engine.internal    ] [Jane Kincaid] [myindex][0] cowardly refusing to CREATE
[DEBUG][index.shard              ] [Jane Kincaid] [myindex][0] state: [RECOVERING]->[POST_RECOVERY], reason [post recovery]
[DEBUG][index.shard              ] [Jane Kincaid] [myindex][0] scheduling refresher every 1s
[DEBUG][indices.recovery         ] [Jane Kincaid] [myindex][0] recovery completed from [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}], took [168ms]
[DEBUG][cluster.action.shard     ] [Jane Kincaid] sending shard started for [myindex][0], node[KaR0Q_QFSs2Capu6kCBPaw], relocating [MqFW42-ZT8G6AL8zR2krTg], [R], s[INITIALIZING], indexUUID [HjGOiD2dTRiSDZFuWdyf8A], reason [after recovery (replica) from node [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]]
[DEBUG][discovery.zen.publish    ] [Jane Kincaid] received cluster state version 11
[DEBUG][cluster.service          ] [Jane Kincaid] processing [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]: execute
[DEBUG][cluster.service          ] [Jane Kincaid] cluster state updated, version [11], source [zen-disco-receive(from master [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}])]
[DEBUG][cluster.service          ] [Jane Kincaid] set local cluster state to version 11
[DEBUG][index.shard              ] [Jane Kincaid] [myindex][0] state: [POST_RECOVERY]->[STARTED], reason [global state is [STARTED]]

However, on the primary side:

[DEBUG][cluster.service          ] [Xemu] processing [cluster_reroute (api)]: execute
[DEBUG][cluster.routing.allocation.decider] [Xemu] Node [eJj0uF6YQIGQ0610E9-Rqg] has 30.184784851314088% free disk (75403067392 bytes)
[DEBUG][cluster.service          ] [Xemu] cluster state updated, version [10], source [cluster_reroute (api)]
[DEBUG][cluster.service          ] [Xemu] publishing cluster state version 10
[DEBUG][cluster.service          ] [Xemu] set local cluster state to version 10
[DEBUG][river.cluster            ] [Xemu] processing [reroute_rivers_node_changed]: execute
[DEBUG][river.cluster            ] [Xemu] processing [reroute_rivers_node_changed]: no change in cluster_state
[DEBUG][cluster.service          ] [Xemu] processing [cluster_reroute (api)]: done applying updated cluster_state (version: 10)
[DEBUG][cluster.service          ] [Xemu] processing [recovery_mapping_check]: execute
[DEBUG][cluster.service          ] [Xemu] processing [recovery_mapping_check]: no change in cluster_state
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: commit: start
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: commit: enter lock
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: commit: now prepare
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: prepareCommit: flush
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW:   index before flush _0(5.1.0):c2 _1(5.1.0):c2
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] DW: startFullFlush
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: apply all deletes during flush
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: now apply all deletes for all segments maxDoc=4
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] BD: applyDeletes: open segment readers took 0 msec
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] BD: applyDeletes: no segments; skipping
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] BD: prune sis=segments: _0(5.1.0):c2 _1(5.1.0):c2 minGen=3 packetCount=0
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] DW: elasticsearch[Xemu][generic][T#7] finishFullFlush success=true
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: startCommit(): start
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: startCommit index=_0(5.1.0):c2 _1(5.1.0):c2 changeCount=8
[TRACE][index.engine.internal.lucene.iw] [Xemu] [myindex][0] elasticsearch[Xemu][generic][T#7] IW: hit exception committing segments file
[WARN ][index.engine.internal    ] [Xemu] [myindex][0] failed to flush shard post recovery
org.elasticsearch.index.engine.FlushFailedEngineException: [myindex][0] Flush failed
  at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:799)
  at org.elasticsearch.index.engine.internal.InternalEngine$FlushingRecoveryCounter.endRecovery(InternalEngine.java:1435)
  at org.elasticsearch.index.engine.internal.RecoveryCounter.close(RecoveryCounter.java:63)
  at org.elasticsearch.common.lease.Releasables.close(Releasables.java:45)
  at org.elasticsearch.common.lease.Releasables.close(Releasables.java:60)
  at org.elasticsearch.common.lease.Releasables.close(Releasables.java:81)
  at org.elasticsearch.common.lease.Releasables.close(Releasables.java:89)
  at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1048)
  at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:673)
  at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:120)
  at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:48)
  at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:141)
  at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:127)
  at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:276)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.NoSuchFileException: /private/tmp/foo/myindex/0/index/_1.cfs
  at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
  at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
  at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
  at java.nio.channels.FileChannel.open(FileChannel.java:287)
  at java.nio.channels.FileChannel.open(FileChannel.java:335)
  at org.apache.lucene.util.IOUtils.fsync(IOUtils.java:392)
  at org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:281)
  at org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:226)
  at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:78)
  at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:78)
  at org.apache.lucene.store.FilterDirectory.sync(FilterDirectory.java:78)
  at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4284)
  at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2721)
  at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2824)
  at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2791)
  at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:785)
  ... 17 more
[DEBUG][cluster.action.shard     ] [Xemu] received shard started for [myindex][0], node[KaR0Q_QFSs2Capu6kCBPaw], relocating [MqFW42-ZT8G6AL8zR2krTg], [R], s[INITIALIZING], indexUUID [HjGOiD2dTRiSDZFuWdyf8A], reason [after recovery (replica) from node [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]]
[DEBUG][cluster.service          ] [Xemu] processing [shard-started ([myindex][0], node[KaR0Q_QFSs2Capu6kCBPaw], relocating [MqFW42-ZT8G6AL8zR2krTg], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]]]: execute
[DEBUG][cluster.action.shard     ] [Xemu] [myindex][0] will apply shard started [myindex][0], node[KaR0Q_QFSs2Capu6kCBPaw], relocating [MqFW42-ZT8G6AL8zR2krTg], [R], s[INITIALIZING], indexUUID [HjGOiD2dTRiSDZFuWdyf8A], reason [after recovery (replica) from node [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]]
[DEBUG][cluster.routing.allocation.decider] [Xemu] Node [eJj0uF6YQIGQ0610E9-Rqg] has 30.184784851314088% free disk (75403067392 bytes)
[DEBUG][cluster.routing.allocation.decider] [Xemu] Node [KaR0Q_QFSs2Capu6kCBPaw] has 30.184784851314088% free disk (75403067392 bytes)
[DEBUG][cluster.service          ] [Xemu] cluster state updated, version [11], source [shard-started ([myindex][0], node[KaR0Q_QFSs2Capu6kCBPaw], relocating [MqFW42-ZT8G6AL8zR2krTg], [R], s[INITIALIZING]), reason [after recovery (replica) from node [[Xemu][eJj0uF6YQIGQ0610E9-Rqg][Xanadu.local][inet[/10.0.0.16:9300]]{add_id_to_custom_path=false, enable_custom_paths=true}]]]
[DEBUG][cluster.service          ] [Xemu] publishing cluster state version 11
[DEBUG][cluster.service          ] [Xemu] set local cluster state to version 11

Here's the full primary log:

relocate-replica.log

Aha! I know why, it's because when we open the IndexReader on the new replica, it removes all the files the last commit point doesn't know about, which would be the _1.cfs file that the primary is complaining about…

Fixed this by not removing anything if shadow replicas are in use for the index, but I'm not sure how happy I am with this solution… are we going to leak files in the directory?

DONE relocating a primary doesn't work

So, because the original primary still holds the write.lock file, the soon-to-become primary can't open an IndexWriter on it, so hrm…

Shay: we should just throw an exception if a primary tries to relocate, fail the primary, and a new one will be promoted from the replicas. This will unblock us for the time being.

In the future, we might need to close the InternalEngine for the primary before relocating the data to the new primary. Either that, or we need to keep the newly created primary from opening an IndexWriter until all the segments are copied over.
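
The constraint underneath is Lucene's single-writer lock: a second IndexWriter on the same directory fails straight away, which is exactly why the soon-to-become primary can't open a writer while the old one still holds write.lock. A minimal reproduction (path illustrative):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;

public class WriteLockConflictSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/tmp/shared-shard"));
        IndexWriter oldPrimary = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        try {
            // the "new primary" trying to take over the same shared directory
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));
        } catch (LockObtainFailedException e) {
            System.out.println("second writer refused: " + e.getMessage());
        }
        oldPrimary.close(); // only after this can another writer acquire write.lock
        dir.close();
    }
}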

ShadowEngine tests [8/8]

Yep, I need to write lots of them…

DONE test basic engine things

that all the write operations don't actually get executed, basically that it is read-only

DONE test that a flush & refresh makes new segments visible to replica

of course, only on a shared file-system

DONE test that replica -> primary promotion works

this works in manual testing, but need a test in Java that does it

Simon is working on this

DONE test that replica relocation works

This should be the easier of the two, since it doesn't actually have to replay anything, just create a new IndexReader on the replica

DONE test that primary relocation works

This is tough because of the IndexWriter problem

Right now we fail the shard, is this desired?

DONE test that deleting an index with shadow replicas cleans up the disk

Simon may do this, or I may, depending on who gets to it first.

I added this test.

DONE figure out the Mock modules snafu

Right now all the Mock wrappers we have fail the tests like crazy, but I don't know how to disable them from the test itself. Figure this out and then figure out how to re-enable them and not fail the tests

This is the tests.enable_mock_modules setting (in InternalTestCluster), which defaults to true. With this enabled, I get failures like:

java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still open files: {_0.cfs=1}
	at __randomizedtesting.SeedInfo.seed([8B82D7C4C56C71B1:DC4DF66AAB62646F]:0)
	at org.apache.lucene.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:762)
	at org.elasticsearch.test.store.MockDirectoryHelper$ElasticsearchMockDirectoryWrapper.close(MockDirectoryHelper.java:135)
	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAllFilesClosed(ElasticsearchAssertions.java:681)
	at org.elasticsearch.test.TestCluster.assertAfterTest(TestCluster.java:85)
	at org.elasticsearch.test.InternalTestCluster.assertAfterTest(InternalTestCluster.java:1743)
	at org.elasticsearch.test.ElasticsearchIntegrationTest.afterInternal(ElasticsearchIntegrationTest.java:618)

Figure this out with Simon

DONE disable checking index when closed

Trying to check a replica that's been closed doesn't work: the primary is still using that data, so there's still a lock on the directory, and checking the index opens a new lock.

CANCELLED shadow engine freaks out if there are no segments* files on disk

  • State "CANCELLED" from "TODO" [2015-02-11 Wed 15:59]
    Simon said this is as-expected

See:

org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(ElasticsearchMockDirectoryWrapper(least_used[default(mmapfs(/private/var/folders/kb/fplmcbhs4z52p4x59cqgm4kw0000gn/T/tests-20150211010430-146/0000 has-space/test/2/index),niofs(/private/var/folders/kb/fplmcbhs4z52p4x59cqgm4kw0000gn/T/tests-20150211010430-146/0000 has-space/test/2/index))])): files: [recovery.1423613070833.segments_1, write.lock]
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:637)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:68)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
	at org.elasticsearch.index.engine.ShadowEngine.<init>(ShadowEngine.java:68)
	at org.elasticsearch.index.engine.ShadowEngineFactory.newEngine(ShadowEngineFactory.java:28)
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1170)

This happens if the shadow engine is created on a new index before the InternalEngine
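
That matches plain Lucene behaviour: opening a DirectoryReader over a directory that has no commit yet throws IndexNotFoundException. A minimal illustration:

import java.nio.file.Files;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexNotFoundException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NoCommitYetSketch {
    public static void main(String[] args) throws Exception {
        // an empty shard directory, i.e. no segments_N file has been committed yet
        Directory dir = FSDirectory.open(Files.createTempDirectory("empty-shard"));
        try {
            DirectoryReader.open(dir); // what a shadow engine effectively does on startup
        } catch (IndexNotFoundException e) {
            System.out.println("no commit point yet: " + e.getMessage());
        }
        dir.close();
    }
}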

DONE make shadow replicas not delete the data directory, the primary will do that

I changed NodeEnvironment.deleteShardDirectorySafe not to delete the shard directory if shadow replicas are used; deleting the index directory will take care of that.

DONE add a read only block while recovering translog on primary

Currently we replay the translog "internally" when a shadow replica has been promoted to a primary; however, we don't actually block any indexing operations while this replay is going on. We need to add a block for the duration.

Clint, see: https://github.com/elasticsearch/elasticsearch/pull/9203

Maybe Tanguy knows more about how to add this dynamically.

Fixed this by failing the shard if it is promoted to a primary

DONE make shadow replicas clean up if they are the last deletion in the cluster

Currently I prevent NodeEnvironment from deleting anything if shadow replicas are used, but I need a way to ensure the last deletion in the cluster actually removes the directory.

Right now, whoever uses this will be responsible for removing the data themselves.

This is a subset of the task above, but implemented correctly. Simon says that the deletion should move up to the IndexService instead of living at a lower level.

Simon fixed this! \o/

DONE update shadow replica issue with new plan (shadow_replicas: true)

DONE write.lock missing when closing primary shard?!

I think NodeEnvironment is doing this

Found it:

[2015-01-16 15:25:24,590][TRACE][index.store.deletes      ] [Thane Ector][myindex][0] recovery CleanFilesRequestHandler: delete file FOO
[2015-01-16 15:25:24,591][TRACE][index.store.deletes      ] [Thane Ector][myindex][0] StoreDirectory.deleteFile: delete file FOO

It's CleanFilesRequestHandler, which invokes Store.cleanupAndVerify, which in turn deletes every file not contained in the source's metadata.

See:

[2015-01-14 14:51:16,431][INFO ][node                     ] [Gamora] stopping ...
[2015-01-14 14:51:16,462][DEBUG][index.engine.internal    ] [Gamora] [myindex][0] close now acquire writeLock
[2015-01-14 14:51:16,462][DEBUG][index.engine.internal    ] [Gamora] [myindex][0] close acquired writeLock
[2015-01-14 14:51:16,462][DEBUG][index.engine.internal    ] [Gamora] [myindex][0] close searcherManager
[2015-01-14 14:51:16,463][DEBUG][index.engine.internal    ] [Gamora] [myindex][0] rollback indexWriter
[2015-01-14 14:51:16,469][WARN ][index.engine.internal    ] [Gamora] [myindex][0] failed to rollback writer on close
java.nio.file.NoSuchFileException: /private/tmp/foo/myindex/0/index/write.lock
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
	at sun.nio.fs.UnixPath.toRealPath(UnixPath.java:837)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.clearLockHeld(NativeFSLockFactory.java:182)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.close(NativeFSLockFactory.java:172)
	at org.apache.lucene.util.IOUtils.close(IOUtils.java:96)
	at org.apache.lucene.util.IOUtils.close(IOUtils.java:83)
	at org.apache.lucene.index.IndexWriter.rollbackInternal(IndexWriter.java:2044)
	at org.apache.lucene.index.IndexWriter.rollback(IndexWriter.java:1972)
	at org.elasticsearch.index.engine.internal.InternalEngine.close(InternalEngine.java:1229)
	at org.apache.lucene.util.IOUtils.close(IOUtils.java:96)
	at org.apache.lucene.util.IOUtils.close(IOUtils.java:83)
	at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:669)
	at org.elasticsearch.index.IndexService.closeShardInjector(IndexService.java:384)
	at org.elasticsearch.index.IndexService.removeShard(IndexService.java:363)
	at org.elasticsearch.index.IndexService.close(IndexService.java:255)
	at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:373)
	at org.elasticsearch.indices.IndicesService.access$000(IndicesService.java:89)
	at org.elasticsearch.indices.IndicesService$1.run(IndicesService.java:131)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

DONE ensure no merges run on shadow replicas

Need to use a custom merge policy and/or scheduler?

I think this is handled by not registering a merge policy with the IndexWriter, but I should make sure the IW isn't installing a default policy.
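
For reference, IndexWriterConfig does install defaults (a TieredMergePolicy and a ConcurrentMergeScheduler), so if an IndexWriter were ever opened on a shadow replica it would have to be told explicitly never to merge, along these lines:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.NoMergeScheduler;

public class NoMergesConfigSketch {
    public static IndexWriterConfig noMergesConfig() {
        return new IndexWriterConfig(new StandardAnalyzer())
                // override the defaults so the writer never schedules a merge
                .setMergePolicy(NoMergePolicy.INSTANCE)
                .setMergeScheduler(NoMergeScheduler.INSTANCE);
    }
}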

DONE Tell Luca and Clint about the typo in ShardStateAction

We have:

public static final String SHARD_STARTED_ACTION_NAME = "internal:cluster/shard/failure";
public static final String SHARD_FAILED_ACTION_NAME = "internal:cluster/shard/started";

DONE ensure shadow becomes normal after switching from replica to primary

I kind of added this in IndexShard.shardRouting() already

DONE handle replaying translog on recovery if replica is shadow

I think this can be done by creating a new shard and doing local recovery, but will need to check more.

Look in IndexShardGatewayService.recover and IndexShardGateway.recover; this is where we recover from the translog. Can I just call it manually?

[DEBUG][cluster.action.shard] sending shard started for [myindex][0], node[4-kS-6aGQUWE7Qi65hxUEg], [P], s[STARTED], indexUUID [FEBqQ6I4Q0asig34M1Ai-w], reason [after recovery from gateway]
[DEBUG][cluster.action.shard] received shard started for [myindex][0], node[4-kS-6aGQUWE7Qi65hxUEg], [P], s[STARTED], indexUUID [FEBqQ6I4Q0asig34M1Ai-w], reason [after recovery from gateway]
[DEBUG][cluster.action.shard] [myindex][0] ignoring shard started event for [myindex][0], node[4-kS-6aGQUWE7Qi65hxUEg], [P], s[STARTED], indexUUID [FEBqQ6I4Q0asig34M1Ai-w], reason [after recovery from gateway], current state: STARTED

DONE create the shard routing with the shadow flag in RoutingNodes

CANCELLED sometimes a regular engine gets created instead of a shadow one

  • State "CANCELLED" from "TODO" [2015-01-14 Wed 14:50]
    fixed this

CANCELLED don't replicate operations if the replica is a shadow replica

  • State "CANCELLED" from "TODO" [2015-01-14 Wed 14:50]
    not doing this, the operation is transmitted but ignored

CANCELLED Figure out how to send shard started event after re-creation

  • State "CANCELLED" from "TODO" [2015-02-02 Mon 15:09]
    don't need this anymore, we are recovering internally

The shard recreates fine, but whenever we try to send the "hey I'm started now", the master rejects it because it thinks the shard is already started.

DONE don't send docs to the replica when using shared FS

[2015-02-13 Fri 09:03]

I think we can stick this in TransportShardReplicationOperationAction

CANCELLED ensure mapping is updated on replica when request is not sent

  • State "CANCELLED" from "TODO" [2015-02-17 Tue 09:34]
    Simon said Shay mentioned this is as expected

Related to "don't send docs to the replica when using shared FS": there is currently a nocommit for when dynamic mappings are used. It looks like the replica doesn't get the new mappings until much later in this case.

DONE add tests for failover when indexing concurrently

[2015-02-13 Fri 09:03]

DONE make ShadowEngine throw exceptions instead of no-ops

[2015-02-13 Fri 12:04]

CANCELLED look into waiting for mapping changes on refresh for SRs

[2015-02-13 Fri 12:10]

Author: Lee Hinman

Created: 2015-02-18 Wed 10:40

Emacs 24.4.1 (Org mode 8.2.10)
