# Recovering the control plane

To recover from broken nodes in the control plane, use the "recover-control-plane.yml" playbook.

* Back up what you can
* Provision new nodes to replace the broken ones
* Place the surviving nodes of the control plane first in the "etcd" and "kube-master" groups
* Add the new nodes below the surviving control plane nodes in the "etcd" and "kube-master" groups (see the sketch below)
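
One possible reading of these placement steps, sketched as an INI inventory with hypothetical hostnames (`master-1` survived, `master-2` failed, and a freshly provisioned `master-4` replaces it):

```
[kube-master]
# surviving control plane node(s) first
master-1
# new node(s) below
master-4

[etcd]
# surviving etcd member(s) first
master-1
# new member(s) below
master-4
```
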
Examples of what broken means in this context:

* One or more bare metal node(s) suffer from unrecoverable hardware failure
* One or more node(s) fail during patching or upgrading
* Etcd database corruption
* Other node related failures leaving your control plane degraded or nonfunctional

__Note that you need at least one functional node to be able to recover using this method.__
## Runbook

* Move any broken etcd nodes into the "broken_etcd" group, and make sure the "etcd_member_name" variable is set for them.
* Move any broken master nodes into the "broken_kube-master" group.

Then run the playbook with `--limit etcd,kube-master` and increase the number of etcd retries by setting `-e etcd_retries=10` or something even larger. The number of retries required is difficult to predict.
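
As a hedged sketch, continuing the hypothetical inventory above: suppose `master-2` is the broken node and `etcd2` is its member name in the etcd cluster (both are placeholders; `etcd_member_name` must match the broken member's actual name).

```
[broken_kube-master]
master-2

[broken_etcd]
# etcd_member_name must match the name of the broken member in the etcd cluster
master-2 etcd_member_name=etcd2
```

The playbook could then be run along these lines (the inventory path and retry count are illustrative):

```
ansible-playbook -i inventory/mycluster/inventory.ini recover-control-plane.yml \
  --limit etcd,kube-master -e etcd_retries=10
```
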
When finished, you should have a fully working control plane again.
## Recover from lost quorum

The playbook attempts to figure out if the etcd quorum is intact. If quorum is lost, it will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot, set the path to that snapshot in the "etcd_snapshot" variable:

```
-e etcd_snapshot=/tmp/etcd_snapshot
```
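
For example, a recovery run restoring from an alternate snapshot might be invoked along these lines (the inventory path and retry count are illustrative, not prescribed by the playbook):

```
ansible-playbook -i inventory/mycluster/inventory.ini recover-control-plane.yml \
  --limit etcd,kube-master -e etcd_retries=10 -e etcd_snapshot=/tmp/etcd_snapshot
```
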
## Caveats

* The playbook has only been tested with fairly small etcd databases.
* If your new control plane nodes have new IP addresses you may have to change settings in various places.
* There may be disruptions while running the playbook.
* There are absolutely no guarantees.

If possible, try to break a cluster in the same way that your target cluster is broken and practice recovering from that before trying it on the real target cluster.