Add document about adding/replacing a node (#5570)

* Add document about adding/replacing a node * Update nodes.md Amend for comments
2020-03-15 18:32:34 +08:00 · 2020-03-15 18:32:34 +08:00 · ea9f8b4258
commit ea9f8b4258
parent 1cb03a184b
3 changed files with 131 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -93,6 +93,7 @@ vagrant up
 - [vSphere](docs/vsphere.md)
 - [Packet Host](docs/packet.md)
 - [Large deployments](docs/large-deployments.md)
+- [Adding/replacing a node](docs/nodes.md)
 - [Upgrades basics](docs/upgrades.md)
 - [Roadmap](docs/roadmap.md)

--- a/docs/_sidebar.md
+++ b/docs/_sidebar.md
@ -7,6 +7,7 @@
  * [Integration](docs/integration.md)
  * [Upgrades](/docs/upgrades.md)
  * [HA Mode](docs/ha-mode.md)
+  * [Adding/replacing a node](docs/nodes.md)
  * [Large deployments](docs/large-deployments.md)
 * CNI
  * [Calico](docs/calico.md)
--- a/docs/nodes.md
+++ b/docs/nodes.md
@ -0,0 +1,129 @@
+# Adding/replacing a node
+
+Modified from [comments in #3471](https://github.com/kubernetes-sigs/kubespray/issues/3471#issuecomment-530036084)
+
+## Adding/replacing a worker node
+
+This should be the easiest.
+
+### 1) Add new node to the inventory
+
+### 2) Run `scale.yml`
+
+You can use `--limit=node1` to limit Kubespray to avoid disturbing other nodes in the cluster.
+
+### 3) Drain the node that will be removed
+
+```sh
+kubectl drain NODE_NAME
+```
+
+### 4) Run the remove-node.yml playbook
+
+With the old node still in the inventory, run `remove-node.yml`. You need to pass `-e node=NODE_NAME` to the playbook to limit the execution to the node being removed.
+
+### 5) Remove the node from the inventory
+
+That's it.
+
+## Adding/replacing a master node
+
+### 1) Recreate apiserver certs manually to include the new master node in the cert SAN field
+
+For some reason, Kubespray will not update the apiserver certificate.
+
+Edit `/etc/kubernetes/kubeadm-config.yaml`, include new host in `certSANs` list.
+
+Use kubeadm to recreate the certs.
+
+```sh
+cd /etc/kubernetes/ssl
+mv apiserver.crt apiserver.crt.old
+mv apiserver.key apiserver.key.old
+
+cd /etc/kubernetes
+kubeadm init phase certs apiserver --config kubeadm-config.yaml
+```
+
+Check the certificate, new host needs to be there.
+
+```sh
+openssl x509 -text -noout -in /etc/kubernetes/ssl/apiserver.crt
+```
+
+### 2) Run `cluster.yml`
+
+Add the new host to the inventory and run cluster.yml.
+
+### 3) Restart kube-system/nginx-proxy
+
+In all hosts, restart nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray will update its static config, but it needs to be restarted in order to reload.
+
+```sh
+# run in every host
+docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart
+```
+
+### 4) Remove old master nodes
+
+If you are replacing a node, remove the old one from the inventory, and remove from the cluster runtime.
+
+```sh
+kubectl drain NODE_NAME
+kubectl delete node NODE_NAME
+```
+
+After that, the old node can be safely shutdown. Also, make sure to restart nginx-proxy in all remaining nodes (step 3)
+
+From any active master that remains in the cluster, re-upload `kubeadm-config.yaml`
+
+```sh
+kubeadm config upload from-file --config /etc/kubernetes/kubeadm-config.yaml
+```
+
+## Adding/Replacing an etcd node
+
+You need to make sure there are always an odd number of etcd nodes in the cluster. In such a way, this is always a replace or scale up operation. Either add two new nodes or remove an old one.
+
+### 1) Add the new node running cluster.yml
+
+Update the inventory and run `cluster.yml` passing `--limit=etcd,kube-master -e ignore_assert_errors=yes`.
+
+Run `upgrade-cluster.yml` also passing `--limit=etcd,kube-master -e ignore_assert_errors=yes`. This is necessary to update all etcd configuration in the cluster.
+
+At this point, you will have an even number of nodes. Everything should still be working, and you should only have problems if the cluster decides to elect a new etcd leader before you remove a node. Even so, running applications should continue to be available.
+
+### 2) Remove an old etcd node
+
+With the node still in the inventory, run `remove-node.yml` passing `-e node=NODE_NAME` as the name of the node that should be removed.
+
+### 3) Make sure the remaining etcd members have their config updated
+
+In each etcd host that remains in the cluster:
+
+```sh
+cat /etc/etcd.env | grep ETCD_INITIAL_CLUSTER
+```
+
+Only active etcd members should be in that list.
+
+### 4) Remove old etcd members from the cluster runtime
+
+Acquire a shell prompt into one of the etcd containers and use etcdctl to remove the old member.
+
+```sh
+# list all members
+etcdctl member list
+
+# remove old member
+etcdctl member remove MEMBER_ID
+# careful!!! if you remove a wrong member you will be in trouble
+
+# note: these command lines are actually much bigger, since you need to pass all certificates to etcdctl.
+```
+
+### 5) Make sure the apiserver config is correctly updated
+
+In every master node, edit `/etc/kubernetes/manifests/kube-apiserver.yaml`. Make sure only active etcd nodes are still present in the apiserver command line parameter `--etcd-servers=...`.
+
+### 6) Shutdown the old instance