c12s-kubespray/docs/large-deployments.md
Matthew Mosesohn c96fa2f4fc Add scale thresholds to split etcd and k8s-masters
Also adds calico-rr group if there are standalone etcd nodes.
Now if there are 50 or more nodes, 3 etcd nodes will be standalone.
If there are 200 or more nodes, 2 kube-masters will be standalone.
If thresholds are exceeded, kube-node group cannot add nodes that
belong to etcd or kube-master groups (according to above statements).
2017-01-19 17:30:56 +03:00

2 KiB

Large deployments of K8s

For a large scaled deployments, consider the following configuration changes:

  • Tune ansible settings for forks and timeout vars to fit large numbers of nodes being deployed.

  • Override containers' foo_image_repo vars to point to intranet registry.

  • Override the download_run_once: true and/or download_localhost: true. See download modes for details.

  • Adjust the retry_stagger global var as appropriate. It should provide sane load on a delegate (the first K8s master node) then retrying failed push or download operations.

  • Tune parameters for DNS related applications (dnsmasq daemon set, kubedns replication controller). Those are dns_replicas, dns_cpu_limit, dns_cpu_requests, dns_memory_limit, dns_memory_requests. Please note that limits must always be greater than or equal to requests.

  • Tune CPU/memory limits and requests. Those are located in roles' defaults and named like foo_memory_limit, foo_memory_requests and foo_cpu_limit, foo_cpu_requests. Note that 'Mi' memory units for K8s will be submitted as 'M', if applied for docker run, and cpu K8s units will end up with the 'm' skipped for docker as well. This is required as docker does not understand k8s units well.

  • Add calico-rr nodes if you are deploying with Calico or Canal. Nodes recover from host/network interruption much quicker with calico-rr. Note that calico-rr role must be on a host without kube-master or kube-node role (but etcd role is okay).

  • Check out the Inventory section of the Getting started guide for tips on creating a large scale Ansible inventory.

For example, when deploying 200 nodes, you may want to run ansible with --forks=50, --timeout=600 and define the retry_stagger: 60.