c12s-kubespray/docs/large-deployments.md

Large deployments of K8s
========================

For large-scale deployments, consider the following configuration changes:

* Tune the `forks` and `timeout` [ansible settings](http://docs.ansible.com/ansible/intro_configuration.html)
  to fit the large number of nodes being deployed.

* Override the containers' `foo_image_repo` vars to point to an intranet registry.

* Set ``download_run_once: true`` to download container images only once and
  then push them to cluster nodes in batches. The default delegate node
  for pushing images is the first kube-master. Note that if you have passwordless sudo
  and docker enabled on a separate admin node, you may want to set
  ``download_localhost: true``, which makes that node the delegate for pushing images
  while running the deployment with ansible. This may be the case if cluster nodes
  cannot access each other via ssh, or if you want to use local docker images as a cache
  for multiple clusters.

* Adjust the `retry_stagger` global var as appropriate. It should keep the
  load on the delegate (the first K8s master node) reasonable when retrying failed
  push or download operations.

* Tune parameters for DNS-related applications (dnsmasq daemon set, kubedns
  replication controller). Those are ``dns_replicas``, ``dns_cpu_limit``,
  ``dns_cpu_requests``, ``dns_memory_limit``, ``dns_memory_requests``.
  Please note that each limit must always be greater than or equal to the
  corresponding request.
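
As a rough sketch, the DNS tuning vars listed above could be overridden in a group vars file like this. The values are illustrative assumptions, not project defaults, and the Kubernetes quantity notation (`m` for millicores, `Mi` for mebibytes) is assumed to be what these vars accept. The file location depends on your inventory layout.

```yaml
# Illustrative values only; adjust to the size of your cluster.
dns_replicas: 3

# Each limit is kept greater than or equal to its matching request.
dns_cpu_requests: 100m
dns_cpu_limit: 200m
dns_memory_requests: 100Mi
dns_memory_limit: 200Mi
```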

For example, when deploying 200 nodes, you may want to run ansible with
``--forks=50`` and ``--timeout=600``, and define ``retry_stagger: 60``.
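
A minimal sketch of the matching variable overrides for that 200-node example is shown here. Only ``retry_stagger: 60`` comes from the example itself; enabling ``download_run_once`` alongside it is an assumption about how the settings would typically be combined, and ``download_localhost`` is left commented out because it only applies to the admin-node setup described earlier. The ``--forks`` and ``--timeout`` options are passed on the ansible command line and are not part of this file.

```yaml
# From the 200-node example: stagger retries of failed pushes/downloads.
retry_stagger: 60

# Assumed companion setting: download images once, then push in batches.
download_run_once: true

# Uncomment only if a separate admin node with passwordless sudo and
# docker should act as the delegate instead of the first kube-master.
# download_localhost: true
```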