When etcd exceeds its memory limit, it becomes useless but keeps running.
We should let OOM killer kill etcd process in the container, so systemd can spot
the problem and restart etcd according to "Restart" setting in etcd.service unit file.
If OOME problem keep repeating, i.e. it happens every single restart,
systemd will eventually back off and stop restarting it anyway.
--restart=on-failure:5 in this file has no effect because memory allocation error
doesn't by itself cause the process to die
Related: https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/templates/etcd-docker.service.j2
This kind of reverts a change introduced in #1860.
Some installation are failing to authenticate with peers due to
etcd picking up/resoling the wrong node.
By setting 'etcd_peer_client_auth' to "False" you can disable peer client cert
authentication.
Signed-off-by: Sébastien Han <seb@redhat.com>
* Defaults for apiserver_loadbalancer_domain_name
When loadbalancer_apiserver is defined, use the
apiserver_loadbalancer_domain_name with a given default value.
Fix unconsistencies for checking if apiserver_loadbalancer_domain_name
is defined AND using it with a default value provided at once.
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
* Define defaults for LB modes in common defaults
Adjust the defaults for apiserver_loadbalancer_domain_name and
loadbalancer_apiserver_localhost to come from a single source, which is
kubespray-defaults. Removes some confusion and simplefies the code.
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
* Adding yaml linter to ci check
* Minor linting fixes from yamllint
* Changing CI to install python pkgs from requirements.txt
- adding in a secondary requirements.txt for tests
- moving yamllint to tests requirements
- Run docker run from script rather than directly from systemd target
- Refactoring styling/templates
Signed-off-by: Sergii Golovatiuk <sgolovatiuk@mirantis.com>
When a apiserver_loadbalancer_domain_name is added to the Openssl.conf
the counter gets not increased correctly. This didnt seem to have an
effect at the current kargo version.
Reduce election timeout to 5000ms (was 10000ms)
Raise heartbeat interval to 250ms (was 100ms)
Remove etcd cpu share (was 300)
Make etcd_cpu_limit and etcd_memory_limit optional.
* Drop linux capabilities for unprivileged containerized
worlkoads Kargo configures for deployments.
* Configure required securityContext/user/group/groups for kube
components' static manifests, etcd, calico-rr and k8s apps,
like dnsmasq daemonset.
* Rework cloud-init (etcd) users creation for CoreOS.
* Fix nologin paths, adjust defaults for addusers role and ensure
supplementary groups membership added for users.
* Add netplug user for network plugins (yet unused by privileged
networking containers though).
* Grant the kube and netplug users read access for etcd certs via
the etcd certs group.
* Grant group read access to kube certs via the kube cert group.
* Remove priveleged mode for calico-rr and run it under its uid/gid
and supplementary etcd_cert group.
* Adjust docs.
* Align cpu/memory limits and dropped caps with added rkt support
for control plane.
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
* Add restart for weave service unit
* Reuse docker_bin_dir everythere
* Limit systemd managed docker containers by CPU/RAM. Do not configure native
systemd limits due to the lack of consensus in the kernel community
requires out-of-tree kernel patches.
Signed-off-by: Bogdan Dobrelya <bdobrelia@mirantis.com>
etcd facts are generated in kubernetes/preinstall, so etcd nodes need
to be evaluated first before the rest of the deployment.
Moved several directory facts from kubernetes/node to
kubernetes/preinstall because they are not backward dependent.
Improved docker reload command to wait for etcd to be
up before proceeding. Switched reload to run restart
because it can't reload if it is not guaranteed to be
in running state.