Fixed issue #7112. Created new API Server vars that replace the defunct Controller Manager one (#7114)

Signed-off-by: Brendan Holmes <5072156+holmesb@users.noreply.github.com>
holmesb 2021-01-08 15:20:53 +00:00 committed by GitHub
parent ab2bfd7f8c
commit b0ad8ec023
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23
4 changed files with 25 additions and 13 deletions


@@ -43,8 +43,10 @@ attempts to set a status of node.

 At the same time, the Kubernetes controller manager will try to check
 `nodeStatusUpdateRetry` times every `--node-monitor-period` of time. After
-`--node-monitor-grace-period` it will consider node unhealthy. It will remove
-its pods based on `--pod-eviction-timeout`
+`--node-monitor-grace-period` it will consider the node unhealthy. Pods will then be rescheduled based on the
+[Taint Based Eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions)
+timers that you set on them individually, or the API Server's global timers: `--default-not-ready-toleration-seconds` &
+`--default-unreachable-toleration-seconds`.

 Kube proxy has a watcher over the API. Once pods are evicted, Kube proxy will
 notice and will update the iptables of the node. It will remove endpoints from
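As a rough sanity check, the expected time from a node failing to its pods being evicted is the node monitor grace period plus the applicable toleration. A minimal sketch (the function name is illustrative, not a real Kubernetes API):

```python
def expected_eviction_seconds(node_monitor_grace_period: int,
                              toleration_seconds: int) -> int:
    # The node is marked unhealthy after the grace period; taint-based
    # eviction then waits out the pod's own toleration, or the API
    # Server's default toleration if the pod sets none.
    return node_monitor_grace_period + toleration_seconds

# With upstream defaults: 40s grace period + 300s default tolerations
print(expected_eviction_seconds(40, 300))  # 340 seconds
```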
@@ -57,12 +59,14 @@ services so pods from failed node won't be accessible anymore.

 If `--node-status-update-frequency` is set to **4s** (10s is default),
 `--node-monitor-period` to **2s** (5s is default),
 `--node-monitor-grace-period` to **20s** (40s is default), and
-`--pod-eviction-timeout` is set to **30s** (5m is default)
+`--default-not-ready-toleration-seconds` and `--default-unreachable-toleration-seconds` are set to **30**
+(300 seconds is the default). Note that these two values must be plain integers representing seconds (no
+"s" or "m" suffix for seconds/minutes is given).

 In such a scenario, pods will be evicted in **50s** because the node will be
-considered as down after **20s**, and `--pod-eviction-timeout` occurs after
-**30s** more. However, this scenario creates an overhead on etcd as every node
-will try to update its status every 2 seconds.
+considered down after **20s**, and `--default-not-ready-toleration-seconds` or
+`--default-unreachable-toleration-seconds` takes effect after **30s** more. However, this scenario creates
+an overhead on etcd as every node will try to update its status every 2 seconds.

 If the environment has 1000 nodes, there will be 15000 node updates per
 minute, which may require large etcd containers or even dedicated nodes for etcd.
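The arithmetic behind both numbers above can be sketched as follows (the helper name is illustrative; one status update per node per update period is an approximation, since failed writes are retried):

```python
def etcd_status_updates_per_minute(nodes: int, update_frequency_s: int) -> int:
    # Each node PATCHes its status roughly once per update period.
    return nodes * (60 // update_frequency_s)

# Fast scenario: 20s to detect the failure + 30s toleration = 50s to eviction
print(20 + 30)                                  # 50
# 1000 nodes reporting every 4s:
print(etcd_status_updates_per_minute(1000, 4))  # 15000
```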
@@ -75,7 +79,8 @@ minute which may require large etcd containers or even dedicated nodes for etcd.

 ## Medium Update and Average Reaction

 Let's set `--node-status-update-frequency` to **20s**,
-`--node-monitor-grace-period` to **2m** and `--pod-eviction-timeout` to **1m**.
+`--node-monitor-grace-period` to **2m**, and `--default-not-ready-toleration-seconds` and
+`--default-unreachable-toleration-seconds` to **60**.

 In that case, Kubelet will try to update the status every 20s. So, there will be 6 * 5
 = 30 attempts before the Kubernetes controller manager considers the node status
 unhealthy. After 1m more it will evict all pods. The total time will be 3m.
@@ -90,9 +95,9 @@ etcd updates per minute.

 ## Low Update and Slow Reaction

 Let's set `--node-status-update-frequency` to **1m**,
-`--node-monitor-grace-period` will set to **5m** and `--pod-eviction-timeout`
-to **1m**. In this scenario, every kubelet will try to update the status every
-minute. There will be 5 * 5 = 25 attempts before unhealthy status. After 5m,
+`--node-monitor-grace-period` to **5m**, and `--default-not-ready-toleration-seconds` and
+`--default-unreachable-toleration-seconds` to **60**. In this scenario, every kubelet will try to update the
+status every minute. There will be 5 * 5 = 25 attempts before unhealthy status. After 5m,
 the Kubernetes controller manager will set the unhealthy status. Pods will then be
 evicted 1m after being marked unhealthy (6m in total).
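The scenarios above can be checked with a small sketch. The document's arithmetic (6 * 5 and 5 * 5) implies `nodeStatusUpdateRetry` = 5; the function name is illustrative:

```python
NODE_STATUS_UPDATE_RETRY = 5  # implied by the 6 * 5 and 5 * 5 arithmetic above

def scenario(update_frequency_s: int, grace_period_s: int,
             toleration_s: int) -> tuple:
    """Return (status-check attempts before unhealthy, total seconds to eviction)."""
    attempts = (grace_period_s // update_frequency_s) * NODE_STATUS_UPDATE_RETRY
    return attempts, grace_period_s + toleration_s

print(scenario(20, 120, 60))  # medium: (30, 180) -> 3m total
print(scenario(60, 300, 60))  # slow:   (25, 360) -> 6m total
```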


@@ -30,7 +30,8 @@ For a large scaled deployments, consider the following configuration changes:

 * Tune ``kubelet_status_update_frequency`` to increase reliability of kubelet.
   ``kube_controller_node_monitor_grace_period``,
   ``kube_controller_node_monitor_period``,
-  ``kube_controller_pod_eviction_timeout`` for better Kubernetes reliability.
+  ``kube_apiserver_pod_eviction_not_ready_timeout_seconds`` &
+  ``kube_apiserver_pod_eviction_unreachable_timeout_seconds`` for better Kubernetes reliability.
   Check out [Kubernetes Reliability](kubernetes-reliability.md)
 * Tune network prefix sizes. Those are ``kube_network_node_prefix``,


@@ -86,9 +86,10 @@ audit_webhook_batch_max_wait: 1s
 kube_controller_node_monitor_grace_period: 40s
 kube_controller_node_monitor_period: 5s
-kube_controller_pod_eviction_timeout: 5m0s
 kube_controller_terminated_pod_gc_threshold: 12500
 kube_apiserver_request_timeout: "1m0s"
+kube_apiserver_pod_eviction_not_ready_timeout_seconds: "300"
+kube_apiserver_pod_eviction_unreachable_timeout_seconds: "300"

 # 1.10+ admission plugins
 kube_apiserver_enable_admission_plugins: []
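To shorten failover cluster-wide, these defaults can be overridden in the inventory's group vars. A sketch, assuming the usual Kubespray inventory layout (the exact file path and the value **60** are illustrative; note the values are quoted integer seconds):

```yaml
# e.g. inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml (path is an assumption)
kube_apiserver_pod_eviction_not_ready_timeout_seconds: "60"
kube_apiserver_pod_eviction_unreachable_timeout_seconds: "60"
```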


@@ -100,6 +100,12 @@ certificatesDir: {{ kube_cert_dir }}
 imageRepository: {{ kube_image_repo }}
 apiServer:
   extraArgs:
+{% if kube_apiserver_pod_eviction_not_ready_timeout_seconds is defined %}
+    default-not-ready-toleration-seconds: "{{ kube_apiserver_pod_eviction_not_ready_timeout_seconds }}"
+{% endif %}
+{% if kube_apiserver_pod_eviction_unreachable_timeout_seconds is defined %}
+    default-unreachable-toleration-seconds: "{{ kube_apiserver_pod_eviction_unreachable_timeout_seconds }}"
+{% endif %}
 {% if kube_api_anonymous_auth is defined %}
     anonymous-auth: "{{ kube_api_anonymous_auth }}"
 {% endif %}
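With both variables left at their default of "300", the template above should render a kubeadm config fragment along these lines (a sketch of the expected output, not captured from a real run):

```yaml
apiServer:
  extraArgs:
    default-not-ready-toleration-seconds: "300"
    default-unreachable-toleration-seconds: "300"
```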
@@ -256,7 +262,6 @@ controllerManager:
   extraArgs:
     node-monitor-grace-period: {{ kube_controller_node_monitor_grace_period }}
     node-monitor-period: {{ kube_controller_node_monitor_period }}
-    pod-eviction-timeout: {{ kube_controller_pod_eviction_timeout }}
     node-cidr-mask-size: "{{ kube_network_node_prefix }}"
     profiling: "{{ kube_profiling }}"
     terminated-pod-gc-threshold: "{{ kube_controller_terminated_pod_gc_threshold }}"