9035: Make Cilium rolling-restart delay/timeout configurable (#9176)
See #9035
parent ab938602a9
commit bbd1161147
3 changed files with 35 additions and 2 deletions
@@ -153,3 +153,32 @@ cilium_hubble_metrics:
 ```

 [More](https://docs.cilium.io/en/v1.9/operations/metrics/#hubble-exported-metrics)
+
+## Upgrade considerations
+
+### Rolling-restart timeouts
+
+Cilium relies on the kernel's BPF support, which is extremely fast at runtime but incurs a compilation penalty on initialization and update.
+
+As a result, the Cilium DaemonSet pods can take a long time to start, and that time grows with the number of nodes and endpoints in your cluster.
+
+`cluster.yml` restarts this DaemonSet, and Kubespray's [default timeouts for this operation](../roles/network_plugin/cilium/defaults/main.yml)
+may be too short for large clusters.
+
+You will likely want to raise these timeouts to match your cluster's node count and the CPU performance of those nodes.
+The timeouts are controlled by the following variables:
+
+```yaml
+# Configure how long to wait for the Cilium DaemonSet to be ready again
+cilium_rolling_restart_wait_retries_count: 30
+cilium_rolling_restart_wait_retries_delay_seconds: 10
+```
+
+The total time allowed (count * delay) should be at least `($number_of_nodes_in_cluster * $cilium_pod_start_time)` for successful rolling updates. A higher
+value only lengthens the wait when a restart does not complete, so give yourself a time buffer to accommodate transient slowdowns.
+
+Note: To find the `$cilium_pod_start_time` for your cluster, restart a Cilium pod on a node of your choice and measure how long it takes
+to become ready.
+
+Note 2: The default CPU requests/limits for Cilium pods are set to a very conservative 100m:500m, which will likely make Cilium pods start very slowly. You
+probably want to significantly increase the CPU limit in particular if short bursts of CPU usage from Cilium are acceptable to you.
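As a rough illustration of the sizing rule in the documentation hunk (count * delay should be at least number of nodes * pod start time), the following Python sketch derives a retry count from a node count and a measured pod start time. The function name `retries_needed` and the `buffer_factor` safety margin are hypothetical helpers introduced here, not part of Kubespray, and the 50-node, 20-second figures are made-up inputs.

```python
import math


def retries_needed(node_count: int, pod_start_seconds: float,
                   delay_seconds: int = 10, buffer_factor: float = 1.5) -> int:
    """Return a retry count such that count * delay covers the rolling restart.

    Applies the rule of thumb from the docs: total wait >= nodes * pod start time.
    buffer_factor is a hypothetical safety margin, not a Kubespray setting.
    """
    if node_count < 1 or pod_start_seconds <= 0 or delay_seconds < 1:
        raise ValueError("node_count, pod_start_seconds and delay_seconds must be positive")
    required_seconds = node_count * pod_start_seconds * buffer_factor
    # Round up so that count * delay never falls short of the required time.
    return math.ceil(required_seconds / delay_seconds)


if __name__ == "__main__":
    # Hypothetical cluster: 50 nodes, 20 s per Cilium pod, default 10 s delay.
    count = retries_needed(node_count=50, pod_start_seconds=20)
    print(f"cilium_rolling_restart_wait_retries_count: {count}")
    print("cilium_rolling_restart_wait_retries_delay_seconds: 10")
```

Under this rule, the defaults (30 retries, 10 s delay, 300 s in total) cover roughly 15 nodes at 20 s per pod with no buffer, which is one way to see why larger clusters need higher values.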
@@ -236,3 +236,7 @@ cilium_enable_bpf_clock_probe: true

 # -- Whether to enable CNP status updates.
 cilium_disable_cnp_status_updates: true
+
+# Configure how long to wait for the Cilium DaemonSet to be ready again
+cilium_rolling_restart_wait_retries_count: 30
+cilium_rolling_restart_wait_retries_delay_seconds: 10
@@ -14,8 +14,8 @@
   command: "{{ kubectl }} -n kube-system get pods -l k8s-app=cilium -o jsonpath='{.items[?(@.status.containerStatuses[0].ready==false)].metadata.name}'"  # noqa 601
   register: pods_not_ready
   until: pods_not_ready.stdout.find("cilium")==-1
-  retries: 30
-  delay: 10
+  retries: "{{ cilium_rolling_restart_wait_retries_count | int }}"
+  delay: "{{ cilium_rolling_restart_wait_retries_delay_seconds | int }}"
   failed_when: false
   when: inventory_hostname == groups['kube_control_plane'][0]
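To make the effect of `retries` and `delay` in this task concrete, here is a minimal Python sketch of the polling behaviour: the check repeats until the list of not-ready pods no longer contains "cilium", or until the retry budget runs out. It is a simplified model and not Ansible's implementation. `wait_for_cilium` and `get_not_ready` are hypothetical names, with `get_not_ready` standing in for the `kubectl` jsonpath query, and the exact number of attempts Ansible makes may differ by one depending on the version.

```python
import time
from typing import Callable


def wait_for_cilium(get_not_ready: Callable[[], str], retries: int, delay: float,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no not-ready pod name contains "cilium", or retries run out.

    get_not_ready stands in for the kubectl jsonpath query used by the task.
    Returns True if all Cilium pods became ready in time, False otherwise.
    """
    for attempt in range(retries):
        # Same condition as the task's until: stdout.find("cilium") == -1
        if "cilium" not in get_not_ready():
            return True
        # Wait between attempts, but not after the final one.
        if attempt < retries - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated query results: two pods not ready, then one, then none.
    outputs = iter(["cilium-abc12 cilium-def34", "cilium-def34", ""])
    waited = []
    ok = wait_for_cilium(lambda: next(outputs), retries=30, delay=10, sleep=waited.append)
    print(ok, sum(waited))
```

Because the loop exits as soon as the condition holds, a larger retry count only costs time when pods never become ready. The task also sets `failed_when: false`, so a timeout here does not appear to abort the playbook.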