- Use builtin task scheduling of ansible (same task on each host)
instead of manual looping on master
Benefits:
- One less play in remove-node.yml playbook
- Parallel node drain
- Drain parameters (timeout, grace period, retries,
allow_ungraceful_removal) can be adjusted separately for each node
with ansible variables
* Ensure entries for 1.23 are added for supported_versions vars
* cri-o: add support for kubernetes 1.23 but still use cri-o 1.22
* kubescheduler-config: differentiate config versions based on kube_version
* registry: service add clusterIP, nodePort, loadBalancer support
* modify camelcase name to underscore
* Add registry service type compatibility check
* containerd: change default resolvconf_mode to host_resolvconf
* Wait for kube-apiserver to come back after pod refresh
* Handle resolv.conf gracefully
* Retain currently configured DNS entries to ensure we don't break the resolvers
* Suse uses wickedd for network management so no dhcp hooks
* Molecule: increase ansible timeout
* CI: Increase ansible timeout to 120s for Packet jobs
* Improve control plane scale flow (#13)
* Added version 1.20.10 of K8s
* Setting first_kube_control_plane to an existing one
* change first_kube_master for first_kube_control_plane
* Ansible-lint changes
Pulling the k8scsi/csi-resizer image from gcr.io fails with an error
like:
$ docker pull gcr.io/k8scsi/csi-resizer:v1.0.0
Error response from daemon: Head https://gcr.io/v2/k8scsi/csi-resizer/
manifests/v1.0.0: unknown: Project 'project:k8scsi' not found or deleted.
$
We can pull the image from quay.io instead.
This fixes the issue.
* containerd: add hashes for 1.5.8 and 1.4.12 and make 1.5.8 the new default
* containerd: make nerdctl mandatory for container_manager = containerd
* nerdctl: bump to version 0.14.0
* containerd: use nerdctl for image manipulation
* OpenSuSE: install basic nerdctl dependencies
* set ingress-nginx default terminationGracePeriodSeconds to 5 min for the drain of connection
* Add ingress_nginx_termination_grace_period_seconds at sample inventory
* Defaults: replace docker with containerd as our default container_manager
* CI: Use docker for download_localhost test
* Defaults: with container_manager=containerd we need etcd_deployment_type=host
* CI: Run weave jobs with docker
* CI: Vagrant don't download_force_cache
* CI: Fix upgrade tests
* should run compatible with old settings, this means docker
* we need to run with a distro that has at least modern containerd,
this means move from debian9 to debian10 to allow `containerd_version`
to match between 2.17 and master
* add metallb auto-assign property for main IP range & update addons.yml for sample inventory
* add new line at the end of file roles\kubernetes-apps\metallb\defaults\main.yml
* set default value for metallb_auto_assign = true
* Kata-containers: Fix for ubuntu and centos, where kata containers sometimes fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults
Make CI tests actually pass
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
* Limit kubectl delete node to k8s nodes
This avoids the use of `kubectl delete node` when removing etcd nodes
which are not part of the cluster (separate etcd)
* Take errors into account when deleting node
There should not be any errors now that we're limiting the deletion to nodes
actually in the cluster
* Retrying on error
* Ensure addon-resizer 1.8.11 is only used on amd64.
k8s.gcr.io/addon-resizer:1.8.11 returns the amd64 image, which is not executable on arm64.
Disable addon-resizer when the platform is not amd64.
When metrics-server is upgraded to use addon-resizer:2.3, revert this
commit so that `image_arch` determines the `addon_resizer_image_tag`.
* Add metrics_server_resizer architectures check
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* Containerd: download containerd from upstream instead of using distro specific packages
split runc download to separate role
make bootstrap-os role deploy container-selinux and seccomp libraries
clean up package manager provided containerd
move variables to docker role that are no longer common with containerd
* Containerd: make molecule testing more relevant
* replace ubuntu18 with ubuntu20
* add centos8 and debian11 to molecule tests
* run kubernetes/preinstall role to ensure relevancy
of test including dependency packages
* CI: adjust test scenarios for downloaded containerd
A kube-bench scan outputs warnings related to Calico like:
* text: "Ensure that the Container Network Interface file
permissions are set to 644 or more restrictive (Manual)"
* text: "Ensure that the Container Network Interface file
ownership is set to root:root (Manual)"
This fixes these warnings.
* netchecker: update images to 1.2.2 from Mirantis which is slightly less ancient than the l23networks images
* Netchecker: use local etcd instead of kubernetes v1beta1 crds which are no longer supported by kube 1.22+
* Add Rocky as a known OS
* Make sure Rocky includes bootstrap-centos.yml
* Update docs with Rocky Linux
* Rocky Linux wireguard and EPEL
* Rocky Linux in the list of supported distributions
If the etcd cluster is separate and the etcd_deployment_type is "host",
there is no need for a container engine on the etcd nodes
Do not rely on a 'default(true)' filter, but define a proper default in
kubespray-defaults depending on etcd deployment method and if internal
or external etcd is used
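The rule described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the actual kubespray-defaults expression: the function name and its two parameters are hypothetical, and "internal" etcd is assumed to mean that the host is also a member of the Kubernetes cluster.

```python
def needs_container_engine(in_k8s_cluster: bool, etcd_deployment_type: str) -> bool:
    """Decide whether a host needs a container engine (hypothetical helper).

    Assumption: a host needs an engine if it runs Kubernetes workloads, or if
    etcd itself is deployed in a container (anything other than "host").
    """
    return in_k8s_cluster or etcd_deployment_type != "host"


# Separate etcd node with host-deployed etcd: no engine needed.
print(needs_container_engine(in_k8s_cluster=False, etcd_deployment_type="host"))
# etcd co-located with a cluster node: the engine is still required.
print(needs_container_engine(in_k8s_cluster=True, etcd_deployment_type="host"))
```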
to remove deprecation warning:
> Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Kubespray deployment failed when using the containerd backend on nodes where apparmor was not installed or had previously been removed. This PR ensures apparmor is installed by adding it to the required_pkgs var.
The addon-resizer container can reduce resource limits of cpu and
memory of metrics-server container in the pod, and that caused
OOMKilled.
In addition, the original metrics-server manifest doesn't contain
the addon-resizer container as [1].
So this adds metrics_server_resizer option to control the addon-resizer
container deployment and the default value is false to make it stable
for most environments.
[1]: 527679e5e8/manifests/base/deployment.yaml
"allowPrivilegeEscalation: false" blocks deploying metrics-server
on CentOS7. In addition, the original metrics-server manifest doesn't
contain it as [1]. This removes it.
[1]: 527679e5e8/manifests/base/deployment.yaml
* Kata-Containers: add 2.2.0 hashes and make default
* Kata-Containers: replace 2.1.0 with bugfix version 2.1.1
* Kata-Containers: move to q35 a more modern VM architecture as 'pc' is removed in 2.2.0
The path of kubeconfig should be configurable, and its default value
is /etc/kubernetes/admin.conf. Most paths of the file are configurable
but some were not. This makes those configurable.
* Calico: make calico_min_version check relevant
* Calico: only check currently installed version against the oldest supported version by the previous release
On Debian 11, `ipset` only recommends `iptables`, so on systems where apt is configured with `APT::Install-Recommends "0";`, iptables is not installed automatically.
* Fix missing file mode (risky-file-permissions)
Found this using ansible-lint.
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* Fix another missing file mode (risky-file-permissions)
This one fixes `/etc/crio/config.json`
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* CSI: update CSI snapshot CRDs
* CSI: update snapshot controller tag version with kubernetes specific versions
* CSI: allow enabling csi_snapshot_controller independent of Cinder CSI
* CSI: Align csi-snapshot-controller with upstream and use a Deployment instead of a StatefulSet
When using Calico with:
- `calico_network_backend: vxlan`,
- `calico_ipip_mode: "Never"`,
- `calico_vxlan_mode: "Always"`,
the `FelixConfiguration` object has `ipipEnabled: true`, when it should be false:
This is caused by an error in the `| bool` conversion in the install task:
the filter binds more tightly than `!=`, so when `calico_ipip_mode` is `Never`,
`{{ calico_ipip_mode != 'Never' | bool }}` compares against `'Never' | bool` and evaluates to `true`.
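The precedence problem can be reproduced outside of Ansible. The following plain-Python sketch emulates the grouping Jinja2 applies (filters before comparison operators); `ansible_bool` is a hypothetical stand-in that only approximates Ansible's `bool` filter, and it is not the code of the install task itself.

```python
def ansible_bool(value):
    """Approximation of Ansible's `bool` filter, for illustration only."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.lower()
    return value in ("yes", "on", "1", "true", 1)


calico_ipip_mode = "Never"

# Filters bind tighter than `!=`, so the buggy template is grouped as:
#   calico_ipip_mode != ('Never' | bool)
buggy = calico_ipip_mode != ansible_bool("Never")

# Parenthesising the comparison applies the filter to its result instead:
#   (calico_ipip_mode != 'Never') | bool
fixed = ansible_bool(calico_ipip_mode != "Never")

print(buggy, fixed)  # True False
```

In the buggy form the string `"Never"` is compared with the boolean `False`, which is always unequal, so the result is true regardless of the configured mode.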
* Fedora and RHEL use etc_t and the convention is <type_name>_t
* Docs: specify all values for preinstall_selinux_state
* CI: Add Fedora 34 with SELinux in enforcing mode
Fix the task 'Cert Manager | Wait for Webhook pods become ready', which failed because the webhook pods did not exist yet, by using the `retries..until` trick as in kubernetes-sigs/kubespray#7842
This fix should be removed in the future if kubernetes/kubernetes#83242 is resolved.
Signed-off-by: rtsp <git@rtsp.us>
Fix the task 'Cert Manager | Apply ClusterIssuer manifest', which failed because the service/endpoints update is delayed even though the webhook pod status is ready.
Signed-off-by: rtsp <git@rtsp.us>
Changes:
* ClusterRole updated according to the latest manifests from
https://github.com/kubernetes/cloud-provider-vsphere
* vSphere CPI/CSI default versions bumped and
tested successfully on K8S 1.21.1
* vSphere documentation updated
Signed-off-by: Vitaliy D <vi7alya@gmail.com>
* CRI-O: Install libseccomp2 from backports on Debian 10
libseccomp2 is a required dependency of cri-o-runc package
The one provided in Debian 10 repositories is outdated
* 7816: Remove useless when condition
As this condition is handled by block
To download necessary files in advance for offline deployment,
we can see all file URLs with contrib/offline/generate_list.sh
Most URLs are downloadable, but gvisor's one is not because the
URL is a part of full URLs for gvisor.
To download gvisor's files from the URLs directory, this separates
into two URLs for runsc and the shim.
* csi-driver: Added possibility to use application credentials for cinder
* external-cloud-controller: Added env vars for openstack application credentials
* set selinux type t_etc if selinux state is enforcing
* workaround with update repo is no longer needed
remove comments about failing playbook
* grubby is not available in distros using ostree
* remove docker support because removed in fcos
update install script example with live rootfs
* do not call grubby on ostree based distro
* update docs enabling containerd on fedora coreos
* Ansible: move to Ansible 3.4.0 which uses ansible-base 2.10.10
* Docs: add a note about ansible upgrade post 2.9.x
* CI: ensure ansible is removed before ansible 3.x is installed to avoid pip failures
* Ansible: use newer ansible-lint
* Fix ansible-lint 5.0.11 found issues
* syntax issues
* risky-file-permissions
* var-naming
* role-name
* molecule tests
* Mitogen: use 0.3.0rc1 which adds support for ansible 2.10+
* Pin ansible-base to 2.10.11 to get package fix on RHEL8
* Calico: align manifests with upstream
* allow enabling typha prometheus metrics
* Calico: enable eBPF support
* manage the kubernetes-services-endpoint configmap
* Calico: document the use of eBPF dataplane
* Calico: improve checks before deployment
* enforce disabling kube-proxy when using eBPF dataplane
* ensure calico_version is supported
* Kata: add Kata 2.x checksums and adjust download urls for 2.x
* Kata: drop 1.x version which is no longer supported
* Kata: set default version 2.1.0
* Calico: add v3.19.1 hashes
* enable liveness probe for calico-kube-controllers
3.19.1
* Calico: drop support for v3.16.x
* Calico: promote v3.18.3 as default
* Override the default value of containerd's root, state, and oom_score configurations
* Add tests data for containerd_storage_dir, containerd_state_dir and containerd_oom_score variables
* add support for using ansible 2.10.x for deploying kubespray
* move dns-autoscaler-clusterrole{binding}.yml to files/ folder
* note that ansible 2.10 is now experimentally supported
* coredns: move files to templates like before #4341
* add initial MetalLB docs
* metallb allow disabling the deployment of the metallb speaker
* calico>=3.18 allow using calico to advertise service loadbalancer IPs
* Document the use of MetalLB and Calico
* clean MetalLB docs
Since K8S 1.21, BoundServiceAccountTokenVolume feature gate is in beta stage, thus activated by default (anyone who follows CSI guidelines has enabled AllAlpha and faced the issue before 1.21).
With this feature, SA tokens are regenerated every hour.
As a consequence for Calico CNI, the token in /etc/cni/net.d/calico-kubeconfig, copied from /var/run/secrets/kubernetes.io/serviceaccount by the install-cni initContainer, expires after one hour and any pod creation then fails with an authorization error.
Calico pods need to be restarted so that /etc/cni/net.d/calico-kubeconfig is updated with the new SA token.
Follow the new naming conventions for gcr's coredns image.
Starting from 1.21, kubeadm assumes it to be `coredns/coredns`:
this leaves the kubeadm deployment unable to pull the image, because `v`
was also added to the image tag, until the role `kubernetes-apps` overrides
it with the old name, which is only compatible with <=1.7.
Backward compatibility with kubeadm <=1.20 is maintained by checking
the kubernetes version and falling back to the old names (`coredns:1.xx`) when
the version is less than 1.21.
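The version-based fallback can be sketched as follows. The function names, the default registry value and the example versions are assumptions made for illustration; the real change lives in the Ansible defaults and templates, not in Python.

```python
def parse_version(version: str) -> tuple:
    """Turn 'v1.21.0' or '1.21.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def coredns_image(kube_version: str, coredns_version: str,
                  registry: str = "k8s.gcr.io") -> str:
    """Pick the coredns image reference for a given Kubernetes version.

    Assumption: from 1.21 on the image is `coredns/coredns` with a `v`-prefixed
    tag; older releases use the flat `coredns` name without the prefix.
    """
    if parse_version(kube_version) >= (1, 21):
        return f"{registry}/coredns/coredns:v{coredns_version}"
    return f"{registry}/coredns:{coredns_version}"


print(coredns_image("v1.21.0", "1.8.0"))  # k8s.gcr.io/coredns/coredns:v1.8.0
print(coredns_image("v1.20.7", "1.7.0"))  # k8s.gcr.io/coredns:1.7.0
```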
* rename ansible groups to use _ instead of -
k8s-cluster -> k8s_cluster
k8s-node -> k8s_node
calico-rr -> calico_rr
no-floating -> no_floating
Note: kube-node,k8s-cluster groups in upgrade CI
need clean-up after v2.16 is tagged
* ensure old groups are mapped to the new ones
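As a rough illustration of what mapping old groups to new ones means, the Python sketch below merges hosts from the legacy group names listed above into their renamed groups. The helper name and the dict-based inventory shape are hypothetical; the actual mapping is done inside the playbooks.

```python
# Legacy group name -> new group name, as listed in the commit message.
GROUP_RENAMES = {
    "k8s-cluster": "k8s_cluster",
    "k8s-node": "k8s_node",
    "calico-rr": "calico_rr",
    "no-floating": "no_floating",
}


def map_legacy_groups(groups: dict) -> dict:
    """Return a new inventory dict with legacy groups merged into new names."""
    mapped = {}
    for name, hosts in groups.items():
        target = GROUP_RENAMES.get(name, name)
        merged = mapped.setdefault(target, [])
        for host in hosts:
            if host not in merged:  # avoid duplicates when both names exist
                merged.append(host)
    return mapped


inventory = {"k8s-cluster": ["n1", "n2"], "k8s_cluster": ["n2", "n3"], "etcd": ["n1"]}
print(map_legacy_groups(inventory))
```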
* crio: add supported versions 1.20 and 1.21 and align default with k8s version
* cri-o: drop versions 1.17 and 1.18 from version matrix
* update note on cri-o version alignment
* calico: drop support for version 3.15
* drop check for calico version >= 3.3, we are at 3.16 minimum now
* we moved to calico 3.16+ so we can default to /opt/cni/bin/install
* AlmaLinux: ansible>2.9.19 is needed to know about AlmaLinux
* AlmaLinux: identify as a centos derivative
* AlmaLinux: add AlmaLinux to checks for CentOS
* Use ansible_os_family to compare family and not distribution
According to the official documentation [1], the parameter keepcache should be
'0' or '1' as a string. To avoid the following warning message,
this fixes the parameter value:
[WARNING]: The value False (type bool) in a string field was
converted to u'False' (type string). If this does not look
like what you expect, quote the entire value to ensure it
does not change.
[1]: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_repository_module.html
* Add containerd_extra_args
This is useful for custom containerd config, e.g. auth
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
* Make containerd config.toml mode 0640
It may contain sensitive information like password
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
This PR is to move the cilium kvstore options to the configmap
rather than specifying them in the deployment as args. This
is not technically necessary but keeping all the options in
one place is probably not a bad idea.
Tested with cilium 1.9.5.
When attempting a fresh install without cilium_ipsec_enabled I ran
into the following error:
failed: [k8m01] (item={'name': 'cilium', 'file': 'cilium-secret.yml', 'type': 'secret', 'when': 'cilium_ipsec_enabled'}) =>
{"ansible_loop_var": "item", "changed": false, "item": {"file": "cilium-secret.yml", "name": "cilium", "type": "secret",
"when": "cilium_ipsec_enabled"},"msg": "AnsibleUndefinedVariable: 'cilium_ipsec_key' is undefined"}
Moving the when condition from the item level to the task level solved
the issue.
* Add KubeSchedulerConfiguration for k8s 1.19 and up
With the release of version 1.19.0 of kubernetes, KubeSchedulerConfiguration
graduated to beta. It allows extending different stages of
scheduling with profiles. This is achieved by using plugins and
extensions.
This patch adds KubeSchedulerConfiguration for versions 1.19 and later.
Configuration is set to k8s defaults or to kubespray vars. Moving those
defaults to new vars will be done in following patch.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* KubeSchedulerConfiguration: add defaults
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
Starting with Cilium v1.9 the default ipam mode has changed to "Cluster
Scope". See:
https://docs.cilium.io/en/v1.9/concepts/networking/ipam/
With this ipam mode Cilium handles assigning subnets to nodes to use
for pod ip addresses. The default Kubespray deploy uses the Kube
Controller Manager for this (the --allocate-node-cidrs
kube-controller-manager flag is set). This makes the proper ipam mode
for kubespray using cilium v1.9+ "kubernetes".
Tested with Cilium 1.9.5.
This PR also mounts the cilium-config ConfigMap for this variable
to be read properly.
In the future we can probably remove the kvstore and kvstore-opt
Cilium Operator args since they can be in the ConfigMap. I will tackle
that after this merges.
When upgrading cilium from 1.8.8 to 1.9.5 I ran into the following
error:
level=error msg="Unable to update CRD" error="customresourcedefinitions.apiextensions.k8s.io
\"ciliumnodes.cilium.io\" is forbidden: User \"system:serviceaccount:kube-system:cilium-operator\"
cannot update resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the
cluster scope" name=CiliumNode/v2 subsys=k8s
The fix was to add the update verb to the clusterrole. I also added
create to match the clusterrole created by the cilium helm chart.
DNSSEC is off by default on ubuntu/bionic64 (18.04) as per resolved.conf(5).
These tasks are artefacts of obsolete infra configuration, and no longer needed.
Furthermore, removing these tasks resolves the issue that the tasks always report
'changed' and bounce systemd-resolved unnecessarily, even if there was no
actual modification of /etc/systemd/resolved.conf.
* Remove contrib/vault
This is marked as broken since 2018 / 3dcb914607
This still references apiserver.pem, not used since ddffdb63bf
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Finish nuking vault from the codebase
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This replaces kube-master with kube_control_plane because of [1]:
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
NOTE: The reason why this changes it to kube_control_plane not
kube-control-plane is for valid group names on ansible.
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
While at it remove force_certificate_regeneration
This boolean only forced the renewal of the apiserver certs
Either manually use k8s-certs-renew.sh or set auto_renew_certificates
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add crun download_url and checksum
* Change versioning format to crun native versioning
* Download crun using download_file.yml
* Get crun version from download defaults
* Delegate crun binary copy task to crun role
* Download Calico KDD CRDs
* Replace kustomize with lineinfile and use ansible assemble module
* Replace find+lineinfile by sed in shell module to avoid nested loop
* add condition on sed
* use block for kdd tasks + remove supernumerary kdd manifest apply in start "Start Calico resources"
* add nodeselector and tolerations for metallb
* remove unnecessary commented lines in metallb template
* set default speaker toleration to match original manifest
When privileged is enabled for a container, all the `/dev/*` block
devices from the host are mounted into the guest. The
`privileged_without_host_devices` flag prevents host devices from
being passed to privileged containers.
More information:
* https://github.com/containerd/cri/pull/1225
* 1d0f68156b
The important action in kubeadm-version.yml is the templating of the configuration,
not finding / setting the version
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
kubeadm has been the default for a long time now,
and admin.conf is created by it, so let kubeadm handle it
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Using `kubeadm init phase kubeconfig all` breaks kubelet client certificate rotation
as we are missing `kubeadm init phase kubelet-finalize all` to point to `kubelet-client-current.pem`
kubeconfig format is stable so let's just use lineinfile,
this will avoid other future breakage
This reverts to the logic before 6fe2248314
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
On CentOS 8 they seem to be ignored by default, but better be extra safe
This also makes it easy to exclude other network plugin interfaces
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* use external_openstack_lbaas_use_octavia for template openstack-cloud-config
* Delete external_openstack_lbaas_use_octavia from default values. Added description and default values of variables to docs
* markdown fix
* make this simple
* set external_openstack_lbaas_use_octavia in default values
* duplicated variable in doc
Since a790935d02 all proxy users
should be properly configured
Now when you have *_PROXY vars in your environment it can lead to failure
if NO_PROXY is not correct, or to persistent configuration changes
as seen with kubeadm in 1c5391dda7
Instead of playing constant whack-a-bug, inject empty *_PROXY vars everywhere
at the play level, and override at the task level when needed
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Move proxy_env to kubespray-defaults/defaults
There is no reason to use set_fact here
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Ensure kubeadm doesn't use proxy
*_proxy variables might be present in the environment (/etc/environment, bash profile, ...)
When this is the case we end up with those proxy configuration in /etc/kubernetes/manifests/kube-*.yaml manifests
We cannot unset env variables, but kubeadm is nice enough to ignore empty vars
93d288e2a4/cmd/kubeadm/app/util/env.go (L27)
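The idea of overriding rather than unsetting proxy variables can be sketched in Python. The helper names below are hypothetical, and `effective_proxy_settings` only emulates a consumer that skips empty values, which is the behaviour the message attributes to kubeadm; it is not kubeadm's code.

```python
import os

PROXY_VARS = ("http_proxy", "https_proxy", "no_proxy",
              "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def env_without_proxy(base_env=None):
    """Copy an environment and blank out every proxy variable."""
    env = dict(os.environ if base_env is None else base_env)
    for name in PROXY_VARS:
        env[name] = ""  # override instead of unsetting
    return env


def effective_proxy_settings(env):
    """Emulate a consumer that ignores empty proxy variables (assumption)."""
    return {name: env[name] for name in PROXY_VARS if env.get(name)}


env = env_without_proxy({"http_proxy": "http://proxy:3128", "PATH": "/usr/bin"})
print(effective_proxy_settings(env))  # {}
```

A dict like `env` could then be passed as the `env` argument of `subprocess.run` so the child process sees only the blanked variables.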
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Ubuntu 18.04 crio package ships with 'mountopt = "nodev,metacopy=on"'
even if GA kernel is 4.15 (HWE Kernel can be more recent)
Fedora package ships without metacopy=on
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
By default the Ansible stat module computes the checksum, lists extended attributes and finds the mime type
To find all stat invocations that really use one of those:
git grep -F stat. | grep -vE 'stat.(islnk|exists|lnk_source|writeable)'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
`containerd.io` is the companion package of `docker-ce` and is the
proper package name. This is needed to avoid apt upgrade/dist-upgrade
from breaking kubernetes.
When running the remove-node.yml tasks to clean up a cluster on Fedora CoreOS,
the task failed to restart the network daemon (task name: "reset | Restart network").
Fedora CoreOS uses NetworkManager, but this task returns network.
Signed-off-by: Takashi IIGUNI <iiguni.tks@gmail.com>
* Add unique annotation on coredns deployment and only remove existing deployment if annotation is missing.
* Ignore errors when gathering coredns deployment details to handle case where it doesn't exist yet
* Remove run_once, delegate_to and add to when statement
* Added force_etcd_cert_refresh var to maintain existing functionality. Broke out etcd node cert syncing from member and admin cert sync logic. Now first etcd will sync node certs to other etcd members on every run to keep all etcds up to date after adding additional worker nodes to the cluster
* Updated etcd cert check tasks to better detect when new certificates need to be generated
* Move usage of force_etcd_cert_refresh var to gen_certs fact set
* Force etcd cert generation per server if force_etcd_cert_refresh is set to true
* Include gathering of node certs even if k8s-cluster member and in etcd group.
* Removed run_once due to when statement
Helm v3.5.2 is a security (patch) release. Users are strongly
recommended to update to this release. It fixes two security issues in
upstream dependencies and one security issue in the Helm codebase.
See https://github.com/helm/helm/releases/tag/v3.5.2
This makes the docker role work the same as the containerd role.
Being able to override this is needed when you have your own debian
repository. E.g. when performing an airgapped installation
* update local-path-storage config template to version v0.0.19
* changes local_path_provisioner image tag to v0.0.19
* removes copy paste example from rancher local-path-provisioner repo
According to the following recommendation, this moves the directory
to control-plane:
The Kubernetes project is moving away from wording that is considered
offensive. A new working group WG Naming was created to track this work,
and the word "master" was declared as offensive. A proposal was formalized
for replacing the word "master" with "control plane".
Previous check for presence of NM assumed "systemctl show
NetworkManager" would exit with a nonzero status code, which seems not
the case anymore with recent Flatcar Container Linux.
This new check also checks the activeness of network manager, as
`is-active` implies presence.
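A minimal sketch of such a check in Python, assuming `systemctl` is on the PATH. The function name and the injectable `run` parameter are hypothetical and exist only to make the example testable; the role itself performs the check from an Ansible task.

```python
import subprocess


def network_manager_active(run=subprocess.run) -> bool:
    """Return True only if the NetworkManager unit is currently active.

    `systemctl is-active` exits 0 for an active unit and non-zero otherwise,
    so an absent unit is reported as inactive as well.
    """
    try:
        result = run(["systemctl", "is-active", "--quiet", "NetworkManager"],
                     check=False)
    except FileNotFoundError:
        # No systemctl binary at all: treat as "not managed by NetworkManager".
        return False
    return result.returncode == 0
```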
Signed-off-by: Jorik Jonker <jorik@kippendief.biz>
This was introduced in 143e2272ff
Extra repo is enabled by default in CentOS, and is not the right repo for EL8
Instead of adding a CentOS repo to RHEL, enable the needed RHEL repos with rhsm_repository
For RHEL 7, we need the "extras" repo for container-selinux
For RHEL 8, we need the "appstream" repo for container-selinux, ipvsadm and socat
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>