* Added cilium support
* Fix typo in debian test config
* Remove empty lines
* Changed cilium version from <latest> to <v1.0.0-rc3>
* Add missing changes for cilium
* Add cilium to CI pipeline
* Fix wrong file name
* Check kernel version for cilium
* fixed ci error
* fixed cilium-ds.j2 template
* added waiting for cilium pods to run
* Fixed missing EOF
* Fixed trailing spaces
* Fixed trailing spaces
* Fixed trailing spaces
* Fixed too many blank lines
* Updated tolerations, annotations in cilium DS template
* Set cilium_version to iptables-1.9 to see if bug is fixed in CI
* Update cilium image tag to v1.0.0-rc4
* Update Cilium test case CI vars filenames
* Add optional prometheus flag, adjust initial readiness delay
* Update README.md with cilium info
When etcd exceeds its memory limit, it becomes useless but keeps running.
We should let the OOM killer kill the etcd process in the container, so systemd can spot
the problem and restart etcd according to the "Restart" setting in the etcd.service unit file.
If the OOM problem keeps repeating, i.e. it happens on every single restart,
systemd will eventually back off and stop restarting it anyway.
--restart=on-failure:5 in this file has no effect because a memory allocation error
does not by itself cause the process to die.
Related: https://github.com/kubernetes-incubator/kubespray/blob/master/roles/etcd/templates/etcd-docker.service.j2
This partially reverts a change introduced in #1860.
Even though kubeadm_token_ttl=0 means the kubeadm token never expires, the token is not present in `kubeadm token list` after the cluster is provisioned (at least after it has been running for some time), and there is an open issue about this (https://github.com/kubernetes/kubeadm/issues/335), so we need to create a new temporary token during the cluster upgrade.
Ansible automatically installs the python-apt package when using
the 'apt' Ansible module, if python-apt is not present. This patch
removes the (unneeded) explicit installation in the Kubespray
'preinstall' role.
In some installations, it can take up to 3 seconds to get the value. Retrying
for 5 seconds ensures the command won't return 1.
Signed-off-by: Sébastien Han <seb@redhat.com>
* allow installs to not have hostname overriden with fqdn from inventory
* calico-config no longer requires a local AS and will default to global
* when cloudprovider is not defined, use the inventory_hostname for cni-calico
* allow reset to not restart network (buggy nodes die with this cmd)
* default kube_override_hostname to inventory_hostname instead of ansible_hostname
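For example, in group vars (a minimal sketch; `kube_override_hostname` is the variable named above, where you set it is up to your inventory layout):

```yaml
# default now follows the inventory hostname rather than ansible_hostname
kube_override_hostname: "{{ inventory_hostname }}"
```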
Cloud resolvers are mandatory for hosts on GCE and OpenStack
clouds. The 8.8.8.8 alternative resolver was dropped because
there is already a default nameserver. The new var name
reflects the purpose better.
Also restart apiserver when modifying dns settings.
If you configure your external loadbalancer to do a simple TCP pass-through to the API servers, and you do not use a DNS FQDN but just the IP, then you need to add the IP address to the certificates too.
Example config:
```
## External LB example config
apiserver_loadbalancer_domain_name: "10.50.63.10"
loadbalancer_apiserver:
  address: 10.50.63.10
  port: 8383
```
Some installations fail to authenticate with peers due to
etcd picking up/resolving the wrong node.
By setting 'etcd_peer_client_auth' to "False" you can disable peer client cert
authentication.
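For example (the variable name is quoted from the text above; where it is set, e.g. group vars, is an assumption):

```yaml
# disable etcd peer client certificate authentication
etcd_peer_client_auth: false
```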
Signed-off-by: Sébastien Han <seb@redhat.com>
kube-proxy is complaining of missing modules at startup. There is a plan
to also support an LVS implementation of kube-proxy in addition to
userspace and iptables.
Auto configure API access endpoint with a custom bind IP, if provided.
Fix HA docs: the http URLs are in fact https; also clarify the insecure vs secure
API access modes.
Closes: #issues/2051
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
Update checksum for kubeadm
Use v1.9.0 kubeadm params
Include hash of ca.crt for kubeadm join
Update tag for testing upgrades
Add workaround for testing upgrades
Remove scale CI scenarios because of slow inventory parsing
in ansible 2.4.x.
Change region for tests to us-central1 to
improve ansible performance
Starting with Kubernetes v1.8.4, kubelet ignores the AWS cloud
provider string and uses the override hostname, which fails
Node admission checks.
Fixes #2094
The search line in /etc/resolv.conf could have
multiple spaces or tabs between domains.
split(' ') will give wrong results in some cases,
use split() without an argument instead.
e.g.
>>> 'domain.tld\tcluster.tld '.split(' ')
['domain.tld\tcluster.tld', '']
>>> 'domain.tld\tcluster.tld '.split()
['domain.tld', 'cluster.tld']
As we have seen with other containers, sometimes container removal fails on the first attempt due to some Docker bugs. Retrying typically corrects the issue.
Use an etcd-initer init container to generate etcd args; it determines
the etcd name by comparing its IP with the etcd cluster IPs. This makes
the etcd configuration independent of the Ansible templating, so adding
master nodes becomes easier.
Putting contiv etcd and etcd-proxy into the same daemonset and managing
the difference with an env file is not good for scaling (adding nodes).
This commit splits them into two daemonsets so that when nodes are added,
k8s can automatically start an etcd-proxy on the new nodes without needing
to run the play that puts the env file in place.
* Remove the network device created by flannel
* Modify flannel.1 device path
* Remove trailing spaces
This allows `kube_apiserver_insecure_port` to be set to 0 (disabled).
Rework of #1937 with kubeadm support
Also, fixed an issue in `kubeadm-migrate-certs` where the old apiserver cert was copied as the kubeadm key
* Allow setting --bind-address for apiserver hyperkube
This is required if you wish to configure a loadbalancer (e.g. haproxy)
running on the master nodes without choosing a different port for the
vip from that used by the API - in this case you need the API to bind to
a specific interface, then haproxy can bind the same port on the VIP:
[root@overcloud-controller-0 ~]# netstat -taupen | grep 6443
tcp 0 0 192.168.24.6:6443 0.0.0.0:* LISTEN 0 680613 134504/haproxy
tcp 0 0 192.168.24.16:6443 0.0.0.0:* LISTEN 0 653329 131423/hyperkube
tcp 0 0 192.168.24.16:6443 192.168.24.16:58404 ESTABLISHED 0 652991 131423/hyperkube
tcp 0 0 192.168.24.16:58404 192.168.24.16:6443 ESTABLISHED 0 652986 131423/hyperkube
This can be achieved e.g. via:
kube_apiserver_bind_address: 192.168.24.16
* Address code review feedback
* Update kube-apiserver.manifest.j2
* Add Contiv support
Contiv is a network plugin for Kubernetes and Docker. It supports
vlan/vxlan/BGP/Cisco ACI technologies. It supports firewall policies,
multiple networks and bridging pods onto physical networks.
* Update contiv version to 1.1.4
Update contiv version to 1.1.4 and added SVC_SUBNET in contiv-config.
* Load openvswitch module to workaround on CentOS7.4
* Set contiv cni version to 0.1.0
Correct contiv CNI version to 0.1.0.
* Use kube_apiserver_endpoint for K8S_API_SERVER
Use kube_apiserver_endpoint as K8S_API_SERVER to make contiv talk
to an available endpoint whether or not there is a loadbalancer.
* Make contiv use its own etcd
Before this commit, contiv used an etcd proxy mode to the k8s etcd;
this works fine when the etcd hosts are co-located with the contiv etcd
proxy. However, the k8s peering certs are only on the etcd group, so
the etcd-proxy is not able to peer with the k8s etcd on the
etcd group. In addition, the netplugin always tries to find the etcd
endpoint on localhost, which causes problems for all netplugins
not running on etcd group nodes.
This commit makes contiv use its own etcd, separate from the k8s one.
On kube-master nodes (where net-master runs), it runs in leader
mode, and on all other nodes it runs in proxy mode.
* Use cp instead of rsync to copy cni binaries
Since rsync has been removed from hyperkube, this commit changes it
to use cp instead.
* Make contiv-etcd able to run on master nodes
* Add rbac_enabled flag for contiv pods
* Add contiv into CNI network plugin lists
* migrate contiv test to tests/files
Signed-off-by: Cristian Staretu <cristian.staretu@gmail.com>
* Add required rules for contiv netplugin
* Better handling json return of fwdMode
* Make contiv etcd port configurable
* Use default var instead of templating
* roles/download/defaults/main.yml: use contiv 1.1.7
Signed-off-by: Cristian Staretu <cristian.staretu@gmail.com>
Move the RS to a Deployment so there is no need to take care of the revision history
limits:
- Delete the old RS
- Make Calico manifest a deployment
- move deployments to apps/v1beta2 API since Kubernetes 1.8
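A sketch of the resulting manifest header (the Deployment name here is illustrative, not quoted from the change):

```yaml
apiVersion: apps/v1beta2   # available since Kubernetes 1.8
kind: Deployment
metadata:
  name: calico-policy-controller   # illustrative name
```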
* Defaults for apiserver_loadbalancer_domain_name
When loadbalancer_apiserver is defined, use the
apiserver_loadbalancer_domain_name with a given default value.
Fix inconsistencies between checking whether apiserver_loadbalancer_domain_name
is defined and using it with a default value at the same time.
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
* Define defaults for LB modes in common defaults
Adjust the defaults for apiserver_loadbalancer_domain_name and
loadbalancer_apiserver_localhost to come from a single source, which is
kubespray-defaults. Removes some confusion and simplifies the code.
Signed-off-by: Bogdan Dobrelya <bogdando@mail.ru>
Thought this wasn't required at first but I forgot there's no auto flush at the end of these tasks since the `kubernetes/master` role is not the end of the play.
* Fixes an issue where apiserver and friends (controller manager, scheduler) were prevented from restarting after manifests/secrets were changed. This occurred when a replaced kubelet did not reconcile new master manifests, which caused old master component versions to linger during deployment. In my case this was causing upgrades from k8s 1.6/1.7 -> k8s 1.8 to fail.
* Improves transitions from kubelet container to host kubelet by preventing issues where kubelet container reappeared during the deployment
This allows `kube_apiserver_insecure_port` to be set to 0 (disabled). It's working, but so far I have had to:
1. Make the `uri` module "Wait for apiserver up" checks use `kube_apiserver_port` (HTTPS)
2. Add apiserver client cert/key to the "Wait for apiserver up" checks
3. Update apiserver liveness probe to use HTTPS ports
4. Set `kube_api_anonymous_auth` to true to allow liveness probe to hit apiserver's /healthz over HTTPS (livenessProbes can't use client cert/key unfortunately)
5. RBAC has to be enabled. Anonymous requests are in the `system:unauthenticated` group which is granted access to /healthz by one of RBAC's default ClusterRoleBindings. An equivalent ABAC rule could allow this as well.
Changes 1 and 2 should work for everyone, but 3, 4, and 5 require new coupling of currently independent configuration settings. So I also added a new settings check.
Options:
1. The problem goes away if you have both anonymous-auth and RBAC enabled. This is how kubeadm does it. This may be the best way to go since RBAC is already on by default but anonymous auth is not.
2. Include conditional templates to set a different liveness probe for possible combinations of `kube_apiserver_insecure_port = 0`, RBAC, and `kube_api_anonymous_auth` (won't be possible to cover every case without a guaranteed authorizer for the secure port)
3. Use basic auth headers for the liveness probe (I really don't like this, it adds a new dependency on basic auth which I'd also like to leave independently configurable, and it requires encoded passwords in the apiserver manifest)
Option 1 seems like the clear winner to me, but is there a reason we wouldn't want anonymous-auth on by default? The apiserver binary defaults anonymous-auth to true, but kubespray's default was false.
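A minimal group-vars sketch of option 1 (the variable names appear above; treating `rbac_enabled` as the RBAC toggle is an assumption):

```yaml
# disable the insecure apiserver port entirely
kube_apiserver_insecure_port: 0
# anonymous requests let the HTTPS liveness probe reach /healthz
kube_api_anonymous_auth: true
# RBAC's default ClusterRoleBindings grant system:unauthenticated access to /healthz
rbac_enabled: true
```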
* Change deprecated vagrant ansible flag 'sudo' to 'become'
* Emphasize that the name of the pip_python_modules is only considered in CoreOS
* Remove useless unused variable
* Fix warning when jinja2 template-delimiters are used in a when statement
There is no need for jinja2 template-delimiters like {{ }} or {% %}
any more. They can just be omitted as described in https://github.com/ansible/ansible/issues/22397
* Fix broken link in getting-started guide
* Change deprecated vagrant ansible flag 'sudo' to 'become'
* Workaround an Ansible bug where accessing a var via the vars dict doesn't return the real value (see the sketch after this list)
When accessing a variable via its name "{{ foo }}", its value is
retrieved. But when the variable value is retrieved via the vars dict
"{{ vars['foo'] }}", the expression of the variable is no longer resolved
due to a bug. So e.g. an expression foo="{{ 1 == 1 }}" is no
longer resolved but just returned as the string "1 == 1".
* Make file yamllint compliant
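A hypothetical playbook illustrating the behavior described above:

```yaml
- hosts: localhost
  gather_facts: false
  tasks:
    - set_fact:
        foo: "{{ 1 == 1 }}"

    - debug:
        msg: "{{ foo }}"           # resolves to True

    - debug:
        msg: "{{ vars['foo'] }}"   # affected Ansible versions return the raw string "1 == 1"
```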
Some time ago I think the hardcoded `/var/lib/docker` was required, but kubelet running in a container has been aware of the Docker path since at least as far back as k8s 1.6.
Without this change, you see a large number of errors in the kubelet logs if you installed with a non-default `docker_daemon_graph`
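For instance, with a non-default graph path (the value below is a placeholder; the variable name is the one mentioned above):

```yaml
docker_daemon_graph: /data/docker
```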
This allows overriding of apt repo endpoints when internet sources are not accessible. Additionally, switch to using the dockerproject.org gpg key url for apt instead of keyservers.net
* Fix broken CI jobs
Adjust image and image_family scenarios for debian.
Checkout CI file for upgrades
* add debugging to file download
* Fix download for alternate playbooks
* Update ansible ssh args to force ssh user
* Update sync_container.yml
* Refactor downloads to use download role directly
Also disable fact delegation so the download delegate works across OSes.
* clean up bools and ansible_os_family conditionals
* Update main.yml
We need to set up resolv.conf before updating the Yum cache, otherwise no name resolution is available (resolv.conf is empty).
* Update main.yml
Removing trailing spaces
* Add the possibility to insert more IP addresses in certificates (see the sketch after this list)
* Add newline at end of files
* Move supp ip parameters to k8s-cluster group file
* Add supplementary addresses in kubeadm master role
* Improve openssl indexes
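A hypothetical group-vars sketch of the idea; the variable name below is an assumption, not quoted from the change:

```yaml
# extra IPs to add as SANs to the generated certificates
supplementary_addresses_in_ssl_keys:
  - 10.50.63.10        # e.g. an external loadbalancer IP
  - 192.168.0.100
```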
* don't try to install this rpm on fedora atomic
* add docker 1.13.1 for fedora
* built-in docker unit file is sufficient, as tested on both fedora and centos atomic
* Change file used to check kubeadm upgrade method
Test for ca.crt instead of admin.conf because admin.conf
is created during normal deployment.
* more fixes for upgrade
In 1.8, the Node authorization mode should be listed first to
allow kubelet to access secrets. This seems to only impact
environments with cloudprovider enabled.
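A hedged sketch of what "listed first" means for the authorization mode list (the exact variable name is an assumption):

```yaml
# Node must precede RBAC so kubelets can read the secrets they need
authorization_modes: ['Node', 'RBAC']
```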
* Change raw execution to use the yum module
Changed raw execution to use the yum module provided by Ansible.
* Replace ansible_ssh_* by ansible_*
Ansible 2.0 has deprecated the “ssh” from ansible_ssh_user, ansible_ssh_host, and ansible_ssh_port to become ansible_user, ansible_host, and ansible_port. If you are using a version of Ansible prior to 2.0, you should continue using the older style variables (ansible_ssh_*). These shorter variables are ignored, without warning, in older versions of Ansible.
I am not sure about the broader impact of this change, but I have seen in the requirements that the required version is ansible>=2.4.0.
http://docs.ansible.com/ansible/latest/intro_inventory.html
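For example, in a YAML inventory (host name and values are placeholders):

```yaml
all:
  hosts:
    node1:
      # pre-2.0 style: ansible_ssh_host, ansible_ssh_user, ansible_ssh_port
      ansible_host: 10.0.0.10
      ansible_user: ubuntu
      ansible_port: 22
```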
This role only supports Red Hat type distros and is not maintained
or used by many users. It should be removed because it creates
feature disparity between supported OSes and is not maintained.
* Rename dns_server to dnsmasq_dns_server so that it includes role prefix
as the var name is generic and conflicts when integrating with existing ansible automation.
* Enable selinux state to be configurable with new var preinstall_selinux_state
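For example (the values are placeholders; the variable names are those introduced above):

```yaml
dnsmasq_dns_server: 8.8.8.8           # formerly dns_server
preinstall_selinux_state: permissive  # e.g. enforcing, permissive or disabled
```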
PID namespace sharing is disabled only in Kubernetes 1.7.
Explicitly enabling it by default could help reduce unexpected
results when upgrading to or downgrading from 1.7.
The value cannot be determined properly via local facts, so
checking k8s api is the most reliable way to look up what hostname
is used when using a cloudprovider.
This follows pull request #1677, adding the cgroup-driver
autodetection also for kubeadm way of deploying.
Info about this and the possibility to override is added to the docs.
Red Hat family platforms run docker daemon with `--exec-opt
native.cgroupdriver=systemd`. When kubespray tried to start kubelet
service, it failed with:
Error: failed to run Kubelet: failed to create kubelet: misconfiguration: kubelet cgroup driver: "cgroupfs" is different from docker cgroup driver: "systemd"
Setting kubelet's cgroup driver to the correct value for the platform
fixes this issue. The code utilizes autodetection of docker's cgroup
driver, as different RPMs for the same distro may vary in that regard.
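A minimal sketch of this kind of autodetection as task-level Ansible (not the exact Kubespray implementation; `kubelet_cgroup_driver` is assumed to be the override variable):

```yaml
- name: Detect the Docker cgroup driver
  shell: "docker info 2>/dev/null | grep 'Cgroup Driver' | awk '{print $3}'"
  register: docker_cgroup_driver
  changed_when: false

- name: Align the kubelet cgroup driver with Docker
  set_fact:
    kubelet_cgroup_driver: "{{ docker_cgroup_driver.stdout | trim }}"
```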
New files: /etc/kubernetes/admin.conf
/root/.kube/config
$GITDIR/artifacts/{kubectl,admin.conf}
Optional method to download kubectl and admin.conf if
kubeconfig_localhost is set to true (default false)
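For example (the variable name and default are from the text above):

```yaml
# copy admin.conf (and kubectl) back to the Ansible host
kubeconfig_localhost: true
```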
* kubeadm support
* move k8s master to a subtask
* disable k8s secrets when using kubeadm
* fix etcd cert serial var
* move simple auth users to master role
* make a kubeadm-specific env file for kubelet
* add non-ha CI job
* change ci boolean vars to json format
* fixup
* Update create-gce.yml
* Update create-gce.yml
* Update create-gce.yml
* Fix netchecker update side effect
kubectl apply should only be used on resources created
with kubectl apply. To work around this, we should apply
the old manifest before upgrading it.
* Update 030_check-network.yml
This sets the br_netfilter module and the net.bridge.bridge-nf-call-iptables sysctl from a single play before kube-proxy is first run, instead of from the flannel and weave network_plugin roles after kube-proxy is started.
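A minimal sketch of that kind of play (task names and structure are illustrative, not the exact Kubespray tasks):

```yaml
- name: Load br_netfilter (it may be built into the kernel rather than a module)
  modprobe:
    name: br_netfilter
    state: present
  ignore_errors: true

- name: Set bridge-nf-call-iptables before kube-proxy starts
  sysctl:
    name: net.bridge.bridge-nf-call-iptables
    value: "1"
    state: present
```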
the uploads.yml playbook was broken with checksum mismatch errors in
various kubespray commits, for example, 3bfad5ca73
which updated the version from 3.0.6 to 3.0.17 without updating the
corresponding checksums.
* using separated vault roles for generate certs with different `O` (Organization) subject field;
* configure vault roles for issuing certificates with different `CN` (Common name) subject field;
* set `CN` and `O` to `kubernetes` and `etcd` certificates;
* vault/defaults vars definition was simplified;
* vault dirs variables defined in kubernetes-defaults roles for using
shared tasks in etcd and kubernetes/secrets roles;
* upgrade vault to 0.8.1;
* generate random vault user password for each role by default;
* fix `serial` file name for vault certs;
* move vault auth request to issue_cert tasks;
* enable `RBAC` in vault CI;
* Use kubectl apply instead of create/replace
Disable checks for existing resources to speed up execution.
* Fix non-rbac deployment of resources as a list
* Fix autoscaler tolerations field
* set all kube resources to state=latest
* Update netchecker and weave
* Added update CA trust step for etcd and kube/secrets roles
* Added load_balancer_domain_name to certificate alt names if defined. Reset CA's in RedHat os.
* Rename kube-cluster-ca.crt to vault-ca.crt; we need separate CAs for vault, etcd and kube.
* Vault role refactoring; remove the optional cert vault auth because it was not used and did not work. Create separate CAs for vault and etcd.
* Fixed different certificates set for vault cert_management
* Update doc/vault.md
* Fixed condition for creating the vault CA (wrong group)
* Fixed missing etcd_cert_path mount for rkt deployment type. Distribute vault roles for all vault hosts
* Removed wrong when condition in create etcd role vault tasks.
* Updates Controller Manager/Kubelet with Flannel's required configuration for CNI
* Removes old Flannel installation
* Install CNI enabled Flannel DaemonSet/ConfigMap/CNI bins and config (with portmap plugin) on host
* Uses RBAC if enabled
* Fixed an issue that could occur if br_netfilter is not a module and net.bridge.bridge-nf-call-iptables sysctl was not set
* Adding yaml linter to ci check
* Minor linting fixes from yamllint
* Changing CI to install python pkgs from requirements.txt
- adding in a secondary requirements.txt for tests
- moving yamllint to tests requirements
If Kubernetes > 1.6, register standalone master nodes with a
node-role.kubernetes.io/master=:NoSchedule taint to allow
for more flexible scheduling rather than just marking them unschedulable.
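The taint expressed as it appears in a node spec (a sketch; how it is actually registered, e.g. via kubelet flags, may differ):

```yaml
# node-role.kubernetes.io/master=:NoSchedule as a node taint
spec:
  taints:
    - key: node-role.kubernetes.io/master
      effect: NoSchedule
```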