Fix recover-control-plane to work with etcd 3.3.x and add CI (#5500)
* Fix recover-control-plane to work with etcd 3.3.x and add CI
* Set default values for testcase
* Add actual test jobs
* Attempt to satisfy gitlab ci linter
* Fix ansible targets
* Set etcd_member_name as stated in the docs...
* Recovering from 0 masters is not supported yet
* Add other master to broken_kube-master group as well
* Increase number of retries to see if etcd needs more time to heal
* Make number of retries for ETCD loops configurable, increase it for recovery CI and document it
parent 68c8c05775
commit ac2135e450
23 changed files with 204 additions and 134 deletions
@@ -26,6 +26,8 @@ variables:
   RESET_CHECK: "false"
   UPGRADE_TEST: "false"
   LOG_LEVEL: "-vv"
+  RECOVER_CONTROL_PLANE_TEST: "false"
+  RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[2:],kube-master[1:]"

 before_script:
   - ./tests/scripts/rebase.sh
@@ -124,3 +124,19 @@ packet_amazon-linux-2-aio:
   stage: deploy-part2
   extends: .packet
   when: manual
+
+packet_ubuntu18-calico-ha-recover:
+  stage: deploy-part2
+  extends: .packet
+  when: on_success
+  variables:
+    RECOVER_CONTROL_PLANE_TEST: "true"
+    RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[2:],kube-master[1:]"
+
+packet_ubuntu18-calico-ha-recover-noquorum:
+  stage: deploy-part2
+  extends: .packet
+  when: on_success
+  variables:
+    RECOVER_CONTROL_PLANE_TEST: "true"
+    RECOVER_CONTROL_PLANE_TEST_GROUPS: "etcd[1:],kube-master[1:]"
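For context on how the CI exercises these jobs: `RECOVER_CONTROL_PLANE_TEST_GROUPS` is reused later in this diff as the Ansible `--limit` pattern for `reset.yml`, i.e. it names the hosts that get wiped to simulate a broken control plane. Ansible group slices are zero-indexed, so for the three-node CI cluster the patterns select the hosts sketched below (inventory path and host names are illustrative only):

```sh
# etcd[2:]        -> every etcd host from the third onwards
# kube-master[1:] -> every kube-master host from the second onwards
# The ha-recover job wipes one of three etcd members (quorum survives);
# the noquorum job uses etcd[1:] and wipes two of three (quorum is lost).
ansible -i inventory/sample/hosts.ini 'etcd[2:],kube-master[1:]' --list-hosts
```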
@@ -17,37 +17,23 @@ Examples of what broken means in this context:

 __Note that you need at least one functional node to be able to recover using this method.__

-## If etcd quorum is intact
+## Runbook

-* Set the etcd member names of the broken node(s) in the variable "old\_etcd\_members", this variable is used to remove the broken nodes from the etcd cluster.
-```old_etcd_members=etcd2,etcd3```
-* If you reuse identities for your etcd nodes add the inventory names for those nodes to the variable "old\_etcds". This will remove any previously generated certificates for those nodes.
-```old_etcds=etcd2.example.com,etcd3.example.com```
-* If you would like to remove the broken node objects from the kubernetes cluster add their inventory names to the variable "old\_kube\_masters"
-```old_kube_masters=master2.example.com,master3.example.com```
+* Move any broken etcd nodes into the "broken\_etcd" group and make sure the "etcd\_member\_name" variable is set for them.
+* Move any broken master nodes into the "broken\_kube-master" group.

-Then run the playbook with ```--limit etcd,kube-master```
+Then run the playbook with ```--limit etcd,kube-master``` and increase the number of etcd retries by setting ```-e etcd_retries=10``` or even higher; the number of retries required is difficult to predict.

-When finished you should have a fully working and highly available control plane again.
+When finished you should have a fully working control plane again.

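As a concrete sketch of the runbook above (host names, the inventory path and the etcd member name are hypothetical; the `broken_*` groups and the `etcd_retries` variable are the ones introduced in this commit):

```sh
# Hypothetical inventory additions: node2 is the broken control plane node.
# etcd_member_name must match the name the member is registered with in etcd.
cat >> inventory/mycluster/hosts.ini <<'EOF'
[broken_kube-master]
node2

[broken_etcd]
node2 etcd_member_name=etcd2
EOF

# Run the recovery, limited to the control plane, with more generous etcd retries.
ansible-playbook -i inventory/mycluster/hosts.ini \
  --limit etcd,kube-master \
  -e etcd_retries=10 \
  recover-control-plane.yml
```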
-## If etcd quorum is lost
+## Recover from lost quorum

-* If you reuse identities for your etcd nodes add the inventory names for those nodes to the variable "old\_etcds". This will remove any previously generated certificates for those nodes.
-```old_etcds=etcd2.example.com,etcd3.example.com```
-* If you would like to remove the broken node objects from the kubernetes cluster add their inventory names to the variable "old\_kube\_masters"
-```old_kube_masters=master2.example.com,master3.example.com```
-
-Then run the playbook with ```--limit etcd,kube-master```
-
-When finished you should have a fully working and highly available control plane again.
-
-The playbook will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot set the path to that snapshot in the "etcd\_snapshot" variable.
-
-```etcd_snapshot=/tmp/etcd_snapshot```
+The playbook attempts to figure out whether the etcd quorum is intact. If quorum is lost it will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot, set the path to that snapshot in the "etcd\_snapshot" variable.
+
+```-e etcd_snapshot=/tmp/etcd_snapshot```

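If you prefer to restore from a snapshot you took yourself rather than the one the playbook grabs from the first etcd node, the same invocation with the extra variable would look roughly like this (the snapshot path is only an example):

```sh
# Point the recovery at a pre-existing etcd snapshot instead of taking a new one.
ansible-playbook -i inventory/mycluster/hosts.ini \
  --limit etcd,kube-master \
  -e etcd_retries=10 \
  -e etcd_snapshot=/tmp/etcd_snapshot \
  recover-control-plane.yml
```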
 ## Caveats

-* The playbook has only been tested on control planes where the etcd and kube-master nodes are the same, the playbook will warn if run on a cluster with separate etcd and kube-master nodes.
 * The playbook has only been tested with fairly small etcd databases.
 * If your new control plane nodes have new ip addresses you may have to change settings in various places.
 * There may be disruptions while running the playbook.
@@ -22,7 +22,6 @@
 - hosts: "{{ groups['etcd'] | first }}"
   roles:
     - { role: kubespray-defaults}
-    - { role: recover_control_plane/pre-recover }
     - { role: recover_control_plane/etcd }

 - hosts: "{{ groups['kube-master'] | first }}"
@@ -62,3 +62,6 @@ etcd_secure_client: true

 # Enable peer client cert authentication
 etcd_peer_client_auth: true
+
+# Number of loop retries
+etcd_retries: 4
@@ -67,7 +67,7 @@
   shell: "{{ bin_dir }}/etcdctl --no-sync --endpoints={{ etcd_client_url }} cluster-health | grep -q 'cluster is healthy'"
   register: etcd_cluster_is_healthy
   until: etcd_cluster_is_healthy.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   ignore_errors: false
   changed_when: false
@@ -88,7 +88,7 @@
   shell: "{{ bin_dir }}/etcdctl --no-sync --endpoints={{ etcd_events_client_url }} cluster-health | grep -q 'cluster is healthy'"
   register: etcd_events_cluster_is_healthy
   until: etcd_events_cluster_is_healthy.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   ignore_errors: false
   changed_when: false
@@ -6,7 +6,7 @@
          {{ docker_bin_dir }}/docker rm -f etcdctl-binarycopy"
   register: etcd_task_result
   until: etcd_task_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   changed_when: false
   when: etcd_cluster_setup
@@ -3,7 +3,7 @@
   shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_events_access_addresses }} member add {{ etcd_member_name }} {{ etcd_events_peer_url }}"
   register: member_add_result
   until: member_add_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   when: target_node == inventory_hostname
   environment:
@@ -3,7 +3,7 @@
   shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} member add {{ etcd_member_name }} {{ etcd_peer_url }}"
   register: member_add_result
   until: member_add_result.rc == 0
-  retries: 4
+  retries: "{{ etcd_retries }}"
   delay: "{{ retry_stagger | random + 3 }}"
   when: target_node == inventory_hostname
   environment:
@@ -1,7 +1,78 @@
 ---
-- include_tasks: prepare.yml
+- name: Get etcd endpoint health
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem --endpoints={{ etcd_access_addresses }} endpoint health"
+  register: etcd_endpoint_health
+  ignore_errors: true
+  changed_when: false
+  check_mode: no
+  environment:
+    - ETCDCTL_API: 3
+  when:
+    - groups['broken_etcd']
+
+- name: Set healthy fact
+  set_fact:
+    healthy: "{{ etcd_endpoint_health.stderr | match('Error: unhealthy cluster') }}"
+  when:
+    - groups['broken_etcd']
+
+- name: Set has_quorum fact
+  set_fact:
+    has_quorum: "{{ etcd_endpoint_health.stdout_lines | select('match', '.*is healthy.*') | list | length >= etcd_endpoint_health.stderr_lines | select('match', '.*is unhealthy.*') | list | length }}"

 - include_tasks: recover_lost_quorum.yml
   when:
-    - has_etcdctl
-    - not etcd_cluster_is_healthy
+    - groups['broken_etcd']
+    - not has_quorum
+
+- name: Remove etcd data dir
+  file:
+    path: "{{ etcd_data_dir }}"
+    state: absent
+  delegate_to: "{{ item }}"
+  with_items: "{{ groups['broken_etcd'] }}"
+  when:
+    - groups['broken_etcd']
+    - has_quorum
+
+- name: Delete old certificates
+  # noqa 302 - rm is ok here for now
+  shell: "rm {{ etcd_cert_dir }}/*{{ item }}*"
+  with_items: "{{ groups['broken_etcd'] }}"
+  register: delete_old_cerificates
+  ignore_errors: true
+  when: groups['broken_etcd']
+
+- name: Fail if unable to delete old certificates
+  fail:
+    msg: "Unable to delete old certificates for: {{ item.item }}"
+  loop: "{{ delete_old_cerificates.results }}"
+  changed_when: false
+  when:
+    - groups['broken_etcd']
+    - "item.rc != 0 and not 'No such file or directory' in item.stderr"
+
+- name: Get etcd cluster members
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem member list"
+  register: member_list
+  changed_when: false
+  check_mode: no
+  environment:
+    - ETCDCTL_API: 3
+  when:
+    - groups['broken_etcd']
+    - not healthy
+    - has_quorum
+
+- name: Remove broken cluster members
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem --endpoints={{ etcd_access_addresses }} member remove {{ item[1].replace(' ','').split(',')[0] }}"
+  environment:
+    - ETCDCTL_API: 3
+  with_nested:
+    - "{{ groups['broken_etcd'] }}"
+    - "{{ member_list.stdout_lines }}"
+  when:
+    - groups['broken_etcd']
+    - not healthy
+    - has_quorum
+    - hostvars[item[0]]['etcd_member_name'] == item[1].replace(' ','').split(',')[2]
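For readers wondering how the member removal above matches hosts: each `member list` line is split on commas, field 0 being the member ID that is passed to `member remove` and field 2 the member name compared against each broken host's `etcd_member_name`. A rough manual equivalent, assuming etcd 3.3-style output and hypothetical IDs, endpoints and certificate names, would be:

```sh
# Hypothetical `etcdctl member list` line (ETCDCTL_API=3):
#   1609b5a3a1a3bdc7, started, etcd2, https://10.0.0.2:2380, https://10.0.0.2:2379
# field [0] -> member ID handed to `member remove`
# field [2] -> member name matched against hostvars[<broken host>].etcd_member_name
ETCDCTL_API=3 etcdctl \
  --cacert /etc/ssl/etcd/ssl/ca.pem \
  --cert /etc/ssl/etcd/ssl/admin-node1.pem \
  --key /etc/ssl/etcd/ssl/admin-node1-key.pem \
  --endpoints=https://10.0.0.1:2379 \
  member remove 1609b5a3a1a3bdc7
```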
@@ -1,48 +0,0 @@
----
-- name: Delete old certificates
-  # noqa 302 - rm is ok here for now
-  shell: "rm /etc/ssl/etcd/ssl/*{{ item }}* /etc/kubernetes/ssl/etcd/*{{ item }}*"
-  with_items: "{{ old_etcds.split(',') }}"
-  register: delete_old_cerificates
-  ignore_errors: true
-  when: old_etcds is defined
-
-- name: Fail if unable to delete old certificates
-  fail:
-    msg: "Unable to delete old certificates for: {{ item.item }}"
-  loop: "{{ delete_old_cerificates.results }}"
-  changed_when: false
-  when:
-    - old_etcds is defined
-    - "item.rc != 0 and not 'No such file or directory' in item.stderr"
-
-- name: Get etcd cluster members
-  shell: "{{ bin_dir }}/etcdctl member list"
-  register: member_list
-  changed_when: false
-  check_mode: no
-  environment:
-    - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem"
-  when:
-    - has_etcdctl
-    - etcd_cluster_is_healthy
-    - old_etcd_members is defined
-
-- name: Remove old cluster members
-  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} member remove {{ item[1].replace(' ','').split(',')[0] }}"
-  environment:
-    - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/admin-{{ inventory_hostname }}-key.pem"
-  with_nested:
-    - "{{ old_etcd_members.split(',') }}"
-    - "{{ member_list.stdout_lines }}"
-  when:
-    - has_etcdctl
-    - etcd_cluster_is_healthy
-    - old_etcd_members is defined
-    - item[0] == item[1].replace(' ','').split(',')[2]
@@ -1,11 +1,8 @@
 ---
 - name: Save etcd snapshot
-  shell: "{{ bin_dir }}/etcdctl snapshot save /tmp/snapshot.db"
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem snapshot save /tmp/snapshot.db"
   environment:
     - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem"
   when: etcd_snapshot is not defined

 - name: Transfer etcd snapshot to host
@@ -25,12 +22,9 @@
     state: absent

 - name: Restore etcd snapshot
-  shell: "{{ bin_dir }}/etcdctl snapshot restore /tmp/snapshot.db --name {{ etcd_member_name }} --initial-cluster {{ etcd_member_name }}={{ etcd_peer_url }} --initial-cluster-token k8s_etcd --initial-advertise-peer-urls {{ etcd_peer_url }} --data-dir {{ etcd_data_dir }}"
+  shell: "{{ bin_dir }}/etcdctl --cacert {{ etcd_cert_dir }}/ca.pem --cert {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem --key {{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem snapshot restore /tmp/snapshot.db --name {{ etcd_member_name }} --initial-cluster {{ etcd_member_name }}={{ etcd_peer_url }} --initial-cluster-token k8s_etcd --initial-advertise-peer-urls {{ etcd_peer_url }} --data-dir {{ etcd_data_dir }}"
   environment:
     - ETCDCTL_API: 3
-    - ETCDCTL_CA_FILE: /etc/ssl/etcd/ssl/ca.pem
-    - ETCDCTL_CERT: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}.pem"
-    - ETCDCTL_KEY: "/etc/ssl/etcd/ssl/member-{{ inventory_hostname }}-key.pem"

 - name: Remove etcd snapshot
   file:
@@ -8,21 +8,22 @@
   retries: 6
   delay: 10
   changed_when: false
+  when: groups['broken_kube-master']

-- name: Delete old kube-master nodes from cluster
+- name: Delete broken kube-master nodes from cluster
   shell: "{{ bin_dir }}/kubectl delete node {{ item }}"
   environment:
     - KUBECONFIG: "{{ ansible_env.HOME | default('/root') }}/.kube/config"
-  with_items: "{{ old_kube_masters.split(',') }}"
-  register: delete_old_kube_masters
+  with_items: "{{ groups['broken_kube-master'] }}"
+  register: delete_broken_kube_masters
   failed_when: false
-  when: old_kube_masters is defined
+  when: groups['broken_kube-master']

-- name: Fail if unable to delete old kube-master nodes from cluster
+- name: Fail if unable to delete broken kube-master nodes from cluster
   fail:
-    msg: "Unable to delete old kube-master node: {{ item.item }}"
-  loop: "{{ delete_old_kube_masters.results }}"
+    msg: "Unable to delete broken kube-master node: {{ item.item }}"
+  loop: "{{ delete_broken_kube_masters.results }}"
   changed_when: false
   when:
-    - old_kube_masters is defined
+    - groups['broken_kube-master']
     - "item.rc != 0 and not 'NotFound' in item.stderr"
@@ -1,2 +0,0 @@
----
-control_plane_is_converged: "{{ groups['etcd'] | sort == groups['kube-master'] | sort | bool }}"
@@ -1,36 +0,0 @@
----
-- name: Check for etcdctl binary
-  raw: "test -e {{ bin_dir }}/etcdctl"
-  register: test_etcdctl
-
-- name: Set has_etcdctl fact
-  set_fact:
-    has_etcdctl: "{{ test_etcdctl.rc == 0 | bool }}"
-
-- name: Check if etcd cluster is healthy
-  shell: "{{ bin_dir }}/etcdctl --endpoints={{ etcd_access_addresses }} cluster-health | grep -q 'cluster is healthy'"
-  register: etcd_cluster_health
-  ignore_errors: true
-  changed_when: false
-  check_mode: no
-  environment:
-    ETCDCTL_CERT_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}.pem"
-    ETCDCTL_KEY_FILE: "{{ etcd_cert_dir }}/admin-{{ inventory_hostname }}-key.pem"
-    ETCDCTL_CA_FILE: "{{ etcd_cert_dir }}/ca.pem"
-  when: has_etcdctl
-
-- name: Set etcd_cluster_is_healthy fact
-  set_fact:
-    etcd_cluster_is_healthy: "{{ etcd_cluster_health.rc == 0 | bool }}"
-
-- name: Abort if etcd cluster is healthy and old_etcd_members is undefined
-  assert:
-    that: "{{ old_etcd_members is defined }}"
-    msg: "'old_etcd_members' must be defined when the etcd cluster has quorum."
-  when: etcd_cluster_is_healthy
-
-- name: Warn for untested recovery
-  debug:
-    msg: Control plane recovery of split control planes is UNTESTED! Abort or continue at your own risk.
-  delay: 30
-  when: not control_plane_is_converged
@@ -5,7 +5,7 @@

 - name: Set VM count needed for CI test_id
   set_fact:
-    vm_count: "{%- if mode in ['separate', 'separate-scale', 'ha', 'ha-scale'] -%}{{ 3|int }}{%- elif mode == 'aio' -%}{{ 1|int }}{%- else -%}{{ 2|int }}{%- endif -%}"
+    vm_count: "{%- if mode in ['separate', 'separate-scale', 'ha', 'ha-scale', 'ha-recover', 'ha-recover-noquorum'] -%}{{ 3|int }}{%- elif mode == 'aio' -%}{{ 1|int }}{%- else -%}{{ 2|int }}{%- endif -%}"

 - import_tasks: create-vms.yml
   when:
@@ -45,6 +45,45 @@ instance-1

 [vault]
 instance-1
+{% elif mode == "ha-recover" %}
+[kube-master]
+instance-1
+instance-2
+
+[kube-node]
+instance-3
+
+[etcd]
+instance-3
+instance-1
+instance-2
+
+[broken_kube-master]
+instance-2
+
+[broken_etcd]
+instance-2 etcd_member_name=etcd3
+{% elif mode == "ha-recover-noquorum" %}
+[kube-master]
+instance-3
+instance-1
+instance-2
+
+[kube-node]
+instance-3
+
+[etcd]
+instance-3
+instance-1
+instance-2
+
+[broken_kube-master]
+instance-1
+instance-2
+
+[broken_etcd]
+instance-1 etcd_member_name=etcd2
+instance-2 etcd_member_name=etcd3
 {% endif %}

 [k8s-cluster:children]
tests/files/packet_ubuntu18-calico-ha-recover-noquorum.yml (new file, +10)
@@ -0,0 +1,10 @@
+---
+# Instance settings
+cloud_image: ubuntu-1804
+mode: ha-recover-noquorum
+vm_memory: 1600Mi
+
+# Kubespray settings
+kube_network_plugin: calico
+deploy_netchecker: true
+dns_min_replicas: 1
tests/files/packet_ubuntu18-calico-ha-recover.yml (new file, +10)
@@ -0,0 +1,10 @@
+---
+# Instance settings
+cloud_image: ubuntu-1804
+mode: ha-recover
+vm_memory: 1600Mi
+
+# Kubespray settings
+kube_network_plugin: calico
+deploy_netchecker: true
+dns_min_replicas: 1
@@ -47,6 +47,12 @@ if [ "${UPGRADE_TEST}" != "false" ]; then
   ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} --limit "all:!fake_hosts" $PLAYBOOK
 fi

+# Test control plane recovery
+if [ "${RECOVER_CONTROL_PLANE_TEST}" != "false" ]; then
+  ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} --limit "${RECOVER_CONTROL_PLANE_TEST_GROUPS}:!fake_hosts" -e reset_confirmation=yes reset.yml
+  ansible-playbook ${LOG_LEVEL} -e @${CI_TEST_VARS} -e local_release_dir=${PWD}/downloads -e ansible_python_interpreter=${PYPATH} -e etcd_retries=10 --limit etcd,kube-master:!fake_hosts recover-control-plane.yml
+fi
+
 # Tests Cases
 ## Test Master API
 ansible-playbook -e ansible_python_interpreter=${PYPATH} --limit "all:!fake_hosts" tests/testcases/010_check-apiserver.yml $LOG_LEVEL
@@ -25,3 +25,9 @@ kube-master
 calico-rr

 [calico-rr]
+
+[broken_kube-master]
+node2
+
+[broken_etcd]
+node2
@@ -29,6 +29,12 @@
 [vault]
 {{droplets.results[1].droplet.name}}
 {{droplets.results[2].droplet.name}}
+
+[broken_kube-master]
+{{droplets.results[1].droplet.name}}
+
+[broken_etcd]
+{{droplets.results[2].droplet.name}}
 {% else %}
 [kube-master]
 {{droplets.results[0].droplet.name}}
@@ -37,6 +37,13 @@
 {{node1}}
 {{node2}}
 {{node3}}
+
+[broken_kube-master]
+{{node2}}
+
+[etcd]
+{{node2}}
+{{node3}}
 {% elif mode == "default" %}
 [kube-master]
 {{node1}}