
Shrinking and extending does not work with first node #210

Closed
pfaelzerchen opened this issue Jul 16, 2023 · 5 comments · Fixed by #219
Labels
wontfix This will not be worked on

Comments

@pfaelzerchen

Summary

I've set up a three-node HA cluster with etcd following the quickstart guide, using the hosts tick, trick and track. Then I wanted to test how to take single nodes out of the cluster (e.g. to install a new Ubuntu LTS release) and bring them back in. This works fine with trick and track, but not with tick.

I'm relatively new to Ansible and k3s, so sorry if I missed something obvious.

Issue Type

  • Bug Report

Controller Environment and Configuration

I'm using v3.4.2 from Ansible Galaxy. The dump below is from the shrinking run.

# Begin ANSIBLE VERSION
ansible [core 2.14.2]
  config file = None
  configured module search path = ['/home/matthias/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /home/matthias/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.11.2 (main, May 30 2023, 17:45:26) [GCC 12.2.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True
# End ANSIBLE VERSION

# Begin ANSIBLE CONFIG
CONFIG_FILE() = None
# End ANSIBLE CONFIG

# Begin ANSIBLE ROLES
# /home/matthias/.ansible/roles
- hifis.unattended_upgrades, v3.1.0
- xanmanning.k3s, v3.4.2
# End ANSIBLE ROLES

# Begin PLAY HOSTS
["tick", "trick", "track"]
# End PLAY HOSTS

# Begin K3S ROLE CONFIG
## tick
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_state: "uninstalled"
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t129\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.605739", "end": "2023-07-16 12:24:52.607773", "delta": "0:00:00.002034", "msg": "", "stdout_lines": ["cpuset\t0\t129\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

## trick
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t133\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.741053", "end": "2023-07-16 12:24:52.744222", "delta": "0:00:00.003169", "msg": "", "stdout_lines": ["cpuset\t0\t133\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

## track
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t129\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.737496", "end": "2023-07-16 12:24:52.740649", "delta": "0:00:00.003153", "msg": "", "stdout_lines": ["cpuset\t0\t129\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

# End K3S ROLE CONFIG

# Begin K3S RUNTIME CONFIG
## tick
## trick
## track
# End K3S RUNTIME CONFIG

Steps to Reproduce

  1. Set up a three-node cluster as described in the quickstart documentation.
  2. Follow the shrinking documentation for track => the cluster stays alive with 2 nodes.
  3. Follow the extending documentation for track => the cluster is alive with 3 nodes again.
  4. Follow the shrinking documentation for tick (see the sketch after this list) => there may be errors while running the playbook, but the cluster stays alive with 2 nodes.
  5. Follow the extending documentation for tick => the playbook run fails, may hang and needs to be rerun. The errors don't seem to be exactly reproducible.
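
The "shrinking" in steps 2 and 4 was done as described in the role documentation, by setting k3s_state: uninstalled on the host to be removed and rerunning the playbook (the role config dump above shows this for tick). A minimal sketch of the change, showing only the affected host entry from the inventory further down:

        tick:
          hostname: tick
          k3s_state: uninstalled  # added for the shrinking run, removed again before extending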

Playbook:

---
- name: Install k3s cluster
  hosts: kubernetes
  remote_user: matthias
  become: true
  vars:
    k3s_release_version: v1.27.3+k3s1
    k3s_become: true
    k3s_etcd_datastore: true
    k3s_use_experimental: false  # Note this is required for k3s < v1.19.5+k3s1
    k3s_use_unsupported_config: false
    k3s_install_hard_links: true
    k3s_build_cluster: true

  roles:
    - role: xanmanning.k3s

Inventory:

---
all:
  children:
    kubernetes:
      hosts:
        tick:
          hostname: tick
        trick:
          hostname: trick
        track:
          hostname: track
      vars:
        k3s_control_node: true
        k3s_server:
          disable:
            - traefik

Expected Result

The cluster is up and running with three nodes and using the existing certificates.

Actual Result

tick is up and running as a one-node cluster; trick and track are unable to start k3s: the systemd unit fails on those hosts.

I had also copied the kubectl configuration to my local machine. Locally, I can no longer connect with kubectl because the certificates no longer match, so it seems tick got a completely new installation with new certificates. After steps 1 and 2 the cluster was still reachable with the existing certificates.

@pfaelzerchen
Author

To add one more thing: I tried to simulate a failed node, so I destroyed the partition table on tick, reinstalled a fresh OS (Ubuntu 22.04 LTS) and reran the existing k3s playbook. It starts reinstalling things on tick, but I basically got the same result: tick is up and running a new cluster, while trick and track are broken.

@pfaelzerchen
Author

I did some more experiments. It seems that the role really relies on the first node in the inventory being present (see the inventory sketch at the end of this comment):

  1. Switch the positions of the first (tick) and second (trick) node in the inventory.
  2. Rerun the Ansible rollout.
  3. Now add k3s_state: uninstalled to the now-second node (tick) and rerun the Ansible rollout.

tick is successfully out of the cluster.

  4. Now remove the k3s_state: uninstalled entry from the inventory and rerun the Ansible rollout.

tick is present in the cluster again. Everything works fine.

  5. Switch the positions of tick and trick back, so tick is the first node in the list again, and rerun the Ansible rollout.

The rollout makes some changes, but the cluster stays fully functional, as expected in the first place.

So this probably isn't a bug in the code, but something for the documentation. In an HA deployment with three control nodes, this behaviour was definitely unexpected.
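
For reference, a sketch of the reordered inventory used in steps 1-3 (same hosts and variables as the original inventory; only the host order is changed, plus the temporary uninstalled state for tick):

---
all:
  children:
    kubernetes:
      hosts:
        trick:  # now listed first instead of tick
          hostname: trick
        tick:
          hostname: tick
          k3s_state: uninstalled  # added in step 3, removed again in step 4
        track:
          hostname: track
      vars:
        k3s_control_node: true
        k3s_server:
          disable:
            - traefik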

@stale

stale bot commented Sep 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 16, 2023
@xanmanning
Member

Not stale

@stale stale bot removed the wontfix This will not be worked on label Sep 16, 2023
@stale

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
