
Shrinking and extending does not work with first node #210

Closed
pfaelzerchen opened this issue Jul 16, 2023 · 5 comments · Fixed by #219
Labels
wontfix This will not be worked on

Comments

@pfaelzerchen

Summary

I've set up a three-node HA cluster with etcd following the quickstart guide, using the hosts tick, trick and track. Then I wanted to test how to take single nodes out of the cluster (e.g. to install a new Ubuntu LTS release) and bring them back in. This works fine with trick and track, but not with tick.

I'm relatively new to Ansible and k3s, so sorry if I missed something obvious.

Issue Type

  • Bug Report

Controller Environment and Configuration

I'm using v3.4.2 from Ansible Galaxy. The dump below is from the shrinking run.

# Begin ANSIBLE VERSION
ansible [core 2.14.2]
  config file = None
  configured module search path = ['/home/matthias/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python3/dist-packages/ansible
  ansible collection location = /home/matthias/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/bin/ansible
  python version = 3.11.2 (main, May 30 2023, 17:45:26) [GCC 12.2.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True
# End ANSIBLE VERSION

# Begin ANSIBLE CONFIG
CONFIG_FILE() = None
# End ANSIBLE CONFIG

# Begin ANSIBLE ROLES
# /home/matthias/.ansible/roles
- hifis.unattended_upgrades, v3.1.0
- xanmanning.k3s, v3.4.2
# End ANSIBLE ROLES

# Begin PLAY HOSTS
["tick", "trick", "track"]
# End PLAY HOSTS

# Begin K3S ROLE CONFIG
## tick
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_state: "uninstalled"
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t129\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.605739", "end": "2023-07-16 12:24:52.607773", "delta": "0:00:00.002034", "msg": "", "stdout_lines": ["cpuset\t0\t129\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

## trick
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t133\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.741053", "end": "2023-07-16 12:24:52.744222", "delta": "0:00:00.003169", "msg": "", "stdout_lines": ["cpuset\t0\t133\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

## track
k3s_control_node: true
k3s_server: {"disable": ["traefik"]}
k3s_check_openrc_run: {"changed": false, "skipped": true, "skip_reason": "Conditional result was False"}
k3s_check_cgroup_option: {"changed": false, "stdout": "cpuset\t0\t129\t1", "stderr": "", "rc": 0, "cmd": ["grep", "-E", "^cpuset\\s+.*\\s+1$", "/proc/cgroups"], "start": "2023-07-16 12:24:52.737496", "end": "2023-07-16 12:24:52.740649", "delta": "0:00:00.003153", "msg": "", "stdout_lines": ["cpuset\t0\t129\t1"], "stderr_lines": [], "failed": false, "failed_when_result": false}

# End K3S ROLE CONFIG

# Begin K3S RUNTIME CONFIG
## tick
## trick
## track
# End K3S RUNTIME CONFIG

Steps to Reproduce

  1. Set up a three-node cluster as described in the quickstart documentation.
  2. Follow the shrinking documentation for track => the cluster stays alive with 2 nodes.
  3. Follow the extending documentation for track => the cluster is alive with 3 nodes again.
  4. Follow the shrinking documentation for tick (see the sketch after this list) => there may be errors while running the playbook, but the cluster stays alive with 2 nodes.
  5. Follow the extending documentation for tick => the playbook run fails, may hang and needs to be rerun. The errors don't seem to be exactly reproducible.
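
The "shrinking" in steps 2 and 4 was done as described in the role documentation, by setting k3s_state: uninstalled on the host to be removed and rerunning the playbook (the role config dump above shows this for tick). A minimal sketch of the change, showing only the affected host entry from the inventory further down:

        tick:
          hostname: tick
          k3s_state: uninstalled  # added for the shrinking run, removed again before extending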

Playbook:

---
- name: Install k3s cluster
  hosts: kubernetes
  remote_user: matthias
  become: true
  vars:
    k3s_release_version: v1.27.3+k3s1
    k3s_become: true
    k3s_etcd_datastore: true
    k3s_use_experimental: false  # Note this is required for k3s < v1.19.5+k3s1
    k3s_use_unsupported_config: false
    k3s_install_hard_links: true
    k3s_build_cluster: true

  roles:
    - role: xanmanning.k3s

Inventory:

---
all:
  children:
    kubernetes:
      hosts:
        tick:
          hostname: tick
        trick:
          hostname: trick
        track:
          hostname: track
      vars:
        k3s_control_node: true
        k3s_server:
          disable:
            - traefik

Expected Result

The cluster is up and running with three nodes and using the existing certificates.

Actual Result

tick is up and running as a one-node cluster; trick and track are unable to start k3s: the systemd unit fails on those hosts.

I had also copied the kubectl configuration to my local machine. Locally, I can no longer connect with kubectl because the certificates no longer match, so it seems tick got a completely new installation with new certificates. After steps 1 and 2 the cluster was still reachable with the existing certificates.

@pfaelzerchen
Author

To add one more thing: I tried to simulate a failed node, so I destroyed the partition table on tick, reinstalled a fresh OS (Ubuntu 22.04 LTS) and reran the existing k3s playbook. It starts reinstalling things on tick, but I basically got the same result: tick is up and running a new cluster, while trick and track are broken.

@pfaelzerchen
Author

I did some more experiments. It seems that the role really relies on the first node in the inventory being present (see the inventory sketch at the end of this comment):

  1. Switch the positions of the first (tick) and second (trick) node in the inventory.
  2. Rerun the Ansible rollout.
  3. Now add k3s_state: uninstalled to the now-second node (tick) and rerun the Ansible rollout.

tick is successfully out of the cluster.

  4. Now remove the k3s_state: uninstalled entry from the inventory and rerun the Ansible rollout.

tick is present in the cluster again. Everything works fine.

  5. Switch the positions of tick and trick back, so tick is the first node in the list again, and rerun the Ansible rollout.

The rollout makes some changes, but the cluster stays fully functional, as expected in the first place.

So this probably isn't a bug in the code, but something for the documentation. In an HA deployment with three control nodes, this behaviour was definitely unexpected.
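
For reference, a sketch of the reordered inventory used in steps 1-3 (same hosts and variables as the original inventory; only the host order is changed, plus the temporary uninstalled state for tick):

---
all:
  children:
    kubernetes:
      hosts:
        trick:  # now listed first instead of tick
          hostname: trick
        tick:
          hostname: tick
          k3s_state: uninstalled  # added in step 3, removed again in step 4
        track:
          hostname: track
      vars:
        k3s_control_node: true
        k3s_server:
          disable:
            - traefik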

@stale

stale bot commented Sep 16, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 16, 2023
@xanmanning
Member

Not stale

@stale stale bot removed the wontfix This will not be worked on label Sep 16, 2023
@stale

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
