
Issue running cluster-reset on v1.26 releases #4052

Closed
est-suse opened this issue Mar 23, 2023 · 8 comments
est-suse (Contributor) reported:

Environmental Info:
RKE2 Version:

rke2 version v1.26.3-rc1+rke2r1 (81b04f085bd73d7a285f63087489600fb011a7a4)
go version go1.19.7 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

NAME="Red Hat Enterprise Linux"
VERSION="9.1 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.1 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/9/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"

Cluster Configuration:

3 servers - 1 agent

Describe the bug:

Steps To Reproduce:

1. Install RKE2: sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.26.3-rc1+rke2r1 INSTALL_RKE2_CHANNEL=testing sh -
2. Deploy some workloads.
3. Stop two of the server nodes with sudo rke2-killall.sh.
4. Confirm that the cluster is no longer accessible.
5. Stop the RKE2 server on the remaining node: sudo systemctl stop rke2-server
6. Run sudo rke2 server --cluster-reset

Expected behavior:

No error message should be displayed; the command should finish with this message:

INFO[0041] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

Actual behavior:

The following error message is displayed:

FATA[0952] failed to wait for apiserver ready: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": read tcp 127.0.0.1:45194->127.0.0.1:6443: read: connection reset by peer - error from a previous attempt: read tcp 127.0.0.1:44986->127.0.0.1:6443: read: connection reset by peer

Additional context / logs:

@brandond brandond changed the title Issue running cluster-reset on RHEL 9.1 Issue running cluster-reset on v1.26 releases Mar 23, 2023
brandond (Member) commented Mar 23, 2023:

  • I can reproduce this issue on 1.26.3-rc1+rke2r1 and v1.26.2+rke2r1
  • I cannot reproduce this on v1.26.1+rke2r1
  • I cannot reproduce this on v1.25.7+rke2r1 or v1.25.8-rc1+rke2r1

The repeating error that I see on v1.26 is:

{"level":"warn","ts":"2023-03-23T16:57:10.281Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004de8c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.4:2379: connect: connection refused\""}
{"level":"info","ts":"2023-03-23T16:57:10.281Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

On 1.25 releases I only see:
{"level":"warn","ts":"2023-03-23T19:01:14.227Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009b1500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}

This suggests that it is related to etcd client address AutoSync being enabled in k3s-io/k3s#6952 - but this was backported to 1.25 in k3s-io/k3s#6954 and it is working fine there, so there must be some additional complication; perhaps a change in behavior between etcd 3.5.4 and 3.5.5.
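The suspected auto-sync behavior can be illustrated with a toy model (all names here are illustrative, not etcd's actual clientv3 API): the client starts out configured with the loopback endpoint, then a periodic endpoint sync discards that list in favor of whatever the cluster members advertise, which is unreachable when etcd is only listening on loopback.

```go
package main

import "fmt"

// Member loosely mirrors what etcd reports about itself: the URLs it
// listens on versus the URLs it advertises to clients. Field names are
// hypothetical, for illustration only.
type Member struct {
	ListenClientURLs    []string
	AdvertiseClientURLs []string
}

// autoSyncEndpoints models clientv3's endpoint auto-sync: it replaces the
// endpoints the client was configured with by whatever the cluster members
// advertise, falling back to the configured list if nothing is advertised.
func autoSyncEndpoints(configured []string, members []Member) []string {
	var synced []string
	for _, m := range members {
		synced = append(synced, m.AdvertiseClientURLs...)
	}
	if len(synced) == 0 {
		return configured
	}
	return synced
}

func main() {
	// During cluster-reset, RKE2 starts etcd listening only on loopback
	// but still advertising the node's external IP.
	m := Member{
		ListenClientURLs:    []string{"https://127.0.0.1:2379"},
		AdvertiseClientURLs: []string{"https://172.17.0.4:2379"},
	}
	endpoints := []string{"https://127.0.0.1:2379"}
	endpoints = autoSyncEndpoints(endpoints, []Member{m})
	// The client now dials the advertised address, which nothing is
	// listening on -> "connection refused", as in the logs above.
	fmt.Println(endpoints)
}
```

In this model the race is visible: if the reset finishes before the sync fires, everything works; if the sync fires first, the client loses its only reachable endpoint.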

VestigeJ (Contributor) commented:

I reproduced this when performing the cluster-reset from a secondary node, rather than the node originally used to initialize the cluster.

$ rke2 -v

rke2 version v1.25.8-rc1+rke2r1 (a8edcda62ba13bb226f1dc8a429f2e37c0e81df0)
go version go1.19.7 X:boringcrypto

Repeating error in stdout

truncated...

INFO[0456] Tunnel server egress proxy waiting for runtime core to become available 
{"level":"warn","ts":"2023-03-23T21:59:23.093Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009f4e00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0457] Failed to check local etcd status for learner management: context deadline exceeded 
W0323 21:59:23.093640   17764 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...

truncated...

CTRL-C

DEBU[0502] Wrote ping                                   
^CFATA[0505] failed to wait for apiserver ready: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": dial tcp 127.0.0.1:6443: connect: connection refused

Re-running the cluster-reset command on the same node immediately after cancelling results in the same error loop.

brandond (Member) commented:

Yes, for some reason etcd comes up but the etcd client is having trouble talking to it, so it just hangs there. Only appears to happen with the combination of etcd 3.5.5 and AutoSync enabled, on RKE2. The same releases of K3s work fine.

@brandond brandond added this to the v1.26.4+rke2r1 milestone Mar 23, 2023
@brandond brandond self-assigned this Mar 23, 2023
brandond (Member) commented:

Revisiting this today, I am having trouble reproducing it. I suspect there's a race condition between etcd startup and client creation; I'll have to work on triggering it reliably.

brandond (Member) commented Mar 24, 2023:

OK, I've found the issue. The problem is that we are setting the listen-address to loopback during the cluster reset (to prevent external connections while resetting) but are still advertising the public IP. From the etcd logs:

"msg":"now serving peer/client/metrics",
"local-member-id":"40fd14fa28910cab",
"initial-advertise-peer-urls":["https://172.17.0.4:2380"],
"listen-peer-urls":["https://127.0.0.1:2380"],
"advertise-client-urls":["https://172.17.0.4:2379"],
"listen-client-urls":["https://127.0.0.1:2379"],
"listen-metrics-urls":["http://127.0.0.1:2381"]

The race condition is between cluster-reset completing its task using the etcd client connection, and the etcd client connection's endpoint auto-sync reading the advertised address and trying to use that instead of the loopback address. Something may have changed in between etcd 3.5.4 and 3.5.5 to make this race more likely to occur.
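One way to close that race, sketched here under assumptions (this is not the actual k3s patch, and the function and parameter names are hypothetical), is to pin the advertised client URLs to loopback while a reset is in progress, so that endpoint auto-sync cannot hand the client an address nothing is listening on:

```go
package main

import "fmt"

// clientURLs returns the client URLs etcd should advertise. During a
// cluster reset everything is pinned to loopback, matching the loopback
// listen address, so auto-sync can never swap in an unreachable endpoint.
// Outside of a reset, the node's real IP is advertised as usual.
func clientURLs(clusterReset bool, nodeIP string) []string {
	if clusterReset {
		return []string{"https://127.0.0.1:2379"}
	}
	return []string{fmt.Sprintf("https://%s:2379", nodeIP)}
}

func main() {
	fmt.Println(clientURLs(true, "172.17.0.4"))  // [https://127.0.0.1:2379]
	fmt.Println(clientURLs(false, "172.17.0.4")) // [https://172.17.0.4:2379]
}
```

With advertise and listen addresses agreeing during the reset, it no longer matters which side of the race auto-sync lands on.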

brandond (Member) commented Apr 7, 2023:

/backport v1.25.9+rke2r1

brandond (Member) commented Apr 7, 2023:

/backport v1.24.13+rke2r1

ShylajaDevadiga (Contributor) commented:

Validated using commit ID be54040 on the master branch.

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu 20.04

Cluster Configuration:
3 servers, 1 agent node
CIS and non-CIS modes

Config.yaml:

cat /etc/rancher/k3s/config.yaml
token: <TOKEN>

Steps to reproduce the issue and validate the fix

Follow the steps to reproduce as mentioned in the issue

Results from reproducing the issue on v1.26.3+rke2r1

$ rke2 -v
rke2 version v1.26.3+rke2r1 (81b04f085bd73d7a285f63087489600fb011a7a4)
{"level":"warn","ts":"2023-04-10T06:24:17.790Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a05500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.31.13.143:2379: connect: connection refused\""}
{"level":"info","ts":"2023-04-10T06:24:17.790Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

Results from validating the fix using the commit ID:

$ rke2 -v
rke2 version v1.26.3+dev.be54040d
$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-10-16.us-east-2.compute.internal   Ready    <none>                      85m   v1.26.3+rke2r1
ip-172-31-14-40.us-east-2.compute.internal   Ready    control-plane,etcd,master   89m   v1.26.3+rke2r1
ip-172-31-5-202.us-east-2.compute.internal   Ready    control-plane,etcd,master   86m   v1.26.3+rke2r1
ip-172-31-6-66.us-east-2.compute.internal    Ready    control-plane,etcd,master   87m   v1.26.3+rke2r1

$ sudo rke2 server --cluster-reset
...
INFO[0060] Tunnel server egress proxy waiting for runtime core to become available 
INFO[0063] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes 

Successfully joined the nodes to the cluster after reset

$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-10-16.us-east-2.compute.internal   Ready    <none>                      94m   v1.26.3+rke2r1
ip-172-31-14-40.us-east-2.compute.internal   Ready    control-plane,etcd,master   98m   v1.26.3+rke2r1
ip-172-31-5-202.us-east-2.compute.internal   Ready    control-plane,etcd,master   95m   v1.26.3+rke2r1
ip-172-31-6-66.us-east-2.compute.internal    Ready    control-plane,etcd,master   95m   v1.26.3+rke2r1
