
Issue running cluster-reset on v1.26 releases #4052

Closed
est-suse opened this issue Mar 23, 2023 · 8 comments
est-suse (Contributor) reported:

Environmental Info:
RKE2 Version:

rke2 version v1.26.3-rc1+rke2r1 (81b04f085bd73d7a285f63087489600fb011a7a4)
go version go1.19.7 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

NAME="Red Hat Enterprise Linux"
VERSION="9.1 (Plow)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="9.1"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux 9.1 (Plow)"
ANSI_COLOR="0;31"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/9/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 9"
REDHAT_BUGZILLA_PRODUCT_VERSION=9.1
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.1"

Cluster Configuration:

3 servers - 1 agent

Describe the bug:

Steps To Reproduce:

1. Install RKE2: sudo curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION=v1.26.3-rc1+rke2r1 INSTALL_RKE2_CHANNEL=testing sh -
2. Deploy some workloads.
3. Stop two of the server nodes with sudo rke2-killall.sh.
4. Confirm that the cluster is no longer accessible.
5. Stop the RKE2 server on the remaining node: sudo systemctl stop rke2-server
6. Run sudo rke2 server --cluster-reset

Expected behavior:

No error message should be displayed; the command should finish with this message:

INFO[0041] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes

Actual behavior:

The following error message is displayed:

FATA[0952] failed to wait for apiserver ready: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": read tcp 127.0.0.1:45194->127.0.0.1:6443: read: connection reset by peer - error from a previous attempt: read tcp 127.0.0.1:44986->127.0.0.1:6443: read: connection reset by peer

Additional context / logs:

@brandond brandond changed the title Issue running cluster-reset on RHEL 9.1 Issue running cluster-reset on v1.26 releases Mar 23, 2023
brandond (Member) commented Mar 23, 2023:

  • I can reproduce this issue on 1.26.3-rc1+rke2r1 and v1.26.2+rke2r1
  • I cannot reproduce this on v1.26.1+rke2r1
  • I cannot reproduce this on v1.25.7+rke2r1 or v1.25.8-rc1+rke2r1

The repeating error that I see on v1.26 is:

{"level":"warn","ts":"2023-03-23T16:57:10.281Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004de8c0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.17.0.4:2379: connect: connection refused\""}
{"level":"info","ts":"2023-03-23T16:57:10.281Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

On 1.25 releases I only see:
{"level":"warn","ts":"2023-03-23T19:01:14.227Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009b1500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}

This suggests that it is related to etcd client address AutoSync being enabled in k3s-io/k3s#6952 - but this was backported to 1.25 in k3s-io/k3s#6954 and it is working fine there, so there must be some additional complication; perhaps a change in behavior between etcd 3.5.4 and 3.5.5.
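The suspected auto-sync behavior can be illustrated with a toy model (all names here are illustrative, not etcd's actual clientv3 API): the client starts out configured with the loopback endpoint, then a periodic endpoint sync discards that list in favor of whatever the cluster members advertise, which is unreachable when etcd is only listening on loopback.

```go
package main

import "fmt"

// Member loosely mirrors what etcd reports about itself: the URLs it
// listens on versus the URLs it advertises to clients. Field names are
// hypothetical, for illustration only.
type Member struct {
	ListenClientURLs    []string
	AdvertiseClientURLs []string
}

// autoSyncEndpoints models clientv3's endpoint auto-sync: it replaces the
// endpoints the client was configured with by whatever the cluster members
// advertise, falling back to the configured list if nothing is advertised.
func autoSyncEndpoints(configured []string, members []Member) []string {
	var synced []string
	for _, m := range members {
		synced = append(synced, m.AdvertiseClientURLs...)
	}
	if len(synced) == 0 {
		return configured
	}
	return synced
}

func main() {
	// During cluster-reset, RKE2 starts etcd listening only on loopback
	// but still advertising the node's external IP.
	m := Member{
		ListenClientURLs:    []string{"https://127.0.0.1:2379"},
		AdvertiseClientURLs: []string{"https://172.17.0.4:2379"},
	}
	endpoints := []string{"https://127.0.0.1:2379"}
	endpoints = autoSyncEndpoints(endpoints, []Member{m})
	// The client now dials the advertised address, which nothing is
	// listening on -> "connection refused", as in the logs above.
	fmt.Println(endpoints)
}
```

In this model the race is visible: if the reset finishes before the sync fires, everything works; if the sync fires first, the client loses its only reachable endpoint.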

VestigeJ (Contributor) commented:

I reproduced this when performing the cluster-reset from a secondary node, rather than the node originally used to initialize the cluster.

$ rke2 -v

rke2 version v1.25.8-rc1+rke2r1 (a8edcda62ba13bb226f1dc8a429f2e37c0e81df0)
go version go1.19.7 X:boringcrypto

Repeating error in stdout

truncated...

INFO[0456] Tunnel server egress proxy waiting for runtime core to become available 
{"level":"warn","ts":"2023-03-23T21:59:23.093Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0009f4e00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
ERRO[0457] Failed to check local etcd status for learner management: context deadline exceeded 
W0323 21:59:23.093640   17764 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: authentication handshake failed: context canceled". Reconnecting...

truncated...

CTRL-C

DEBU[0502] Wrote ping                                   
^CFATA[0505] failed to wait for apiserver ready: timed out waiting for the condition, failed to get apiserver /readyz status: Get "https://127.0.0.1:6443/readyz": dial tcp 127.0.0.1:6443: connect: connection refused

Re-running the cluster-reset command on the same node immediately after cancelling results in the same error loop.

brandond (Member) commented:

Yes, for some reason etcd comes up but the etcd client is having trouble talking to it, so it just hangs there. Only appears to happen with the combination of etcd 3.5.5 and AutoSync enabled, on RKE2. The same releases of K3s work fine.

@brandond brandond added this to the v1.26.4+rke2r1 milestone Mar 23, 2023
@brandond brandond self-assigned this Mar 23, 2023
brandond (Member) commented:

Revisiting this today, I am having trouble reproducing it. I suspect there's a race condition between etcd startup and client creation; I'll have to work on triggering it reliably.

brandond (Member) commented Mar 24, 2023:

OK, I've found the issue. The problem is that we are setting the listen-address to loopback during the cluster reset (to prevent external connections while resetting) but are still advertising the public IP. From the etcd logs:

"msg":"now serving peer/client/metrics",
"local-member-id":"40fd14fa28910cab",
"initial-advertise-peer-urls":["https://172.17.0.4:2380"],
"listen-peer-urls":["https://127.0.0.1:2380"],
"advertise-client-urls":["https://172.17.0.4:2379"],
"listen-client-urls":["https://127.0.0.1:2379"],
"listen-metrics-urls":["http://127.0.0.1:2381"]

The race condition is between cluster-reset completing its task using the etcd client connection, and the etcd client connection's endpoint auto-sync reading the advertised address and trying to use that instead of the loopback address. Something may have changed in between etcd 3.5.4 and 3.5.5 to make this race more likely to occur.
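One way to close that race, sketched here under assumptions (this is not the actual k3s patch, and the function and parameter names are hypothetical), is to pin the advertised client URLs to loopback while a reset is in progress, so that endpoint auto-sync cannot hand the client an address nothing is listening on:

```go
package main

import "fmt"

// clientURLs returns the client URLs etcd should advertise. During a
// cluster reset everything is pinned to loopback, matching the loopback
// listen address, so auto-sync can never swap in an unreachable endpoint.
// Outside of a reset, the node's real IP is advertised as usual.
func clientURLs(clusterReset bool, nodeIP string) []string {
	if clusterReset {
		return []string{"https://127.0.0.1:2379"}
	}
	return []string{fmt.Sprintf("https://%s:2379", nodeIP)}
}

func main() {
	fmt.Println(clientURLs(true, "172.17.0.4"))  // [https://127.0.0.1:2379]
	fmt.Println(clientURLs(false, "172.17.0.4")) // [https://172.17.0.4:2379]
}
```

With advertise and listen addresses agreeing during the reset, it no longer matters which side of the race auto-sync lands on.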

brandond (Member) commented Apr 7, 2023:

/backport v1.25.9+rke2r1

brandond (Member) commented Apr 7, 2023:

/backport v1.24.13+rke2r1

ShylajaDevadiga (Contributor) commented:

Validated using commit ID be54040 on the master branch.

Environment Details

Infrastructure
Cloud EC2 instance

Node(s) CPU architecture, OS, and Version:
Ubuntu 20.04

Cluster Configuration:
3 servers, 1 agent node
CIS and non-CIS modes

Config.yaml:

cat /etc/rancher/k3s/config.yaml
token: <TOKEN>

Steps to reproduce the issue and validate the fix

Follow the steps to reproduce as mentioned in the issue

Results from reproducing the issue on v1.26.3+rke2r1

$ rke2 -v
rke2 version v1.26.3+rke2r1 (81b04f085bd73d7a285f63087489600fb011a7a4)
{"level":"warn","ts":"2023-04-10T06:24:17.790Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000a05500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 172.31.13.143:2379: connect: connection refused\""}
{"level":"info","ts":"2023-04-10T06:24:17.790Z","logger":"etcd-client","caller":"v3@v3.5.5-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}

Results from validating the fix using the commit ID:

$ rke2 -v
rke2 version v1.26.3+dev.be54040d
$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-10-16.us-east-2.compute.internal   Ready    <none>                      85m   v1.26.3+rke2r1
ip-172-31-14-40.us-east-2.compute.internal   Ready    control-plane,etcd,master   89m   v1.26.3+rke2r1
ip-172-31-5-202.us-east-2.compute.internal   Ready    control-plane,etcd,master   86m   v1.26.3+rke2r1
ip-172-31-6-66.us-east-2.compute.internal    Ready    control-plane,etcd,master   87m   v1.26.3+rke2r1

$ sudo rke2 server --cluster-reset
...
INFO[0060] Tunnel server egress proxy waiting for runtime core to become available 
INFO[0063] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes 

Successfully joined the nodes to the cluster after reset

$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-10-16.us-east-2.compute.internal   Ready    <none>                      94m   v1.26.3+rke2r1
ip-172-31-14-40.us-east-2.compute.internal   Ready    control-plane,etcd,master   98m   v1.26.3+rke2r1
ip-172-31-5-202.us-east-2.compute.internal   Ready    control-plane,etcd,master   95m   v1.26.3+rke2r1
ip-172-31-6-66.us-east-2.compute.internal    Ready    control-plane,etcd,master   95m   v1.26.3+rke2r1
