Pod Stuck in Terminating state if tainted node is power cycled.... #118286
Comments
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Not seen in Kubernetes 1.26.1.
/sig node |
@smarterclayton @wojtek-t @bobbypage any pointers? |
/cc |
/triage needs-information Can you share the kubelet.log after the restart? Maybe also before the restart, to see what state those pods were in at the moment of the power cycle. What is the time difference between un-tainting immediately versus after the node comes up? What does "immediately" mean here?
This issue is reproduced with k8s 1.27.2 as well. After tainting the node, the node is power cycled within 5 seconds. Some of the pods are stuck in the "Terminating" state. I tried restarting the kubelet multiple times, but the pods are not removed. However, as soon as the taint is removed and the kubelet is restarted, the "Terminating" pods are removed.
I reproduced this issue; please find the kubelet logs:
I tried to play around with a kind cluster to repro this but was unable to. Can you please provide more detailed repro instructions? In my repro attempt, the pod was in Terminating until the kubelet restarted. After the kubelet restarted, the pod ended up getting deleted.
Here's my repro attempt using a kind cluster... after the kubelet is restarted, the pod is deleted, so I am not able to repro the behavior mentioned in the issue. Is there something I'm missing here?
Hi @bobbypage, systemctl stop kubelet is a graceful shutdown, right? I was able to reproduce the issue with poweroff -f --reboot, which is not a graceful shutdown. Maybe kill -9 {kubelet pid} might also trigger this issue. -Nobin
You probably need to increase the number of pod replicas!
Here is the YAML for a pod that gets stuck in the Terminating state:
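The attached manifest is not preserved in this copy of the thread. As a hedged stand-in, a pod of the kind described (one that takes a long time to delete) could look like the following sketch; the name, image, and grace period are illustrative assumptions, not the reporter's actual manifest:

```shell
# Hypothetical stand-in for the attached manifest (not preserved here):
# a pod with a long grace period whose shell does not forward SIGTERM,
# so deletion only completes when the grace period expires.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: slow-terminating
spec:
  terminationGracePeriodSeconds: 300
  containers:
  - name: sleeper
    image: busybox
    # 'sh -c' without exec keeps sleep as a child process; SIGTERM sent to
    # sh is not forwarded, so the pod lingers for the full grace period.
    command: ["sh", "-c", "sleep infinity"]
EOF
```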
@bobbypage |
@bobbypage do you have enough to repro? |
@bobbypage @SergeyKanzhelev is any other information needed?
Thanks @mboukhalfa I was able to repro now, I followed the same exact steps as #118286 (comment) but instead of running 1 replica, I ran a deployment with 50 replicas.
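The key difference in the successful repro (50 replicas instead of 1) can be sketched with a deployment like the one below. This is an assumed manifest for illustration only, not the exact one used in the thread; the image and grace period are guesses:

```shell
# Illustrative sketch (assumed manifest): a 50-replica deployment of
# slow-to-terminate pods makes it likely that some deletions are still
# in flight at the moment the node is power cycled.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: slow-term
spec:
  replicas: 50
  selector:
    matchLabels:
      app: slow-term
  template:
    metadata:
      labels:
        app: slow-term
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: sleeper
        image: busybox
        command: ["sh", "-c", "sleep infinity"]
EOF
```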
Here is one of the pods that is stuck in "Terminating" as per kubectl.
Note the pod phase reported above. The kubelet logs: https://gist.github.com/bobbypage/d8f59c6d73d9527fbef3e5d8d1fe3f33
which is the same as in #118472 (comment)
I expect that this is a duplicate of #118472 |
@bobbypage thanks. I have noticed that a fix for that issue was already merged in #118497 and cherry-picked in the open PR #118841. I am waiting for a release containing one of them so I can test whether it fixes the issue here.
I just tested with v1.27.4 and this issue did not happen! As you expected, @bobbypage, the PRs related to #118472 fixed this as well.
/close |
@rphillips: Closing this issue. In response to this:
What happened?
If I power cycle the node after tainting it, certain pods are stuck in the "Terminating" state (these pods usually take more time to delete).
kubectl taint nodes control-plane-dc275-2-master-n5-1003 node-role.kubernetes.io/master:NoExecute
poweroff -f --reboot
pod is stuck in Terminating state.
But if I remove the taint immediately after poweroff, the pod eventually gets removed.
If I untaint after the node comes up, the pod stays stuck in the Terminating state.
Kubernetes Version: v1.27.1
What did you expect to happen?
Pod should get removed
How can we reproduce it (as minimally and precisely as possible)?
kubectl taint nodes control-plane-dc275-2-master-n5-1003 node-role.kubernetes.io/master:NoExecute
poweroff -f --reboot
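Combining the two commands above with the timing described earlier in the thread, the full repro sequence looks roughly like this (node name taken from this report; the grep filter is just a convenience for spotting stuck pods):

```shell
# On the control plane: taint the node so pods start being evicted.
kubectl taint nodes control-plane-dc275-2-master-n5-1003 \
  node-role.kubernetes.io/master:NoExecute

# On the node itself, within ~5 seconds of applying the taint (per the
# report), force an ungraceful reboot:
poweroff -f --reboot

# After the node comes back up, look for pods wedged in Terminating:
kubectl get pods --all-namespaces | grep Terminating
```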
Anything else we need to know?
Pods that get stuck in the Terminating state are usually ones that take more time to get deleted.
Kubernetes version
v1.27.1
Cloud provider
none
OS version
SLES 15 SP4
Install tools
none
Container runtime (CRI) and version (if applicable)
containerd
Related plugins (CNI, CSI, ...) and versions (if applicable)
No response