
[feature request] Make daemonset controller stick to strategy.rollingUpdate.maxUnavailable #1323

Closed
miklezzzz opened this issue Jun 27, 2023 · 10 comments
Labels: kind/feature-request, wontfix



miklezzzz commented Jun 27, 2023

What would you like to be added:
At the moment, the DaemonSet controller is allowed to delete more pods than the maxUnavailable setting permits if those pods fail the podutil.IsPodAvailable check. As I understand it, this was done to decrease recovery time. At the same time, circumstances can arise that result in all the pods of a DaemonSet being deleted simultaneously. When a DaemonSet schedules critical workloads, such as an Ingress controller's pods, this becomes a problem, because downtime is involved.
We use the OpenKruise DaemonSet controller quite extensively and have already observed such problems.
The proposal is to have a way to make the DaemonSet controller strictly obey the maxUnavailable setting. We ended up applying the following changes in our environment: miklezzzz@ef27767.

One really simple example is a DaemonSet with minReadySeconds set. If we update the DaemonSet twice within a period shorter than minReadySeconds, on the second update the DaemonSet controller deletes all the pods at once.
There are more examples, but they are a bit more elaborate.
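
For illustration, here is a minimal, hypothetical Go sketch of the selection logic described above and of the proposed strict cap. This is not the actual OpenKruise code: the pod type, the availability flag (standing in for podutil.IsPodAvailable), and the strict switch are all stubs for the sake of the example.

```go
package main

import "fmt"

type pod struct {
	name      string
	available bool // stand-in for the podutil.IsPodAvailable check
}

// podsToDelete sketches how the rolling update picks old pods to delete.
// strict=false mirrors today's behavior; strict=true is the proposal.
func podsToDelete(oldPods []pod, maxUnavailable int, strict bool) []string {
	var replacements, candidates []string
	numUnavailable := 0
	for _, p := range oldPods {
		if !p.available {
			// Current behavior: an unavailable pod is always replaced,
			// on the theory that deleting it cannot reduce availability.
			replacements = append(replacements, p.name)
			numUnavailable++
		} else {
			candidates = append(candidates, p.name)
		}
	}

	if strict && len(replacements) > maxUnavailable {
		// Proposed behavior: unavailable pods also consume the budget,
		// so at most maxUnavailable pods are deleted per sync.
		replacements = replacements[:maxUnavailable]
	}

	// Only still-available candidates are capped by the leftover budget.
	remaining := maxUnavailable - numUnavailable
	if remaining < 0 {
		remaining = 0
	}
	if remaining > len(candidates) {
		remaining = len(candidates)
	}
	return append(replacements, candidates[:remaining]...)
}

func main() {
	// The minReadySeconds scenario: a second update within minReadySeconds
	// finds every pod "unavailable", so all of them are deleted at once.
	pods := []pod{{"a", false}, {"b", false}, {"c", false}}
	fmt.Println(podsToDelete(pods, 1, false)) // [a b c]: every pod at once
	fmt.Println(podsToDelete(pods, 1, true))  // [a]: capped at maxUnavailable
}
```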

Why is this needed:
It would allow a user to have finer control over the pace of a DaemonSet's updates.

zmberg (Member) commented Jun 28, 2023

@miklezzzz If remainingUnavailable is always 0 because some nodeToDaemonPod is never ready, then the DaemonSet will block the update operation.
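
Concretely, reusing the hypothetical podsToDelete sketch from the issue description (an illustration of the concern, not actual controller behavior):

```go
// Two pods are permanently NotReady, maxUnavailable is 1.
pods := []pod{{"stuck-1", false}, {"stuck-2", false}, {"healthy", true}}

// Current behavior: both broken pods are replaced immediately.
fmt.Println(podsToDelete(pods, 1, false)) // [stuck-1 stuck-2]

// Strict behavior: one deletion per sync. If the replacements also come
// up NotReady, the budget is consumed forever and "healthy" never updates.
fmt.Println(podsToDelete(pods, 1, true)) // [stuck-1]
```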

miklezzzz (Author) commented

@zmberg Hi! Thanks for the response!
I couldn't reproduce such an issue; could you please elaborate on that case?
As we know, remainingUnavailable is the result of maxUnavailable (1) minus numUnavailable, and the latter gets incremented in the outer switch section. Even if I intentionally prevent some pods of a DaemonSet from running (so that they are stuck in a crash loop), when I update the DaemonSet it still tries to update at least one of the crashlooping pods, and if that pod doesn't start up, the update process gets stuck, but in general that is how it should be.
P.S. I can provide a GIF recorded in a test cluster.
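
For readers following along, the step being discussed looks roughly like the function below. This is a paraphrase of the upstream rolling-update selection, not a verbatim quote; the parameter names follow the upstream code, where allowedReplacementPods collects pods that failed podutil.IsPodAvailable (each incrementing numUnavailable in the switch) and candidatePodsToDelete collects still-available old pods.

```go
// selectOldPodsToDelete paraphrases the budget computation: unavailable
// pods bypass the maxUnavailable cap entirely, while available candidates
// are limited to whatever budget is left over.
func selectOldPodsToDelete(allowedReplacementPods, candidatePodsToDelete []string,
	maxUnavailable, numUnavailable int) []string {
	remainingUnavailable := maxUnavailable - numUnavailable
	if remainingUnavailable < 0 {
		remainingUnavailable = 0
	}
	oldPodsToDelete := append([]string{}, allowedReplacementPods...) // uncapped
	if remainingUnavailable > len(candidatePodsToDelete) {
		remainingUnavailable = len(candidatePodsToDelete)
	}
	return append(oldPodsToDelete, candidatePodsToDelete[:remainingUnavailable]...)
}
```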


stale bot commented Sep 26, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 26, 2023
nabokihms commented

@zmberg any news on this PR?

stale bot removed the wontfix label on Sep 26, 2023

stale bot commented Dec 27, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Dec 27, 2023
nabokihms commented

I'd still like to see an answer here.

stale bot removed the wontfix label on Dec 27, 2023

stale bot commented Apr 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Apr 1, 2024
nabokihms commented

/unstale

stale bot removed the wontfix label on Apr 1, 2024
zmberg (Member) commented Apr 7, 2024

@nabokihms @miklezzzz
The overall logic of Advanced DaemonSet is an enhancement on top of the native K8s DaemonSet, and at the moment this piece of logic simply follows the upstream community implementation.

If the Pod is Not Ready, I think this logic makes more sense, as it allows for quick recovery. Are you currently running into problems because of the scenario with minReadySeconds set?
[image attachment]


stale bot commented Jul 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Jul 7, 2024
stale bot closed this as completed on Jul 15, 2024