
Edge Case: Rolling Update scales all machine sets to 0 #802

Closed
rishabh-11 opened this issue Mar 27, 2023 · 4 comments · Fixed by #803
Labels
area/robustness Robustness, reliability, resilience related
kind/bug Bug
priority/blocker Needs to be resolved now, because it breaks the service
status/closed Issue is closed (either delivered or triaged)

Comments

@rishabh-11
Contributor

rishabh-11 commented Mar 27, 2023

How to categorize this issue?

/area robustness
/kind bug
/priority critical

What happened:
During a rolling update on a live cluster, we observed that both the new and the old machine sets were scaled down to 0. This happened because, during the rolling update, the number of machines for the new machine set reached the allowedLimit, so no more machines could be added to it (code link). If we look at the code here, the nameToSize map is not populated for the corresponding machine set, and here we call scale on the machine set and pass 0 as the scale value.
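
To make the failure mode easier to follow, here is a minimal Go sketch of the pattern described above. All names, types, and the structure are hypothetical and not the actual MCM code: a machine set that is skipped because the allowedLimit is reached never gets an entry in the nameToSize map, so the later lookup returns the zero value 0 and the set is scaled down to 0.

```go
package main

import "fmt"

// machineSet is a simplified stand-in for the MCM MachineSet object.
type machineSet struct {
	name     string
	replicas int32
}

// scaleMachineSets sketches the buggy flow: machine sets that would push
// the total over allowedLimit are skipped, and their current size is never
// recorded in nameToSize.
func scaleMachineSets(machineSets []*machineSet, allowedLimit int32) {
	nameToSize := map[string]int32{}
	var total int32

	for _, ms := range machineSets {
		if total+ms.replicas > allowedLimit {
			// Bug: skipping without recording the current size leaves
			// nameToSize without an entry for this machine set.
			continue
		}
		nameToSize[ms.name] = ms.replicas
		total += ms.replicas
	}

	for _, ms := range machineSets {
		// A missing key yields the zero value 0, so the skipped
		// machine set is scaled down to 0 machines.
		scaleTo(ms, nameToSize[ms.name])
	}
}

func scaleTo(ms *machineSet, replicas int32) {
	fmt.Printf("scaling %s from %d to %d replicas\n", ms.name, ms.replicas, replicas)
	ms.replicas = replicas
}

func main() {
	sets := []*machineSet{
		{name: "machineset-old", replicas: 3},
		{name: "machineset-new", replicas: 3},
	}
	// An allowedLimit of 3 forces the second machine set to be skipped
	// and therefore scaled to 0 by the zero-value lookup above.
	scaleMachineSets(sets, 3)
}
```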

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:
The PR that introduced this regression: #765

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@rishabh-11 rishabh-11 added the kind/bug Bug label Mar 27, 2023
@gardener-robot gardener-robot added area/robustness Robustness, reliability, resilience related priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 27, 2023
@vlerenc
Member

vlerenc commented Mar 29, 2023

/blocker

We have known about this catastrophic bug for 2 days?

@gardener-robot gardener-robot added priority/blocker Needs to be resolved now, because it breaks the service and removed priority/critical Needs to be resolved soon, because it impacts users negatively labels Mar 29, 2023
@himanshu-kun
Contributor

Yes, work is ongoing; I will raise a PR by noon.

@vlerenc
Member

vlerenc commented Mar 29, 2023

What is fastest: a hotfix PR, or do you have a short-term mitigation that Gardener operators can apply? We probably cannot change the max settings across all clusters and pools, and we cannot really disable reconciliation everywhere... is there anything an operator can do in the meantime to protect their clusters until the hotfix becomes available?

I mean, it’s a total meltdown, one of the worst events that can happen (besides losing ETCD, of course). MCM breaks absolutely everything in such a cluster. That’s pretty worrisome.

@himanshu-kun
Contributor

PR #803 is raised and is in the process of being merged and cherry-picked.
No, I can't think of any way to stop this in the meantime. I haven't had time yet to assess, from a landscape perspective, when it could occur.
