sled-agent: Should PUT /omicron-zones fail if it cannot remove a zone? #6776
Comments
I think that as long as a failed zone shutdown doesn't leave behind any in-memory state contradicting the fact that the zone is still running, this should be fine. In other words, the shutdown and the state management around the shutdown should be as atomic as possible. If the write to the ledger fails to happen, this is still fine, because the next update from Nexus will hopefully cause it to succeed. If the zone is already down, the shutdown attempt should succeed idempotently. I think the only problem we have to worry about in these cases is if the zone never shuts down or the writes to the ledger continuously fail. In that case we have a problem with the sled altogether.
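To illustrate the property being described here, a minimal sketch (not the actual sled-agent code; `ZoneManager`, `shutdown`, and the `halt_ok` flag standing in for the real halt result are all hypothetical) of an idempotent shutdown that only mutates in-memory state after the halt succeeds:

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum ShutdownError {
    ZoneStillRunning,
}

// Hypothetical stand-in for sled-agent's in-memory zone tracking.
struct ZoneManager {
    running: BTreeSet<String>, // zones we believe are running
}

impl ZoneManager {
    /// Idempotent shutdown: succeeds trivially if the zone is already
    /// gone, and only removes the zone from `running` once the halt
    /// actually succeeded. `halt_ok` simulates the halt outcome.
    fn shutdown(&mut self, zone: &str, halt_ok: bool) -> Result<(), ShutdownError> {
        if !self.running.contains(zone) {
            return Ok(()); // already down: idempotent success
        }
        if !halt_ok {
            // Leave `running` untouched: in-memory state still
            // reflects reality (the zone is still up).
            return Err(ShutdownError::ZoneStillRunning);
        }
        self.running.remove(zone);
        Ok(())
    }
}
```

The key design choice is that the `running` set is updated last, so a failed shutdown can never leave state claiming the zone is gone.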
I agree this is a problem, but I think it would be better not to fail the request and instead to implement APIs such that:
We discussed a similar approach under #5086. You could say that the reconciler loop already exists in Nexus, so why not let this request fail and let Nexus keep trying to drive the sled to the state it wants? But if you assume this sort of runtime error could come up at any point during zone creation or removal, I don't see how this converges. What if you've successfully removed one zone, then fail to remove another? You're neither in the new state nor the old state.

Underlying this is my assumption that we should generally be able to identify most handleable errors before we make any changes. For example, if the zone config instructs the sled agent to create a dataset on a pool that does not exist, that's something we can check for before we accept the request. The other kinds of problems seem both very unlikely and also arbitrarily bad -- e.g., ENOMEM trying to fork, or some logic bug that causes sled agent to try to do the wrong thing entirely.

I think this has long been a theoretical problem -- is there a case that came up in practice where we ran into an error in these paths and wound up barreling on in the wrong state?
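The "identify handleable errors before making any changes" idea can be sketched as a validate-then-apply split, where nothing is mutated until the whole request has been checked. This is a hypothetical illustration, not the Omicron implementation; `ZoneRequest`, `validate`, and `apply` are made-up names:

```rust
use std::collections::BTreeSet;

// Hypothetical request: each zone wants a dataset on a named pool.
struct ZoneRequest {
    name: String,
    pool: String,
}

/// Check the entire request against known state before touching anything.
fn validate(requests: &[ZoneRequest], existing_pools: &BTreeSet<String>) -> Result<(), String> {
    for req in requests {
        if !existing_pools.contains(&req.pool) {
            return Err(format!("zone {}: pool {} does not exist", req.name, req.pool));
        }
    }
    Ok(())
}

/// Apply the request; no changes are made before validation passes, so
/// most handleable errors reject the request without partial state.
fn apply(requests: &[ZoneRequest], existing_pools: &BTreeSet<String>) -> Result<(), String> {
    validate(requests, existing_pools)?;
    // ... only now create datasets / launch zones ...
    Ok(())
}
```

With this shape, a request that would fail for a detectable reason is rejected whole, avoiding the "neither new state nor old state" problem for that class of errors.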
Ooh, thanks, I didn't remember #5086. That plan does sound better.
No, still theoretical -- Rain and I were talking about what guarantees the planner has about the current state of the system.
When sled-agent receives the set of zones it should be running, it attempts to remove any zones that are currently running but are not part of the new set. However, failures to remove such zones are only logged and do not affect the result of the API request (`omicron/sled-agent/src/services.rs`, lines 3547 to 3550 at `13d411f`).
Additionally, sled-agent only attempts to remove a zone exactly once. The first thing `zone_bundle_and_try_remove` does is remove the zone from the in-memory list (`omicron/sled-agent/src/services.rs`, lines 3557 to 3564 at `13d411f`).
It then tries to shut down the zone, logs any errors, and (later, after trying to start any new zones) records that in-memory list back into the ledger. That means we have this flow, where even if Nexus resends the `PUT`, sled-agent won't retry the removal:

1. Nexus sends a `PUT` with generation N + 1 that removes one zone.
2. sled-agent fails to remove the zone, but still records the new config (generation N + 1) in its ledger.
3. Nexus resends the `PUT` request with generation N + 1; sled-agent thinks it's already there, so it does nothing.

@sunshowers and I were chatting in the context of reconfigurator / zone cleanup, and we think it may be important to fail this request if any zone removals fail. A motivating example is removing an expunged Nexus and reconfigurator deciding whether any sagas assigned to that Nexus have been reassigned. Reconfigurator needs to check two things:

1. The expunged Nexus zone is truly no longer running (and so can no longer claim new sagas).
2. There are no sagas still assigned to the expunged Nexus.
Critically: seeing that there are no sagas assigned to the expunged Nexus (item 2) alone is insufficient, since it's inherently racy if new assignments could still be made. It must first ensure item 1.
The working plan for reconfigurator is that it will determine "has a zone actually been expunged" by inspecting the regularly-polled inventory. sled-agent reports its current omicron-zones config to inventory, but based on the above, that report might be a lie if zone removal failed. "Zone expunged and inventory reports the zone is gone" should be enough to satisfy item 1 above for Nexus and saga assignment, but it isn't if sled-agent lies: a Nexus that sled-agent claims isn't running but actually is could continue to claim new sagas.
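The required ordering of the two checks can be condensed into a tiny predicate. This is purely illustrative (the function and parameter names are invented, not reconfigurator code):

```rust
/// Item 2 ("no sagas assigned") is only meaningful once item 1
/// ("inventory confirms the zone is gone") holds; checking saga
/// assignments alone is racy, because a still-running Nexus could
/// claim a new saga right after the check.
fn safe_to_reassign_sagas(inventory_reports_zone_gone: bool, sagas_assigned: usize) -> bool {
    inventory_reports_zone_gone && sagas_assigned == 0
}
```

And this predicate is exactly what breaks if inventory can report a zone as gone when it is actually still running.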
The sled-agent behavior was clearly intentional, and I haven't tried to dig into the history to figure out why. Can we change it to fail the `PUT` (and, critically, not update its ledger) if zone removal fails, instead of only logging?
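A minimal sketch of the proposed behavior (hypothetical code, not a patch against sled-agent): on removal failure, return an error and leave the ledger at its old generation, so that a resent `PUT` with the same generation actually re-attempts the removal instead of short-circuiting.

```rust
/// Proposed handling, simplified: only commit the new generation to the
/// ledger once all removals succeed. `removals_ok` simulates the
/// outcome of the zone removals for this request.
fn put_zones(ledger_gen: &mut u64, new_gen: u64, removals_ok: bool) -> Result<(), String> {
    if new_gen <= *ledger_gen {
        return Ok(()); // already applied
    }
    if !removals_ok {
        // Do NOT advance the ledger: the retry at the same generation
        // will take the removal path again instead of being a no-op.
        return Err("zone removal failed; config not committed".to_string());
    }
    *ledger_gen = new_gen;
    Ok(())
}
```

With this shape, Nexus's existing retry of the same `PUT` becomes a genuine retry of the removal, and inventory (which reflects the ledger) stops claiming a zone is gone while it is still running.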