
Is upgrade auto-finalization a good default? #57887

Open
nick-jones opened this issue Dec 14, 2020 · 5 comments
Labels
A-cluster-upgrades C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@nick-jones

nick-jones commented Dec 14, 2020

Is your feature request related to a problem? Please describe.

Twice now we've been in a position where moving to a new version has caused issues and a downgrade has been necessary:

In the first instance we unfortunately did not set preserve_downgrade_option (though I'm not 100% sure it would have helped in that case). With the second issue we had set it, so we managed to avoid any big catastrophe... though from the issue you can see that someone else wasn't quite so lucky: #57032 (comment)

The v20.2 upgrade documentation specifically states:

we recommend disabling auto-finalization so you can monitor the stability and performance of the upgraded cluster before finalizing the upgrade

This suggests that most people should be trying to avoid auto-finalization.
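
For reference, this is roughly how we guard upgrades now, using the documented preserve_downgrade_option cluster setting (a sketch; the version string is illustrative and has to match the version you are upgrading from):

```sql
-- Before restarting any node on the new binary, pin the cluster to its
-- current version so that auto-finalization is blocked until we opt in.
-- '20.1' is illustrative; use the version the cluster is currently running.
SET CLUSTER SETTING cluster.preserve_downgrade_option = '20.1';

-- The active cluster version can be checked at any point during the rollout.
SHOW CLUSTER SETTING version;
```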

Describe the solution you'd like

My question is: is it sensible to default to auto-finalization, given some of the issues that pop up and the recommendations in your own documentation? I'm not actively watching issues here, so I don't have a reasonable perspective on how often people get into a tangle as a result of this. I did, however, feel it was worth raising the question.

I fully understand that requiring operators to take manual steps during upgrades is undesirable. I think it's worth weighing that against how quickly clusters currently get locked into a new version.

Describe alternatives you've considered

  • Having a flexible downgrade path regardless of what has happened would be an option, though I suspect it would take considerable effort.
  • Delay auto-finalization by some default duration, perhaps a considerable one, possibly with an option to force finalization sooner when required.
  • There is also the option of a cluster-wide setting that disables auto-finalization permanently and lets operators finalize the upgrade manually (as I understand it, this was the old behaviour; see the sketch below). If there is a general preference to retain auto-finalization, this could default to "off".
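
To illustrate the last alternative: with preserve_downgrade_option set, finalization today is a manual step along these lines (a sketch of the documented workflow, not proposed new syntax):

```sql
-- After monitoring the upgraded cluster and deciding it is healthy,
-- clearing the setting allows finalization to proceed.
RESET CLUSTER SETTING cluster.preserve_downgrade_option;

-- Once finalization completes, the cluster version reflects the new release.
SHOW CLUSTER SETTING version;
```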

Jira issue: CRDB-3471

@blathers-crl

blathers-crl bot commented Dec 14, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Dec 14, 2020
@ajwerner
Contributor

Thanks for raising the issue; it's a valid concern. One other note I'll make: it's best to roll out the new version incrementally, first upgrading one node and making sure everything looks good. Finalization cannot happen until all of the nodes are running the new version.

Also, even if we don't change the default, I think we can do more to sanity-check the upgrade. Namely, we should probably do more to validate that the state of the cluster is stable, perhaps by waiting to make sure nodes have been up for a little while and also checking that the schema seems valid.

For what it's worth, in the next release (21.1) we'll be introducing a long-running migration framework to perform durable migrations which today take multiple versions. These migrations will not be able to run until after the cluster version has been finalized. They may be yet another reason to revisit the user story here.

In short, keeping auto-finalization as the default, but only triggering it after a significantly longer amount of time (days?) and with the caveat that all nodes stay up during that period, seems much more reasonable to me.
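
Roughly the kind of checks I have in mind, sketched against crdb_internal (table and column names here are from memory and may differ across versions):

```sql
-- Every node should be live, report the new binary, and have been up for a
-- while before finalizing. (Column names may vary by version.)
SELECT node_id, build_tag, started_at, is_live
FROM crdb_internal.gossip_nodes;

-- No obviously broken schema descriptors should exist before finalizing.
-- (crdb_internal.invalid_objects is only available in newer versions.)
SELECT * FROM crdb_internal.invalid_objects;
```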

@ajwerner
Contributor

cc @vy-ton, who's been thinking about the upgrade user story a bit lately.

@irfansharif
Contributor

Some internal discussion here and here.

@github-actions

github-actions bot commented Sep 5, 2023

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
