
Is upgrade auto-finalization a good default? #57887

Open
nick-jones opened this issue Dec 14, 2020 · 5 comments
Labels
A-cluster-upgrades C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community X-blathers-triaged blathers was able to find an owner

Comments

@nick-jones

nick-jones commented Dec 14, 2020

Is your feature request related to a problem? Please describe.

Twice now we've been in a position where moving to a new version has caused issues and a downgrade has been necessary:

In the first instance we unfortunately did not set preserve_downgrade_option (though I'm not 100% sure it would have helped in that case). With the second issue we had set it, so we managed to avoid any big catastrophe... though from the issue you can see that someone else wasn't quite so lucky: #57032 (comment)

The v20.2 upgrade documentation specifically states:

we recommend disabling auto-finalization so you can monitor the stability and performance of the upgraded cluster before finalizing the upgrade

This suggests that most people should be trying to avoid auto-finalization.
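
For reference, this is roughly how we guard upgrades now, using the documented preserve_downgrade_option cluster setting (a sketch; the version string is illustrative and has to match the version you are upgrading from):

```sql
-- Before restarting any node on the new binary, pin the cluster to its
-- current version so that auto-finalization is blocked until we opt in.
-- '20.1' is illustrative; use the version the cluster is currently running.
SET CLUSTER SETTING cluster.preserve_downgrade_option = '20.1';

-- The active cluster version can be checked at any point during the rollout.
SHOW CLUSTER SETTING version;
```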

Describe the solution you'd like

My question is: is it sensible to default to auto-finalization, given some of the issues that pop up and the recommendations in your own documentation? I'm not actively watching issues here, so I don't have a reasonable perspective on how often people get into a tangle as a result of this. I did, however, feel it was worth raising the question.

I fully understand that requiring operators to take manual steps during upgrades is undesirable. I think it's worth weighing that against how quickly clusters currently get locked into a new version.

Describe alternatives you've considered

  • Having a flexible downgrade path regardless of what has happened would be an option, though I suspect it would take considerable effort.
  • Delay auto-finalization by some default duration, perhaps a considerable one, possibly with an option to force finalization sooner when required.
  • There is also the option of a cluster-wide setting that disables auto-finalization permanently and lets operators finalize the upgrade manually (as I understand it, this was the old behaviour; see the sketch below). If there is a general preference to retain auto-finalization, this could default to "off".
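
To illustrate the last alternative: with preserve_downgrade_option set, finalization today is a manual step along these lines (a sketch of the documented workflow, not proposed new syntax):

```sql
-- After monitoring the upgraded cluster and deciding it is healthy,
-- clearing the setting allows finalization to proceed.
RESET CLUSTER SETTING cluster.preserve_downgrade_option;

-- Once finalization completes, the cluster version reflects the new release.
SHOW CLUSTER SETTING version;
```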

Jira issue: CRDB-3471

@blathers-crl

blathers-crl bot commented Dec 14, 2020

Hello, I am Blathers. I am here to help you get the issue triaged.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

  • Join our community slack channel and ask on #cockroachdb.
  • Try to find someone from here if you know they worked closely on the area and CC them.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.

@blathers-crl blathers-crl bot added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Dec 14, 2020
@ajwerner
Contributor

Thanks for raising the issue; it's a valid concern. One other note I'll make: it's best to roll out the new version incrementally, first upgrading one node and making sure everything looks good. Finalization cannot happen until all of the nodes are running the new version.

Also, even if we don't change the default, I think we can do more to sanity-check the upgrade. Namely, we should probably do more to validate that the state of the cluster is stable, perhaps by waiting to make sure nodes have been up for a little while and also checking that the schema seems valid.

For what it's worth, in the next release (21.1) we'll be introducing a long-running migration framework to perform durable migrations which today take multiple versions. These migrations will not be able to run until after the cluster version has been finalized. They may be yet another reason to revisit the user story here.

In short, keeping auto-finalization as the default, but only triggering it after a significantly longer amount of time (days?) and with the caveat that all nodes stay up during that period, seems much more reasonable to me.
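
Roughly the kind of checks I have in mind, sketched against crdb_internal (table and column names here are from memory and may differ across versions):

```sql
-- Every node should be live, report the new binary, and have been up for a
-- while before finalizing. (Column names may vary by version.)
SELECT node_id, build_tag, started_at, is_live
FROM crdb_internal.gossip_nodes;

-- No obviously broken schema descriptors should exist before finalizing.
-- (crdb_internal.invalid_objects is only available in newer versions.)
SELECT * FROM crdb_internal.invalid_objects;
```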

@ajwerner
Contributor

cc @vy-ton, who's been thinking about the upgrade user story a bit lately.

@irfansharif
Contributor

Some internal discussion here and here.

@github-actions

github-actions bot commented Sep 5, 2023

We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
