rfc: hooks to help with cluster version upgrades #9404
Conversation
> that we can delete the function.
> - The `information_schema` tables could be persisted, removing the specialness
>   around leasing them.
I do like this idea but it needs to be spelled out in more detail for the current set of examples you list here. The reduction in complexity is not obvious from reading the doc.
I think this is valuable even if it's not a net reduction in complexity. As I said in #7073, I think that allowing every node to spam `CREATE TABLE IF NOT EXISTS` commands at startup is not the best way to do migrations and will tie our hands when our needs get more complicated. I'd rather designate one node to manage the migration.
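For concreteness, one way to designate a single migration coordinator is a lease acquired with a conditional put and kept alive by periodic heartbeats. The Go sketch below is illustrative only; the key name, record shape, and store interface are assumptions, not CockroachDB's actual API.

```go
package migrations

import (
	"context"
	"time"
)

// leaseKey is a hypothetical system key under which the coordinator lease lives.
const leaseKey = "/System/Migrations/Lease"

// leaseRecord is an assumed shape for the lease value.
type leaseRecord struct {
	Owner      string
	Expiration time.Time
}

// leaseStore is a stand-in for the KV operations the coordinator needs.
type leaseStore interface {
	// Get returns the current lease record, or nil if the key is absent.
	Get(ctx context.Context, key string) (*leaseRecord, error)
	// ConditionalPut writes next only if the current value still equals
	// expected (nil meaning "absent"); it returns false on a mismatch.
	ConditionalPut(ctx context.Context, key string, next, expected *leaseRecord) (bool, error)
}

// acquireLease takes the migration lease if it is unowned or expired. The
// winner runs the outstanding migrations and must keep extending Expiration
// (heartbeating); everyone else waits for the lease to be released or expire.
func acquireLease(ctx context.Context, s leaseStore, owner string, ttl time.Duration) (bool, error) {
	cur, err := s.Get(ctx, leaseKey)
	if err != nil {
		return false, err
	}
	if cur != nil && time.Now().Before(cur.Expiration) {
		return false, nil // another node holds a live lease
	}
	next := &leaseRecord{Owner: owner, Expiration: time.Now().Add(ttl)}
	return s.ConditionalPut(ctx, leaseKey, next, cur)
}
```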
Okay, I reworked the doc quite a bit. Let me know if this is any more clear
Generally LGTM, although it could use more details about the exact leasing process and how the `SystemVersion` table gets bootstrapped.
> migration hook.
>
> A new kv entry `SystemVersion` (ClusterVersion?) is introduced in the system
> keys to keep track of which of these hooks was the last to run. It stores a
In past systems I've found it useful to make the "version" table store a set of strings (i.e. all the migrations that have been applied) instead of a single value. This facilitates development by letting different migrations be merged to the mainline out of order.
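For concreteness, a minimal sketch of that bookkeeping, storing one key per applied migration rather than a single version value. The key prefix and client interface below are assumptions for illustration, not part of the RFC.

```go
package migrations

import "context"

// appliedPrefix is a hypothetical key prefix; each applied migration gets its
// own entry under it instead of bumping one SystemVersion value.
const appliedPrefix = "/System/Migrations/Applied/"

// kvClient is a stand-in for whatever KV API the coordinator ends up using.
type kvClient interface {
	Get(ctx context.Context, key string) ([]byte, error)
	Put(ctx context.Context, key string, value []byte) error
}

// isApplied reports whether the named migration has already run on this cluster.
func isApplied(ctx context.Context, db kvClient, name string) (bool, error) {
	v, err := db.Get(ctx, appliedPrefix+name)
	if err != nil {
		return false, err
	}
	return v != nil, nil
}

// markApplied records a migration as done. Because each migration has its own
// key, independent migrations can merge to mainline (and run) in any order.
func markApplied(ctx context.Context, db kvClient, name string) error {
	return db.Put(ctx, appliedPrefix+name, []byte("applied"))
}
```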
Agreed. I was wary of overengineering, but since you're concerned about this too, I added something to address it. (Plus I tweaked it a bit so you don't have to know the cockroach version your code is going into, which might not always be straightforward.)
> to run the `./cockroach cluster-upgrade` cli command using a binary of the
> version being upgraded to. For ease of deployment, rolling one node will also do
> this upgrade before its normal startup. Any nodes rolled after the first, but
> before the cluster is upgraded, will panic to make it very clear the roll is
This seems backwards to me. It's easier for the new code to be compatible with the old data than for the migration to be performed while the old code is still running. I would prefer to roll out new code everywhere (in backwards-compatibility mode) and then perform the update once there are no more old nodes.
That was my gut feeling too, but if you're adding a system table, then you have to handle it not existing everywhere and there's no problem adding it with old code. I suspect many of our uses will fall into this camp.
And when you need the intermediate state, you can always do two migrations.
Unless you're going to continue hard-coding descriptors for these dynamically-created tables, you have to handle the case that the table exists but hasn't been loaded. So I think you could have a single piece of code that loads the table descriptor and creates it (or waits for someone else to do so) if it doesn't exist yet.
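A sketch of the single code path being suggested here, under assumed names and a placeholder catalog interface (this is not the actual CockroachDB descriptor API):

```go
package migrations

import (
	"context"
	"errors"
)

// TableDescriptor is a placeholder for the real descriptor type.
type TableDescriptor struct {
	Name string
}

// errDescriptorNotFound is what the placeholder catalog returns for a missing table.
var errDescriptorNotFound = errors.New("descriptor not found")

// catalog abstracts the two operations the helper needs.
type catalog interface {
	LookupDescriptor(ctx context.Context, name string) (*TableDescriptor, error)
	CreateTable(ctx context.Context, schema string) error
}

// ensureTable loads the descriptor for name, first creating the table if it
// does not exist yet, so callers never need to care whether the migration
// that adds the table has already run on this cluster.
func ensureTable(ctx context.Context, c catalog, name, schema string) (*TableDescriptor, error) {
	desc, err := c.LookupDescriptor(ctx, name)
	if err == nil {
		return desc, nil
	}
	if !errors.Is(err, errDescriptorNotFound) {
		return nil, err
	}
	// The table is missing: create it (CREATE TABLE IF NOT EXISTS semantics keep
	// this safe if another node races us), then load the descriptor it produced.
	if err := c.CreateTable(ctx, schema); err != nil {
		return nil, err
	}
	return c.LookupDescriptor(ctx, name)
}
```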
Force-pushed 58485d7 to b005638.
> amount of work involved in this can be kept bounded using the `/SystemVersion`
> described below). If not, it runs them via a migration coordinator. (We should
> also do this check when a node rejoins a cluster it's been partitioned from, but
> I'm not sure how to do that.)
Why? As long as the migration process completed, we know we won't regress. It may be worth repeating the check periodically to see if the node that started the migration process died without finishing it, but I don't see any reason to tie that to rejoining after a partition.
> migration hook.
>
> Migration hooks consist of a name, a work function, and optionally a minimum
> version. In the steady state, there is a map in code between migration names and
I'm not sure that we need this map; I doubt we'll ever have enough migrations here that it's worthwhile to roll them up into a single value. Maintaining this map seems like it would be as much of a headache as giving all the migrations a version number instead of a name.
> The work function must be idempotent (in case a migration coordinator crashes)
> and runnable on some previous version of the cluster. If there is some minimum
> cluster version it needs, that is checked against the one written to
I would define the migrations in an ordered list in the code, and then always execute them in that order (skipping any that are already applied).
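To illustrate the ordered-list alternative, a rough sketch follows; the names, fields, and the way the applied set is persisted are all placeholders, not the RFC's actual definitions.

```go
package migrations

import "context"

// migration is roughly the hook shape the RFC describes: a name, an optional
// minimum cluster version, and an idempotent work function (the coordinator
// may crash and re-run it).
type migration struct {
	name       string
	minVersion string // empty means no minimum cluster version required
	workFn     func(ctx context.Context) error
}

// registry is append-only: new migrations are added at the end, so every
// cluster replays outstanding migrations in the same order.
var registry = []migration{
	{name: "create-system-jobs-table", workFn: createJobsTable},
	{name: "add-root-user", workFn: addRootUser},
}

// runOutstanding walks the registry in order, skipping migrations already
// recorded as applied and recording each new one as it completes.
func runOutstanding(ctx context.Context, applied map[string]bool) error {
	for _, m := range registry {
		if applied[m.name] {
			continue
		}
		if err := m.workFn(ctx); err != nil {
			return err
		}
		applied[m.name] = true // a real implementation persists this durably
	}
	return nil
}

// Placeholder work functions standing in for real migrations.
func createJobsTable(ctx context.Context) error { return nil }
func addRootUser(ctx context.Context) error     { return nil }
```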
> The migration name to version map can be used to order migrations when a cluster
> needs to catch up many versions.
>
> We will introduce a thin cli wrapper `./cockroach cluster-upgrade` around
It sounds like you're talking about running `cluster-upgrade` with a new binary while the cluster is still running an old version. This sounds like it could be tricky from an operational logistics perspective. It's also troubling from a security perspective since this command would presumably need an all-powerful node certificate instead of a user certificate.
> production scenarios. For ease of small clusters and local development, rolling
> one node will also do this upgrade before its normal startup. Any nodes rolled
> after the first, but before the cluster is upgraded, will panic to make it very
> clear the roll is unhealthy.
This seems dangerous. How does a node tell that it is not the first? (what if the first crashes?) What is the recovery procedure if you've rolled out the new binary everywhere and every node but one has crashed? We don't generally guarantee the ability to roll back to an older version.
> Altering system tables will require more than one migration. For example, we'd
> like to remove the reference to `experimental_unique_bytes` from the
> `system.eventlog` table so we can delete the function. A migration is run to add
> a new column with a new name and the code in this Cockroach version sets both on
Why a new column? Don't we just need to change the default value to `uuid_v4`? Or are you talking about changing it to an int?
> `system.eventlog` table so we can delete the function. A migration is run to add
> a new column with a new name and the code in this Cockroach version sets both on
> mutations and fetches both (preferring the new one) on reads. Once the entire
> cluster is on this version, another code change stops reading and writing the
Walk through the entire process of a multi-step update like this from the operator's perspective. Is it `cluster-upgrade` (add column), rollout binary version N (read and write both), `cluster-upgrade` (backfill new column), rollout binary version N+1 (stop using old column), `cluster-upgrade` (delete old column), rollout binary version N+2 (maybe not necessary)?
That sounds terrible; we really don't want to put our users through a process like that. It's definitely not worth it to delete a tiny experimental function, although I assume we'll eventually have better reasons for making a similarly complex change. I think we need to be able to tolerate intermediate states so this can look like a single step (from the user's perspective).
We don't need to solve this part now; it's more important that we get something that works for the simple cases. But this is part of why I still think it would be better to roll out the new binary before starting the migration process.
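For illustration, the two migrations in that sequence might look roughly like the following; the column names, SQL statements, and executor interface are assumptions for this sketch, not the actual schema change.

```go
package migrations

import "context"

// sqlExecutor is a stand-in for whatever internal SQL executor migrations use.
type sqlExecutor interface {
	Exec(ctx context.Context, stmt string) error
}

// Step 1: run before the binary that writes both columns is rolled out. It
// adds the replacement column; that binary then writes both columns and
// prefers the new one on reads.
func addEventlogReplacementColumn(ctx context.Context, ex sqlExecutor) error {
	return ex.Exec(ctx,
		`ALTER TABLE system.eventlog ADD COLUMN "uniqueIDv2" BYTES DEFAULT uuid_v4()`)
}

// Step 2: run only once every node is on the later binary that no longer
// reads or writes the old column; it drops the column and with it the last
// reference to experimental_unique_bytes.
func dropEventlogLegacyColumn(ctx context.Context, ex sqlExecutor) error {
	return ex.Exec(ctx,
		`ALTER TABLE system.eventlog DROP COLUMN "uniqueID"`)
}
```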
> cluster is on this version, another code change stops reading and writing the
> old value. Finally, a second mutation is run to remove the old column. This
> could be done in fewer steps if SQL had something like the "unknown fields"
> support in Protocol Buffers.
It might be worth considering a SQL extension for this (or adopting one if another database has done it). This is a problem for user schema changes too.
Reviewed 1 of 1 files at r2. docs/RFCS/cluster_upgrade_tool.md, line 38 at r2 (raw file):
Is this a system-wide kv entry or a node-local kv entry? If the former, then how are upgrades supposed to be coordinated across multiple nodes? docs/RFCS/cluster_upgrade_tool.md, line 46 at r2 (raw file):
|
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful. docs/RFCS/cluster_upgrade_tool.md, line 61 at r2 (raw file):
|
Review status: all files reviewed at latest revision, 14 unresolved discussions, all commit checks successful. docs/RFCS/cluster_upgrade_tool.md, line 30 at r2 (raw file):
Would be good to be able to get a list of the migrations that are required to upgrade to a version. Should be easy to provide with the info available. docs/RFCS/cluster_upgrade_tool.md, line 55 at r2 (raw file):
Ah, per my earlier comment, there should be a docs/RFCS/cluster_upgrade_tool.md, line 61 at r2 (raw file):
|
@a-robinson I'm assuming you're taking over this PR? |
Correct. I've been tied up with some lingering views complexities, but I hope to have some amendments sent out for this by Monday. |
Sorry for the delay, @vivekmenezes! PTAL, I've expanded on the original RFC with a focus on getting something done in the short term while not limiting our longer term options. Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions, some commit checks pending. docs/RFCS/cluster_upgrade_tool.md, line 46 at r2 (raw file):
|
Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 46 at r2 (raw file):
|
(but that's expected from me). I particularly like the way you've handled the short-term design while laying out how it plays into the options for later plans. Please wait for @bdarnell to be happy before merging. Review status: 0 of 1 files reviewed at latest revision, 14 unresolved discussions, some commit checks failed. Comments from Reviewable |
> * Add a `system.jobs` table (#7073)
> * Add root user and authentication to the `system.users` table (#9877)
> * (maybe) Remove the reference to `experimental_unique_bytes` from the
I think this is a "definitely", not a "maybe".
> the hook version.
>
> Our most pressing initial motivations fall into the first model, so it will be
> the focus of this RFC.
This model doesn't handle "changing the encoding of the `RaftAppliedIndexKey` and `LeaseAppliedIndexKey` keys" or "switching from nanoseconds to microseconds for storing timestamps and intervals". I think that's fine, but perhaps worth calling out that the types of migrations this enables are mostly constrained to system tables and their schemas.
Review status: 0 of 1 files reviewed at latest revision, 16 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 24 at r3 (raw file):
|
Force-pushed ceb571d to 3c5eec1.
Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 24 at r3 (raw file):
|
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 75 at r3 (raw file):
|
Thanks for the reviews! Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 24 at r3 (raw file):
|
Force-pushed 3c5eec1 to 05ade6f.
Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions, some commit checks failed. docs/RFCS/cluster_upgrade_tool.md, line 208 at r4 (raw file):
|
Force-pushed 05ade6f to bd426ec.
Review status: 0 of 1 files reviewed at latest revision, 18 unresolved discussions. docs/RFCS/cluster_upgrade_tool.md, line 208 at r4 (raw file):
|
Force-pushed bd426ec to 25950e0.
Rebased to appease teamcity |
Force-pushed 25950e0 to dde63d2.
Force-pushed dde63d2 to 6bf9050.
Merged the two commits together at Tamir's suggestion. |
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks pending. docs/RFCS/cluster_upgrade_tool.md, line 151 at r5 (raw file):
Does the admin need to look at the release notes? Can you imagine an implementation where you point the new version at the current cluster and it runs a diff and suggests commands that need to run before the cluster can be upgraded? Comments from Reviewable |
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks pending. docs/RFCS/cluster_upgrade_tool.md, line 208 at r4 (raw file):
|
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, some commit checks pending. docs/RFCS/cluster_upgrade_tool.md, line 151 at r5 (raw file):
|
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, all commit checks successful. docs/RFCS/cluster_upgrade_tool.md, line 208 at r4 (raw file):
|
Review status: 0 of 1 files reviewed at latest revision, 19 unresolved discussions, all commit checks successful. docs/RFCS/cluster_upgrade_tool.md, line 208 at r4 (raw file):
|
Reviewed 1 of 1 files at r5. docs/RFCS/cluster_upgrade_tool.md, line 20 at r5 (raw file):
heh, this heading is weird. Examples of what? Maybe it's clear when rendered (since this is a subhead of Motivation). Feel free to push back. docs/RFCS/cluster_upgrade_tool.md, line 48 at r5 (raw file):
"in terms which" -- something off about this sentence. docs/RFCS/cluster_upgrade_tool.md, line 59 at r5 (raw file):
Can we enforce this using the same mechanism as schema leases? Something to think about. docs/RFCS/cluster_upgrade_tool.md, line 82 at r5 (raw file):
nit: either backward-{in,}compatible, or backwards-{in,}compatible, but not both (the difference is the docs/RFCS/cluster_upgrade_tool.md, line 97 at r5 (raw file):
rather than "heartbeating a kv entry", can we describe this as a "migration lease"? I'm imagining the machinery to be virtually identical to range leases. docs/RFCS/cluster_upgrade_tool.md, line 101 at r5 (raw file):
"each missing migration" is mentioned twice in this sentence. docs/RFCS/cluster_upgrade_tool.md, line 105 at r5 (raw file):
Should be easy enough to smoke-test this in the migration coordinator tests (i.e. before any migrations have been added). docs/RFCS/cluster_upgrade_tool.md, line 115 at r5 (raw file):
s/created/creates/? docs/RFCS/cluster_upgrade_tool.md, line 142 at r5 (raw file):
s/and which/and the/ docs/RFCS/cluster_upgrade_tool.md, line 145 at r5 (raw file):
also might be helped by using a "version lease" (same mechanism as schema leases) docs/RFCS/cluster_upgrade_tool.md, line 168 at r5 (raw file):
nit:s/Cockroach\b/CockroachDB/g docs/RFCS/cluster_upgrade_tool.md, line 170 at r5 (raw file):
How do you determine you're a one-node cluster? docs/RFCS/cluster_upgrade_tool.md, line 172 at r5 (raw file):
s/startup, as/startup as/ docs/RFCS/cluster_upgrade_tool.md, line 226 at r5 (raw file):
s/short-term,/,/, s/backwards /backwards-/ docs/RFCS/cluster_upgrade_tool.md, line 233 at r5 (raw file):
"version leases" may be relevant here as well. docs/RFCS/cluster_upgrade_tool.md, line 256 at r5 (raw file):
s/or/of/ docs/RFCS/cluster_upgrade_tool.md, line 258 at r5 (raw file):
"version leases" again docs/RFCS/cluster_upgrade_tool.md, line 295 at r5 (raw file):
couple of docs/RFCS/cluster_upgrade_tool.md, line 300 at r5 (raw file):
trailing space, i think? Comments from Reviewable |
6bf9050
to
f475049
Compare
Thanks for the review! Review status: 0 of 1 files reviewed at latest revision, 38 unresolved discussions. docs/RFCS/cluster_upgrade_tool.md, line 20 at r5 (raw file):
|
Reviewed 1 of 1 files at r6. docs/RFCS/cluster_upgrade_tool.md, line 170 at r5 (raw file):
|
f475049
to
c39a9e3
Compare
Review status: all files reviewed at latest revision, 29 unresolved discussions, all commit checks successful. docs/RFCS/cluster_upgrade_tool.md, line 100 at r6 (raw file):
|
Reviewed 1 of 1 files at r7. Comments from Reviewable |
@vivekmenezes This is an idea I had that might help with #7073, among other things. WDYT?