-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Add cluster upgrade logic for backward-compatible migrations #11658
Conversation
accd9f8
to
9f29b78
Compare
Reviewed 4 of 4 files at r1, 4 of 4 files at r2, 1 of 1 files at r3. pkg/internal/client/lease.go, line 128 at r3 (raw file):
This should probably use a conditional put instead of manually checking for the expected value. pkg/internal/client/lease.go, line 169 at r3 (raw file):
Also use a conditional put here. pkg/migrations/migrations.go, line 111 at r3 (raw file):
I think I'd rather run the migrations at new cluster initialization than have tables be created one way for new clusters and a different way for old clusters. We shouldn't have too many of these and they should be pretty fast on an empty cluster. (we could pass a flag in to tell the work functions that the cluster is starting from scratch if that could save a significant amount of work) pkg/migrations/migrations.go, line 122 at r3 (raw file):
We should check whether any migrations are necessary before grabbing the lease (and then again after we have it). In the common case we won't have to do anything so we won't have to have every new server process do two writes to the lease record. pkg/migrations/migrations.go, line 142 at r3 (raw file):
Returning an error here kills the server. Instead, we need to wait for the lease holder to finish its migrations (or for the lease to expire so we can pick up the work). pkg/migrations/migrations.go, line 166 at r3 (raw file):
This seems harsh. Instead, could we use the lease expiration as a deadline for the transactions used by the work functions (similar to the way we use table metadata leases)? pkg/migrations/migrations.go, line 219 at r3 (raw file):
We don't generally set MaxRetries in our retry loops, we just keep looping until they succeed. Or we just run things once and return the error (client.DB has internal retries so I don't think we generally need to wrap another loop around a single Put). Why do we loop here but we don't call the workFn in a loop, for example? Comments from Reviewable |
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. pkg/internal/client/lease.go, line 51 at r1 (raw file):
nit: ClientID pkg/migrations/migrations.go, line 118 at r2 (raw file):
I'm not a fan of the inline bool comments. They're harder to wrap, the stutter makes it harder to read, and it's too easy for them to rot. A little better is this pattern
pkg/migrations/migrations.go, line 147 at r2 (raw file):
maybe I need to rtfm, but how is a buffered channel of size 1 different than an unbuffered channel? pkg/migrations/migrations.go, line 176 at r2 (raw file):
you could tighten the scope on this by moving it inside the retry and completedMigrations up here pkg/migrations/migrations.go, line 181 at r2 (raw file):
Either use 0 (unlimited) or actually do the paging, plz pkg/migrations/migrations.go, line 190 at r2 (raw file):
we use pkg/migrations/migrations.go, line 237 at r2 (raw file):
why not pkg/migrations/migrations.go, line 122 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Grabbing the lease before checking if there are migrations also linearizes startup of nodes that you bring up at the same time, no? pkg/migrations/migrations_test.go, line 136 at r2 (raw file):
if you pass an empty string to t.Run, it numbers them for you pkg/server/server.go, line 505 at r2 (raw file):
this might be a bit too paranoid, but you could select between this and the s.stopper.stopper channel in case this node gets ctrl-c'd before the migrations finish pkg/server/server.go, line 659 at r2 (raw file):
Comments from Reviewable |
9f29b78
to
117fc42
Compare
Thanks for the reviews! Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. pkg/internal/client/lease.go, line 51 at r1 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Done. pkg/internal/client/lease.go, line 128 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I kind of like having the added safety of detecting if our local state is incorrect and immediately clearing out our local lease if we're not actually the owner. We certainly don't expect that to ever happen, but if it did this would protect us against multiple nodes thinking they're the leaseholder for too long. Do you think the safety check isn't worth the extra code? pkg/internal/client/lease.go, line 169 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done. The change will mean we don't release the lease if we're wrong about its expiration time, but that'll just be annoying, not dangerous. We also don't expect it to ever happen, of course. pkg/migrations/migrations.go, line 118 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Sure, done. pkg/migrations/migrations.go, line 147 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
You can send one item on it without blocking: It's not super important, but it allows us to release the lease and continue on with starting up the server without needing to block on the extend task. pkg/migrations/migrations.go, line 176 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Eh, I realize it's pretty subjective, but I find it more readable when the code in the retry as short and simple as possible rather than sticking work inside of it that doesn't actually need to be retried. pkg/migrations/migrations.go, line 181 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Ah, unlimited is perfect. Although I certainly hope we don't ever have 10k migrations. pkg/migrations/migrations.go, line 190 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
For memory usage reasons? I've always seen/used bools just because it makes reads against the map really simple, since the default value of bool works perfectly. And we have at least a handful of places using a pattern like this (albeit sql/drop.go was added by me):
pkg/migrations/migrations.go, line 237 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
No good reason. Switched. pkg/migrations/migrations.go, line 111 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I agree it'd be ideal if all tables were created/modified the same way regardless of whether you're running an old cluster or a new cluster. However, I'm worried that might make some unit tests pretty awkward though -- for example, will SQL tests that rely on the I can double check whether there's any effect after working through #5887, since my worries may be completely unfounded. pkg/migrations/migrations.go, line 122 at r3 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Good point, it's as if I forgot that our kv store could handle concurrent operations 🐶 pkg/migrations/migrations.go, line 142 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That's fair, my time at Google probably influenced me a little too much here -- I've been conditioned to prefer failing fast at startup rather than waiting a long time, but I suppose that's the wrong tradeoff in a lot of other environments. Switched to retrying indefinitely. pkg/migrations/migrations.go, line 166 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
If I'm understanding correctly, that would assume that the work functions will use kv transactions and that they won't take longer than the initial lease duration, since it looks like we don't extend transactions' deadlines. I'll cancel the runner's context instead, given that that's a little more graceful. Is that sufficient, or do you think we should do more? pkg/migrations/migrations.go, line 219 at r3 (raw file): I've also removed the retry loop when getting the list of completed migrations.
I figured the details of the workFn could handle its own retries internally based on the particular logic involved. pkg/migrations/migrations_test.go, line 136 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Well that's handy. Done. We have a good handful of other tests that do this too. I'll send out a separate PR to clean those up. pkg/server/server.go, line 505 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Good call, although we'll have to fix #10138 before the stopper will ever really help on a real node. pkg/server/server.go, line 659 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Done. Although after adding in the int cast, I'm not sure it's much better. Comments from Reviewable |
Review status: 3 of 9 files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. pkg/internal/client/lease.go, line 128 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
A conditional put failure returns a ConditionFailedError, which you can catch and use to do these extra safety checks pkg/migrations/migrations.go, line 147 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Makes sense pkg/migrations/migrations.go, line 190 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Do the same search for `git grep "map[.*]struct" : - ) Yeah less memory was the original motivation, but in this case it's more for consistency pkg/migrations/migrations.go, line 142 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
I think the people I've asked about this are fairly evenly divided on whether to block or exit. I was originally imaging erroring because at Foursquare that would have made it obvious much more quickly that your roll was bad Comments from Reviewable |
117fc42
to
3660725
Compare
Review status: 2 of 9 files reviewed at latest revision, 17 unresolved discussions. pkg/internal/client/lease.go, line 128 at r3 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Ah, thanks for pointing that out. pkg/migrations/migrations.go, line 190 at r2 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Touche :) I can't say I've changed my preference yet, but I've changed the code. pkg/migrations/migrations.go, line 142 at r3 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
Right, that's been my experience as well. I'll wait for Ben to chime in again before changing anything. Comments from Reviewable |
Reviewed 2 of 7 files at r4, 4 of 4 files at r5, 1 of 1 files at r6. pkg/migrations/migrations.go, line 190 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
It's not just for the memory savings - the pkg/migrations/migrations.go, line 142 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
If another node acquired the lease, then they're working on the migrations and may need this node to be up for those migrations to complete. We need to treat "another node has the lease" separately from other errors (for other errors, bailing out early is fine). pkg/migrations/migrations.go, line 166 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yes, that's assuming that the migrations would either use KV transactions or we'd build a similar deadline mechanism for whatever they use instead. We may not extend transactions' deadlines today but I think we could if we wanted to (the only time the deadline really matters is on EndTransaction). Canceling the context is fine, but no matter what we do here there's going to be a chance of two migrations running concurrently for at least a short time. Only a deadline in the migrations themselves can guard against that completely. pkg/server/server.go, line 659 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
This could be Comments from Reviewable |
Review status: all files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. pkg/migrations/migrations.go, line 142 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Gah, of course! Thanks for bringing that up. That hadn't come up in an issue in my testing because the migrations I was running were fast enough that the other nodes didn't time out while waiting for the lease. I'll leave this as is to just continue blocking and logging the error, but I've added an error type to the lease client in case another use case needs to differentiate or we decide to bail out for other errors. pkg/migrations/migrations.go, line 166 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Crashing the server could also guard against that pretty completely, but I understand your point. I may be missing some subtlety here, but at a high level it seems like as long as the migrations are idempotent (which we should be requiring anyway), wrapping them in ACID transactions should mean migrations running on the same time on different machines should actually be safe anyways, since they'll be serialized anyway. What change do you propose I make here? Implement a way to pass an updatable deadline to the migrations' transaction? Institute a standard practice of the migrations checking the lease expiration time before attempting to commit their changes? Leave it as is? Go back to crashing? pkg/server/server.go, line 659 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Done, thanks. Comments from Reviewable |
7b7bf34
to
1944f02
Compare
Reviewed 2 of 7 files at r7, 4 of 4 files at r8, 1 of 1 files at r9. pkg/migrations/migrations.go, line 111 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Yeah, my thinking is that tests would run through the migrations on startup and we'd create the initial tables using their original FormatVersion and migrate forward. I'm expecting that these migrations will be rare enough (and cheap enough on empty in-memory databases) that it won't be too painful to add a little setup work. I'm willing to change my mind here if it's too expensive to always go through the migration process, but I think I'd rather start by always doing it the same way. pkg/migrations/migrations.go, line 166 at r3 (raw file): Previously, a-robinson (Alex Robinson) wrote…
Crashing the server still leaves a small race - what if we blocked in ExtendLease for an unexpectedly long time and we're now well past our lease expiration? But you're right that we'll need to make these migrations idempotent (if they're not transaction-based), so maybe it doesn't matter. I think for now let's go back to crashing if we're unable to extend the lease; this is something to keep in mind as we're implementing actual migrations. Comments from Reviewable |
1944f02
to
411b40c
Compare
Review status: all files reviewed at latest revision, 15 unresolved discussions, all commit checks successful. pkg/migrations/migrations.go, line 111 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Alright, removed all the InitNewCluster stuff. I hope it works out this way! pkg/migrations/migrations.go, line 166 at r3 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Good point, makes sense. Switched back to killing the process for now, along with a note that we may want to improve it. Comments from Reviewable |
411b40c
to
bb727fe
Compare
but you should probably wait for @bdarnell. Thanks for picking this off! Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. pkg/server/server.go, line 515 at r11 (raw file):
I'm not sure either way, but maybe this should be ShouldQuiese Comments from Reviewable |
bb727fe
to
4cc1a79
Compare
Review status: 4 of 9 files reviewed at latest revision, 16 unresolved discussions, all commit checks successful. pkg/server/server.go, line 515 at r11 (raw file): Previously, paperstreet (Daniel Harrison) wrote…
You're right, done. Comments from Reviewable |
Reviewed 3 of 4 files at r11, 1 of 1 files at r12, 1 of 1 files at r13. Comments from Reviewable |
Built to initially be used by the migration code since our existing lease mechanisms for ranges and SQL schemas aren't very generalizable.
This is needed to allow us to make certain changes between versions that wouldn't be safe otherwise.
Now that there's a delay between when a node starts serving kv and when it starts serving SQL (due to the SQL migration code path), acceptance tests that rely on SQL can have a bad time without this.
4cc1a79
to
967327d
Compare
As outlined in #9404.
I have yet to use this to make any substantial changes to a cluster -- I expect that to make doing so convenient this will have to move from its own package into the sql package. For example, it's not easy to add a new system table from outside the SQL package, since that sort of thing doesn't go through the public interfaces.
Anyway, while I put together a more proper SQL migration (such as creating
system.jobs
) and an approach for testing specific real migrations this afternoon, the core of the code is ready for review.cc @vivekmenezes @paperstreet
This change is