storage: move off dead/decommissioning stores via atomic swaps #41190
Conversation
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)
pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):
} func requireOnlyAtomicChanges(
worth deduping this with the one in replicate_queue_test.go?
pkg/storage/client_raft_test.go, line 3044 at r1 (raw file):
const single = 1 // The range should drop down to one replica, neither of which is on a decommissioning store.
nit: "neither of which" doesn't sound right to me here
pkg/storage/replicate_queue.go, line 347 at r1 (raw file):
return false, nil } removeIdx := -1 // guaranteed to be changed below
worth being defensive and putting an AssertionFailedf to verify this below (esp in case it rots)?
pkg/storage/replicate_queue.go, line 358 at r1 (raw file):
decommissioningReplicas := rq.allocator.storePool.decommissioningReplicas( desc.RangeID, voterReplicas) if len(decommissioningReplicas) == 0 {
nit: slightly more clear (to me at least) if you move this below the for/range and make the condition removeIdx == -1
pkg/storage/replicate_queue.go, line 390 at r1 (raw file):
return true, err default: log.Fatalf(ctx, "unknown allocator action %v", action)
log.Fatal seems excessive? thoughts on returning an error instead?
pkg/storage/replicate_queue.go, line 455 at r1 (raw file):
} if removeIdx >= 0 && newStore.StoreID == existingReplicas[removeIdx].StoreID { return false, errors.Errorf("allocator suggested to replace replica on s%d with itself", newStore.StoreID)
AssertionFailedf? (this is borderline, but i like the documentation of intent that it gives)
TFTR! I had to skip the test under stressrace and I remembered to delete the test I wanted to delete in place of the new one (it used mtc, and was also skipped under stress).
Going to stress this a bit on my gceworker to make sure we're not immediately eating a flake. BTW this test can take a long time, presumably for the same reason as in #41122.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)
pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
worth deduping this with the one in replicate_queue_test.go?
Hmm, not an unreasonable ask, but I'm going to leave as is if you don't mind. I don't think the duplication does any harm and I have few cycles left this release.
pkg/storage/replicate_queue.go, line 358 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: slightly more clear (to me at least) if you move this below the for/range and make the condition removeIdx == -1
But then I have to guard the access to decommissioningReplicas[0]?
pkg/storage/replicate_queue.go, line 390 at r1 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
log.Fatal seems excessive? thoughts on returning an error instead?
Hmm, this should really never happen, and if it did, the queue probably wouldn't work very well anyway? Errors here tend not to show up on clients. I made this an error anyway, since it's near impossible to imagine an action simply not being implemented with tests still passing.
pkg/storage/replicate_queue.go, line 456 at r1 (raw file):
allocator suggested to replace repl
FYI I’m not going to get a chance to look at this again today so feel free to merge before I do. Everything I had was minor, so if I do find something on a second pass we can do a followup PR
TFTR! CI was red because I left an unused method around. My gceworker:
bors r=danhhz
Build failed (retrying...)
Build failed
This is the last known work item for cockroachdb#12768. Previously, the replicate queue would use an add-and-remove strategy to get replicas off decommissioning or dead stores. Now it uses swaps whenever it can, subject to the usual restrictions around single-replica groups and cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None
Ugh, had another unreachable code path. Fixed now and verified.
bors r=danhhz
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @tbg)
pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Hmm, not an unreasonable ask, but I'm going to leave as is if you don't mind. I don't think the duplication does any harm and I have few cycles left this release.
Yup, totally fine. Worth a TODO if you have to touch this again, but don't eat a CI cycle over it
pkg/storage/replicate_queue.go, line 358 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But then I have to guard the access to decommissioningReplicas[0]?
Sorry, brain fart
41190: storage: move off dead/decommissioning stores via atomic swaps r=danhhz a=tbg

This is the last known work item for #12768. Previously, the replicate queue would use an add-and-remove strategy to get replicas off decommissioning or dead stores. Now it uses swaps whenever it can, subject to the usual restrictions around single-replica groups and cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Build succeeded