
storage: move off dead/decommissioning stores via atomic swaps #41190

Merged 1 commit into cockroachdb:master on Oct 1, 2019

Conversation

@tbg (Member) commented Sep 30, 2019

This is the last known work item for #12768. Previously, the replicate
queue would use an add-and-remove strategy to get replicas off
decommissioning or dead stores. Now it uses swaps whenever it can,
subject to the usual restrictions around single-replica groups and
cluster versions.

Release justification: feature completes atomic replication changes.

Release note: None
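
For context, the mechanical difference between the two strategies, as a minimal self-contained sketch. The `change` and `apply` names here are hypothetical stand-ins for the ChangeReplicas machinery, not the CockroachDB API:

```go
package main

import "fmt"

// change is a hypothetical stand-in for a single replica addition or
// removal on a range.
type change struct {
	op      string // "add" or "remove"
	storeID int
}

// apply stands in for one ChangeReplicas-style request; everything
// passed in a single call commits atomically.
func apply(changes ...change) {
	fmt.Println("one config change:", changes)
}

func main() {
	// Old strategy: two separate config changes. In between, a
	// three-replica range briefly sits at four replicas, where the
	// quorum of three is more fragile than the steady state.
	apply(change{"add", 4})
	apply(change{"remove", 2})

	// New strategy: a single atomic swap (via a joint configuration),
	// so the range never passes through the intermediate state.
	apply(
		change{"add", 4},
		change{"remove", 2},
	)
}
```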


@danhhz (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):

}

func requireOnlyAtomicChanges(

worth deduping this with the one in replicate_queue_test.go?


pkg/storage/client_raft_test.go, line 3044 at r1 (raw file):

	const single = 1

	// The range should drop down to one replica, neither of which is on a decommissioning store.

nit: "neither of which" doesn't sound right to me here


pkg/storage/replicate_queue.go, line 347 at r1 (raw file):

			return false, nil
		}
		removeIdx := -1 // guaranteed to be changed below

worth being defensive and putting an AssertionFailedf to verify this below (esp in case it rots)?
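Concretely, the suggested guard would look something like this (a sketch with hypothetical types and a paraphrased loop, using cockroachdb/errors; not the actual diff):

```go
package replicatequeue

import "github.com/cockroachdb/errors"

type replica struct{ decommissioning bool }

// pickRemoveIdx sketches the suggested defensive pattern: rather than
// trusting the "guaranteed to be changed below" comment, assert it,
// so a future refactor that breaks the invariant fails loudly.
func pickRemoveIdx(voters []replica) (int, error) {
	removeIdx := -1 // expected to be set by the loop below
	for i, r := range voters {
		if r.decommissioning {
			removeIdx = i
			break
		}
	}
	if removeIdx == -1 {
		return 0, errors.AssertionFailedf("no decommissioning replica found among voters")
	}
	return removeIdx, nil
}
```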


pkg/storage/replicate_queue.go, line 358 at r1 (raw file):

		decommissioningReplicas := rq.allocator.storePool.decommissioningReplicas(
			desc.RangeID, voterReplicas)
		if len(decommissioningReplicas) == 0 {

nit: slightly more clear (to me at least) if you move this below the for/range and make the condition removeIdx == -1
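Roughly the shape under discussion (paraphrased, hypothetical names); as the exchange below works out, the merged ordering lets the later decommissioningReplicas[0] access lean on the early return:

```go
package replicatequeue

// replicaDesc is a hypothetical stand-in for a replica descriptor.
type replicaDesc struct{ StoreID int }

// As merged: the up-front len check guarantees the slice is non-empty,
// so the later [0] access needs no separate guard. Moving the check
// below the for/range (keying on removeIdx == -1) would reintroduce
// the need for one.
func pickDecommissionTarget(decommissioningReplicas []replicaDesc) (replicaDesc, bool) {
	if len(decommissioningReplicas) == 0 {
		return replicaDesc{}, false
	}
	// ... for/range over the voters sets removeIdx here ...
	return decommissioningReplicas[0], true // safe: emptiness checked above
}
```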


pkg/storage/replicate_queue.go, line 390 at r1 (raw file):

		return true, err
	default:
		log.Fatalf(ctx, "unknown allocator action %v", action)

log.Fatal seems excessive? thoughts on returning an error instead?
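What returning an error instead would look like, sketched with paraphrased names (tbg adopts this below); the queue then logs and skips the range rather than taking the node down:

```go
package replicatequeue

import "github.com/cockroachdb/errors"

type allocatorAction int

// handleAction sketches the default arm returning an error instead of
// calling log.Fatal and crashing the process.
func handleAction(action allocatorAction) (bool, error) {
	switch action {
	// ... known actions handled in cases above ...
	default:
		return false, errors.Errorf("unknown allocator action %v", action)
	}
}
```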


pkg/storage/replicate_queue.go, line 455 at r1 (raw file):

	}
	if removeIdx >= 0 && newStore.StoreID == existingReplicas[removeIdx].StoreID {
		return false, errors.Errorf("allocator suggested to replace replica on s%d with itself", newStore.StoreID)

AssertionFailedf? (this is borderline, but I like the documentation of intent that it gives)
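
Sketched with a hypothetical helper (the real check is inline in the queue, and uses the proper store ID type rather than int):

```go
package replicatequeue

import "github.com/cockroachdb/errors"

// checkNotSelfSwap illustrates the AssertionFailedf variant: the error
// type itself documents that hitting this path means an allocator bug,
// not an expected runtime condition.
func checkNotSelfSwap(newStoreID, removedStoreID int) error {
	if newStoreID == removedStoreID {
		return errors.AssertionFailedf(
			"allocator suggested to replace replica on s%d with itself", newStoreID)
	}
	return nil
}
```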

@tbg (Member, Author) left a comment

TFTR! I had to skip the test under stressrace, and I remembered to delete the old test that the new one replaces (it used mtc and was also skipped under stress).

Going to stress this a bit on my gceworker to make sure we're not immediately eating a flake. BTW this test can take a long time, presumably for the same reason as in #41122.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)


pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

worth deduping this with the one in replicate_queue_test.go?

Hmm, not an unreasonable ask, but I'm going to leave it as is if you don't mind. I don't think the duplication does any harm, and I have few cycles left this release.


pkg/storage/replicate_queue.go, line 358 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: slightly more clear (to me at least) if you move this below the for/range and make the condition removeIdx == -1

But then I have to guard the access to decommissioningReplicas[0]?


pkg/storage/replicate_queue.go, line 390 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

log.Fatal seems excessive? thoughts on returning an error instead?

Hmm, this should really never happen, and if it did, the queue probably wouldn't work very well. Errors here tend not to show up on clients. I made this an error anyway, since it's near impossible to imagine an action going unimplemented with tests still passing.


pkg/storage/replicate_queue.go, line 456 at r1 (raw file):

allocator suggested to replace repl

@danhhz (Contributor) commented Sep 30, 2019 via email

@tbg (Member, Author) commented Sep 30, 2019

TFTR! CI was red because I left an unused method around. My gceworker:

1715 runs completed, 0 failures, over 37m44

bors r=danhhz

@craig (bot) commented Sep 30, 2019

Build failed (retrying...)

@craig (bot) commented Sep 30, 2019

Build failed

@tbg (Member, Author) commented Sep 30, 2019

Ugh, had another unreachable code path. Fixed now and verified that make lint passes on gceworker.

bors r=danhhz

@danhhz (Contributor) left a comment

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @tbg)


pkg/storage/client_raft_test.go, line 2933 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm, not an unreasonable ask, but I'm going to leave as is if you don't mind. I don't think the duplication does any harm and I have few cycles left this release.

Yup, totally fine. Worth a TODO if you have to touch this again, but don't eat a CI cycle over it.


pkg/storage/replicate_queue.go, line 358 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But then I have to guard the access to decommissioningReplicas[0]?

Sorry, brain fart

craig bot pushed a commit that referenced this pull request Oct 1, 2019
41190: storage: move off dead/decommissioning stores via atomic swaps r=danhhz a=tbg

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
@craig (bot) commented Oct 1, 2019

Build succeeded

craig bot merged commit 6df6142 into cockroachdb:master on Oct 1, 2019
@knz mentioned this pull request on Oct 2, 2019