
storage: use atomic replication changes in RelocateRange #41084

Merged 1 commit into cockroachdb:master from atomic/relocate on Sep 27, 2019

Conversation

@tbg (Member) commented Sep 25, 2019

I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.


Touches #12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.

Release note: None

@cockroach-teamcity

This change is Reviewable

@tbg tbg requested a review from danhhz September 25, 2019 16:46
transferLease(*leaseTarget)
}

ctx = context.WithValue(ctx, "testing", "testing")
@tbg (Member, Author) commented:

TODO: Need to fix this hack; without it AdminChangeReplicas will run the changes one by one.

@tbg (Member, Author) commented:

Ok, I did something.

@danhhz (Contributor) left a comment

:lgtm: more straightforward than I was thinking it would be

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/client_relocate_range_test.go, line 36 at r1 (raw file):

	require.NoError(t, tc.Servers[0].DB().AdminRelocateRange(ctx, startKey.AsRawKey(), targets))
	var desc roachpb.RangeDescriptor
	require.NoError(t, tc.Servers[0].DB().GetProto(

why not tc.Servers[0].LookupRange?
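
For reference, a minimal sketch of the reviewer's suggestion, assuming the test server's `LookupRange(key) (roachpb.RangeDescriptor, error)` helper is usable here (surrounding test setup elided):

```go
// Sketch: fetch the current descriptor via the test server instead of
// reading the range-descriptor key manually with GetProto.
desc, err := tc.Servers[0].LookupRange(startKey.AsRawKey())
require.NoError(t, err)
```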


pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):

	// A grab bag.
	{
		// s3 -(add)-> s3 s2 -(swap)-> s4 s2 -(add)-> s4 s2 s1

nit: the last part of this comment doesn't match


pkg/storage/client_relocate_range_test.go, line 172 at r1 (raw file):

			relocateAndCheck(t, tc, k, tc.Targets(1, 3, 0))
		})
		// s2 s4 s1 -(add)-> s2 s4 s1 s6

nit: the last part of this comment doesn't match


pkg/storage/replica_command.go, line 2020 at r1 (raw file):

		}
		for _, substr := range whitelist {
			if strings.Contains(err.Error(), substr) {

nit: probably better to be defensive and call err.Error() once
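
A minimal sketch of this nit, using a hypothetical helper around the existing whitelist of retriable error substrings; the point is that err.Error() is computed once rather than on every loop iteration:

```go
// isWhitelistedErr is a hypothetical helper illustrating the suggestion.
func isWhitelistedErr(err error, whitelist []string) bool {
	errStr := err.Error() // stringify the error once
	for _, substr := range whitelist {
		if strings.Contains(errStr, substr) {
			return true
		}
	}
	return false
}
```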


pkg/storage/replica_command.go, line 2064 at r1 (raw file):

		}

		ops, leaseTarget, err := s.relocateOne(ctx, &rangeDesc, targets)

check this err


pkg/storage/replica_command.go, line 2077 at r1 (raw file):

		ctx = context.WithValue(ctx, "testing", "testing")
		newDesc, err := s.DB().AdminChangeReplicas(

nit: seems like this should fit on one line


pkg/storage/replica_command.go, line 2085 at r1 (raw file):

			}
			if every.ShouldLog() {
				log.Warning(ctx, returnErr)

Warning seems a bit high for this given the "descriptor changed" stuff


pkg/storage/replica_command.go, line 2087 at r1 (raw file):

				log.Warning(ctx, returnErr)
			}
			re.Next()

nit: move this into the for {?


pkg/storage/replica_command.go, line 2105 at r1 (raw file):

	rangeReplicas := desc.Replicas().All()
	if len(rangeReplicas) != len(desc.Replicas().Voters()) {
		return nil, nil, crdberrors.AssertionFailedf(

looks like you lost the comment pointing out why this is safe to assert


pkg/storage/replica_command.go, line 2247 at r1 (raw file):

		liReq.Key = desc.StartKey.AsRawKey()
		b.AddRawRequest(liReq)
		if err := s.DB().Run(ctx, &b); err != nil {

we don't already have a helper for this?


pkg/storage/replica_command.go, line 2012 at r2 (raw file):

		//
		// TODO(tbg): remove in 20.1.
		ctx = context.WithValue(ctx, "testing", "testing")

we should update this value name now that it's not just testing


pkg/storage/replica_command.go, line 2091 at r2 (raw file):

		}
		// When a swap is in order but we don't have atomic replication changes,
		// run the ops one by one.

do we need this? ChangeReplicas seems to unroll them

@tbg (Member, Author) left a comment

TFTR! This is basically ready to be merged, I think, but I made enough changes that I'll wait for another pass. I've also noticed that the test I added sometimes takes ~1.5s but other times 11s. Looking into why, it seems there are sometimes unexplained pauses, such as this one:

I190926 09:07:22.951158 4200 storage/replica_command.go:2075  [n4,s4,r25/9:/{Table/Max-Max}] TBG ====== LOOP START
I190926 09:07:22.951197 4200 storage/replica_command.go:2077  [n4,s4,r25/9:/{Table/Max-Max}]        TBG ====== ATTEMPT START
I190926 09:07:22.952095 4303 storage/replica_command.go:1583  [n3,s3,r25/12:/{Table/Max-Max}] change replicas (add [] remove [(n2,s2):8]): existing descriptor r25:/{Table/Max-Max} [(n5,s5):13, (n2,s2):8, (n3,s3):12, (n6,s6):11, next=14, gen=34, sticky=9223372036.854775807,2147483647]
I190926 09:07:25.843486 4303 storage/replica_raft.go:291  [n3,s3,r25/12:/{Table/Max-Max}] proposing REMOVE_REPLICA[(n2,s2):8]: after=[(n5,s5):13 (n6,s6):11 (n3,s3):12] next=14
I190926 09:07:25.844452 1120 storage/store.go:2660  [n2,s2,r25/8:/{Table/Max-Max}] removing replica r25/8

I'm going to spend some time later today figuring out what those are about; there's probably an opportunity to make replication more snappy across the board.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)


pkg/storage/client_relocate_range_test.go, line 168 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: the last part of this comment doesn't match

Added the reordered forms (I intentionally didn't reorder so that each step would be easier to follow along with).


pkg/storage/replica_command.go, line 2085 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Warning seems a bit high for this given the "descriptor changed" stuff

That's the old code, but I agree. Downgraded.


pkg/storage/replica_command.go, line 2087 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: move this into the for {?

I rearranged this a bit -- the for loop was both "do a thing at a time until no more things need doing" and "retry the current thing with a backoff". Pulled those apart so we now have one more layer of for loops and more idiomatic-looking backoff code.
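
Roughly, the restructured control flow looks like the sketch below (helper names like attemptOneStep and isWhitelistedErr are illustrative, not the actual code): an outer loop steps through the remaining relocation operations, while an inner retry loop re-attempts the current step with backoff.

```go
every := log.Every(time.Minute)
for {
	// Inner loop: retry the current relocation step with backoff.
	var allDone bool
	for re := retry.StartWithCtx(ctx, retry.Options{MaxBackoff: 5 * time.Second}); ; re.Next() {
		if err := ctx.Err(); err != nil {
			return err
		}
		done, err := attemptOneStep(ctx) // hypothetical: one relocateOne + AdminChangeReplicas round
		if err == nil {
			allDone = done
			break
		}
		if !isWhitelistedErr(err, whitelist) { // hypothetical retriability check
			return err
		}
		if every.ShouldLog() {
			log.Info(ctx, err)
		}
	}
	if allDone {
		// Outer loop: nothing left to do, the relocation is complete.
		return nil
	}
}
```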


pkg/storage/replica_command.go, line 2247 at r1 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

we don't already have a helper for this?

Not that I'm aware, I've seen helpers only in the testcluster machinery.


pkg/storage/replica_command.go, line 2091 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

do we need this? ChangeReplicas seems to unroll them

No, added a comment

	// When a swap is in order but we're not sure that all nodes are running
	// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
	// mixing additions and removals), don't send such requests but unroll
	// the ops here, running them one by one. 19.2 nodes that don't have the
	// cluster version enabled would unroll them internally, but 19.1 nodes
	// don't even understand the RPC and would interpret all ops as
	// additions (which would fail since we can't add an existing node).

@tbg (Member, Author) left a comment

Ok, I figured it out. TL;DR is that we sometimes end up removing the raft leader and then we have to wait out the 3s election timeout. Filed #41122 to fix this.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz)

@danhhz (Contributor) left a comment

still :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/storage/replica_command.go, line 2091 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

No, added a comment

	// When a swap is in order but we're not sure that all nodes are running
	// 19.2+ (in which the AdminChangeReplicas RPC was extended to support
	// mixing additions and removals), don't send such requests but unroll
	// the ops here, running them one by one. 19.2 nodes that don't have the
	// cluster version enabled would unroll them internally, but 19.1 nodes
	// don't even understand the RPC and would interpret all ops as
	// additions (which would fail since we can't add an existing node).

Aha, yes, good to have this comment


pkg/storage/replica_command.go, line 2075 at r3 (raw file):

	every := log.Every(time.Minute)
	for {
		for re := retry.StartWithCtx(ctx, retry.Options{MaxBackoff: 5 * time.Second}); ; re.Next() {

I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method

@nvanbenschoten (Member) left a comment

Reviewed 1 of 3 files at r1, 5 of 5 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz and @tbg)


pkg/internal/client/db.go, line 564 at r3 (raw file):

// ChangeReplicasCanMixAddAndRemoveContext convinces
// (*client.DB).AdminChangeReplicas of that the caller is aware that 19.1 nodes

s/of //
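
The function being documented here is what replaced the earlier `context.WithValue(ctx, "testing", "testing")` hack. A rough sketch of the pattern, with an illustrative (not actual) key type and lookup helper:

```go
// changeReplicasMixKey is an illustrative unexported context-key type; the
// real key in db.go may differ.
type changeReplicasMixKey struct{}

// ChangeReplicasCanMixAddAndRemoveContext marks the context so that
// (*client.DB).AdminChangeReplicas may send additions and removals in a
// single request (sketch of the mechanism, not the exact implementation).
func ChangeReplicasCanMixAddAndRemoveContext(ctx context.Context) context.Context {
	return context.WithValue(ctx, changeReplicasMixKey{}, true)
}

// canMixAddAndRemove would be consulted before building a mixed request.
func canMixAddAndRemove(ctx context.Context) bool {
	ok, _ := ctx.Value(changeReplicasMixKey{}).(bool)
	return ok
}
```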


pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):

				intercepted = append(intercepted, intercept{
					ops:         ops,
					leaseTarget: leaseTarget,

We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.


pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):

swap out replicas one by one

Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?


pkg/storage/replica_command.go, line 2087 at r3 (raw file):

				// Done.
				// Step 3: Transfer the lease to the first listed target, as the API specifies.
				transferLease(targets[0])

Should we just make relocateOne return this as the leaseTarget in this case?


pkg/storage/replica_command.go, line 2103 at r3 (raw file):

			opss := [][]roachpb.ReplicationChange{ops}
			if !useAtomic && len(ops) == 2 {

Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.

@tbg force-pushed the atomic/relocate branch 6 times, most recently from 7e5a515 to f60f07e on September 27, 2019 09:54
@tbg (Member, Author) left a comment

Got errors under stressrace related to verifying the lease placement, which I fixed; it seems stable on my gceworker. In 19.2 we should look into making the lease transfers mandatory; right now relocate will just fail weirdly if any of them don't go through (though oddly they seem to have a 100% success rate).

Thanks for the reviews!

bors r=danhhz

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @nvanbenschoten)


pkg/storage/client_relocate_range_test.go, line 108 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We don't seem to ever check this. Should we be? I don't see us checking the leaseholder anywhere else, but the comments imply that we know who it will become.

The test verifies that the lease ends up in the right place (see requireLeaseAt); it's hard to predict the exact lease transfers in between, since the allocator ultimately gets to decide what order things happen in, and the allocator tends to be... not deterministic.


pkg/storage/client_relocate_range_test.go, line 138 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
swap out replicas one by one

Isn't this doing the opposite? Or do you mean that it's performing three atomic replica changes, one by one?

Yep. Clarified the comment.


pkg/storage/replica_command.go, line 2075 at r3 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I suspect some of the control flow gets easier to follow (breaks and success bool etc) if this inner loop moves into a method

Maybe, but the return err paths get more awkward. I'm going to leave as is for now since the whole opss thing goes away in 19.2 and with it the weird control flow.


pkg/storage/replica_command.go, line 2103 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why only if len(ops) == 2? If I'm reading the comment above, we'd want to do this whenever !useAtomic.

len(ops) is either 1 or 2, and if it's 1 then we shouldn't make opss have an extra nil slice at the end. I added an assertion just above to make sure this is clear to readers.
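
Concretely, the unrolling being discussed amounts to something like the sketch below (a simplification with a hypothetical runChange helper, not the exact diff), where ops holds either a single change or an add/remove pair:

```go
opss := [][]roachpb.ReplicationChange{ops}
if !useAtomic && len(ops) == 2 {
	// Unroll the swap: issue the addition and the removal separately.
	opss = [][]roachpb.ReplicationChange{ops[:1], ops[1:]}
}
for _, unrolledOps := range opss {
	// Each element of opss becomes its own AdminChangeReplicas request.
	if err := runChange(ctx, unrolledOps); err != nil { // hypothetical helper
		return err
	}
}
```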

craig bot pushed a commit that referenced this pull request Sep 27, 2019
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

craig bot commented Sep 27, 2019

Build failed

Touches cockroachdb#12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in cockroachdb#12768.

PS: the newly added test passes when manually overriding useAtomic with
`false`, and I verified that it indeed uses no atomic replication
changes.

Release note: None
@tbg (Member, Author) commented Sep 27, 2019

Linters were unhappy

bors r=danhhz

craig bot pushed a commit that referenced this pull request Sep 27, 2019
41084: storage: use atomic replication changes in RelocateRange r=danhhz a=tbg

I wrote this up rather quickly, but wanted to get this out for review sooner
rather than later.

----

Touches #12768.

Release justification: the previous code would enter vulnerable
replication configurations when it wasn't necessary, thus undermining
what we wanted to achieve in #12768.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

craig bot commented Sep 27, 2019

Build succeeded

@craig craig bot merged commit d9412d3 into cockroachdb:master Sep 27, 2019