Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: avoid crash in splitPostApply #39658

Merged
merged 1 commit into from
Aug 14, 2019

Conversation

tbg
Copy link
Member

@tbg tbg commented Aug 14, 2019

See #39571.

The above PR was incomplete because we also need to wind back the
replicaID. I was able to confirm that this actually works by running
tests with this diff, which simulates an incoming raft message
reflecting a high replicaID just before the split gets applied.

diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 6d5a044657..981f3e1714 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -1525,7 +1525,7 @@ func (r *Replica) maybeAcquireSplitMergeLock(
 func (r *Replica) acquireSplitLock(
 	ctx context.Context, split *roachpb.SplitTrigger,
 ) (func(), error) {
-	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
+	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 100, nil)
 	if err != nil {
 		return nil, err
 	}

@danhhz is working on a real test for this, so we'll soon have
end-to-end coverage.

Fixes #39651.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from bdarnell August 14, 2019 12:15
See cockroachdb#39571.

The above PR was incomplete because we also need to wind back the
replicaID. I was able to confirm that this *actually* works by running
tests with this diff, which simulates an incoming raft message
reflecting a high replicaID just before the split gets applied.

```diff
diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 6d5a044657..981f3e1714 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -1525,7 +1525,7 @@ func (r *Replica) maybeAcquireSplitMergeLock(
 func (r *Replica) acquireSplitLock(
 	ctx context.Context, split *roachpb.SplitTrigger,
 ) (func(), error) {
-	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
+	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 100, nil)
 	if err != nil {
 		return nil, err
 	}
```

@danhhz is working on a real test for this, so we'll soon have
end-to-end coverage.

Fixes cockroachdb#39651.

Release note: None
Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM though definitely 🔬🐶

@tbg
Copy link
Member Author

tbg commented Aug 14, 2019

TFTR

bors r=danhhz

craig bot pushed a commit that referenced this pull request Aug 14, 2019
39658: storage: avoid crash in splitPostApply r=danhhz a=tbg

See #39571.

The above PR was incomplete because we also need to wind back the
replicaID. I was able to confirm that this *actually* works by running
tests with this diff, which simulates an incoming raft message
reflecting a high replicaID just before the split gets applied.

```diff
diff --git a/pkg/storage/replica_raft.go b/pkg/storage/replica_raft.go
index 6d5a044657..981f3e1714 100644
--- a/pkg/storage/replica_raft.go
+++ b/pkg/storage/replica_raft.go
@@ -1525,7 +1525,7 @@ func (r *Replica) maybeAcquireSplitMergeLock(
 func (r *Replica) acquireSplitLock(
 	ctx context.Context, split *roachpb.SplitTrigger,
 ) (func(), error) {
-	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 0, nil)
+	rightRng, _, err := r.store.getOrCreateReplica(ctx, split.RightDesc.RangeID, 100, nil)
 	if err != nil {
 		return nil, err
 	}
```

@danhhz is working on a real test for this, so we'll soon have
end-to-end coverage.

Fixes #39651.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
tbg added a commit that referenced this pull request Aug 14, 2019
When a heartbeat failed while a request was in-flight, two proto updates
could race and produce illegal behavior. For example, it was possible
for a client to successfully commit but receive an ABORTED proto (and no
error).

This commit removes the ad-hoc manipulation of the state during
heartbeat failures. Instead, before each new request sent through the
heartbeat interceptor we check on the heartbeat loop's status; if it
indicates an error, we refuse new requests.

Fixes #39658.

Release note: None
@craig
Copy link
Contributor

craig bot commented Aug 14, 2019

Build succeeded

@craig craig bot merged commit d822868 into cockroachdb:master Aug 14, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 26, 2019
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave for the way and ultimately
allow carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations
during lateral data movement. See #12768 for details; this is a problem we want
to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which
has made a bug in our code related to learner snapshots much more likely; the
offending upstream commit is patched out of the vendored etcd bump in this PR
at the time of writing.

An antichronological listing of the individual commits follows. They should be
reviewed individually, though it may be helpful to look at the overall diff for
overall context. A modest amount of churn may exist between the commits, though
a good deal of effort went into avoiding this.

    storage: allow atomic replication changes in ChangeReplicas

    They default to OFF.

    This needs a lot more tests which will be added separately in the course of
    switching the default to ON and will focus on the interactions of joint
    states with everything else in the system.

    We'll also need another audit of consumers of the replica descriptors to
    make sure nothing was missed in the first pass.

    Release note: None

    storage: fix replicaGCQueue addition on removal trigger

    Once we enter joint changes, the replica to be removed will show up in
    `crt.Removed()` when the joint state is entered, but it only becomes
    eligible for actual removal when we leave the joint state later. The new
    code triggers at the right time, namely when the replica is no longer in
    the descriptor.

    Release note: None

    storage: let execChangeReplicasTxn construct the descriptor

    Prior to this commit, the method took both an old and a new desc *plus*
    slices of added and removed replicas. This had grown organically, wasn't an
    easily understood interface, led to repetitive and tricky code at the
    callers, and most importantly isn't adequate any more in a world with
    atomic replication changes, where execChangeReplicasTxn in constructing the
    ChangeReplicasTrigger is essentially deciding whether a joint configuration
    needs to be entered (which in turn determines what the descriptor needs to
    look like in the first place). To start solving this, let
    execChangeReplicasTxn create (and on success return) the new descriptor.
    Callers instead pass in what they want to be done, which is accomplished
    via an []internalReplicationChange slice.

    Release note: None

    roachpb: auto-assign ReplicaID during AddReplica

    This is a cleanup leading up to a larger refactor of the contract around
    `execChangeReplicasTxn`.

    Release note: None

    storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

    This prepares the trigger -> raft translation code to properly handle
    atomic replication changes.

    This carries out a lot of validation to give us confidence that any unusual
    transitions would be caught quickly.

    This change also establishes more clearly which added and removed replicas
    are to be passed into the trigger when transitioning into a joint
    configuration. For example, when adding a voter, one technically replaces a
    Learner with a VoterIncoming and so the question is which type the replica
    in the `added` slice should have.  Picking the Learner would give the
    trigger the most power to validate the input, but it's annoying to have
    divergent descriptors floating around, so by convention we say that it is
    always the updated version of the descriptor (i.e. for fully removed
    replicas, just whatever it was before it disappeared). I spent more time on
    this than I'm willing to admit, in particular looking removing the
    redundancy here, but it made things more awkward than was worth it.

    Release note: None

    storage: push replication change unrolling into ChangeReplicas

    There are various callers to ChangeReplicas, so it makes more sense to
    unroll at that level. The code was updated to - in principle - do the right
    thing when atomic replication changes are requested, except that they are
    still unimplemented and a fatal error will serve as a reminder of that. Of
    course nothing issues them yet.

    Release note: None

    storage: skip ApplyConfChange on rejected entry

    When in a joint configuration, passing an empty conf change to
    ApplyConfChange doesn't do the right thing any more: it tells Raft that
    we're leaving the joint config. It's not a good idea to try to tell Raft
    anything about a ConfChange that got rejected. Raft internally knows that
    we handled it because it knows the applied index.

    This also adds a case match for ConfChangeV2 which is necessary to route
    atomic replication changes (ConfChangeV2).

    See etcd-io/etcd#11046

    Release note: None

    storage: un-embed decodedConfChange

    I ate a number of NPEs during development because nullable embedded fields
    are tricky; they hide the pointer derefs that often need a nil check. We'll
    embed the fields of decodedConfChange instead which works out better. This
    commit also adds the unmarshaling code necessary for ConfChangeV2 needed
    once we issue atomic replication changes.

    Release note: None

    storage: add learners one by one

    Doing more than one change at once is going to force us into an atomic
    replication change. This isn't crazy, but seems unnecessary at this point,
    so just add the learners one by one.

    Release note: None

    storage: add fatals where atomic conf changes are unsupported

    These will be upgraded with proper handling when atomic replication changes
    are actually introduced, but for now it's convenient to stub out some code
    that will need to handle them and to make sure we won't forget to do so
    later.

    Release note: None

    storage: add atomic replication changes cluster setting

    This defaults to false, and won't have an effect unless the newly
    introduced cluster version is also active.

    Release note: None

    roachpb: support zero-change ChangeReplicasTrigger

    We will use a ChangeReplicasTrigger without additions and removals when
    transitioning out of a joint configuration, so make sure it supports this
    properly.

    Release note: None

    roachpb: return "desired" voters from ReplicaDescriptors.Voters

    Previous commits introduced (yet unused) voter types to encode joint
    consensus configurations which occur during atomic replication changes.

    Access to the slice of replicas is unfortunately common, though at least
    it's compartmentalized via the getters Voters() and Learners().

    The main problem solved in this commit is figuring out what should be
    returned from Voters(): is it all VoterX types, or only voters in one of
    the two majority configs part of a joint quorum?

    The useful answer is returning the set of voters corresponding to what the
    config will be once the joint state is exited; this happens to be what most
    callers care about. Incoming and full voters are really the same thing in
    our code; we just need to distinguish them from outgoing voters to
    correctly maintain the quorum sizes.

    Of course there are some callers that do care about quorum sizes, and a
    number of cleanups were made for them.

    This commit also adds a ReplicaDescriptors.ConfState helper which is then
    used in all of the places that were previously cobbling together a
    ConfState manually.

    Release note: None

    roachpb: add ReplicaType_Voter{Incoming,Outgoing}

    These are required for atomic replication changes to describe joint
    configurations, i.e. configurations consisting of two sets of replica which
    both need to reach quorum to make replication decisions.

    An audit of existing consumers of this enum will follow.

    Release note: None

    roachpb: rename ReplicaType variants

    The current naming is idiomatic for proto enums, but atypical for its usage
    in Go code. There is no `(gogoproto.customname)` that can fix this, and
    we're about to add more replica types that would require awkward names such
    as `roachpb.ReplicaType_VOTER_OUTGOING`.

    Switch to a Go-friendly naming scheme instead.

    Release note: None

    batcheval: generalize checkNotLearnerReplica

    This now errors out whenever the replica is not a voter, which is more
    robust as new replica types are introduced (which generally should not
    automatically become eligible to receive leases).

    Release note: None

    roachpb: improve RangeDescriptor.Validate

    Make sure there isn't more than one replica per store.

    Release note: None

    roachpb: generalize ReplicaDescriptor.String()

    The new code will generalize to new replica types.

    Release note: None

    [dnm] vendor: bump raft

    This picks up upstream fixes related to atomic membership changes.

    I had to smuggle in a small hack because we're picking up
    etcd-io/etcd#11037 which makes a race between the
    snapshot queue and the proactive learner snapshot much more likely, and
    this in turn makes tests quite flaky because it turns out that if the
    learner snap loses, it can actually error out.

    Release note: None

    storage: avoid fatal error from splitPostApply

    This is the next band-aid on top of #39658 and #39571. The descriptor
    lookup I added sometimes fails because replicas can process a split trigger
    in which they're not a member of the range:

    > F190821 15:14:28.241623 312191 storage/store.go:2172
    > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
    > found in right hand side of split

    I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

    Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for
multi-node cockroach demo clusters.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
tbg added a commit to tbg/cockroach that referenced this pull request Sep 6, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change,
namely the fact that there might be a tombstone for the right
hand side. This continues to be tracked in cockroachdb#40470.

See also
cockroachdb#40367 (comment)
for more work we need to do to establish sanity.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Sep 6, 2019
This is the next band-aid on top of cockroachdb#39658 and cockroachdb#39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

There is one more bug here that is not addressed by this change,
namely the fact that there might be a tombstone for the right
hand side. This continues to be tracked in cockroachdb#40470.

See also
cockroachdb#40367 (comment)
for more work we need to do to establish sanity.

Release note: None
craig bot pushed a commit that referenced this pull request Sep 6, 2019
39796: storage: avoid (one) fatal error from splitPostApply r=ajwerner a=tbg

This is the next band-aid on top of #39658 and #39571.
The descriptor lookup I added sometimes fails because replicas can
process a split trigger in which they're not a member of the range:

> F190821 15:14:28.241623 312191 storage/store.go:2172
> [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
> found in right hand side of split

I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

@danhhz could you take a stab at replicating this in the test you added
recently? It's probably too ambitious but it'd be nice to verify this is
actually fixed (I didn't). I haven't seen this fail on master (though I also
didn't look very hard) so maybe this is rare (i.e. we have time), but then I
just got it doing nothing related in particular.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ccl/partitionccl: TestInitialPartitioning failed under stress
3 participants