raft: fix auto-transitioning out of joint config #11046

Merged Feb 25, 2020 (3 commits)


@tbg tbg commented Aug 16, 2019

The code doing so was undertested and buggy: it would launch multiple
attempts to transition out when the conf change was not the last element in
the log.

This commit fixes the problem and adds a regression test. It also reworks
the code to handle a formerly untested edge case, in which the
auto-transition append is refused. This can't happen any more with the
current version of the code because this proposal has size zero and is
special cased in increaseUncommittedSize. Last but not least, the
auto-leave proposal now also bumps pendingConfIndex, which was not done
previously due to an oversight.
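
The bug and fix described above can be illustrated with a toy model. This is a minimal sketch, not the code in raft/raft.go: the `entry` and `leader` types, the `propose` and `advance` methods, and the exact guard condition are all hypothetical simplifications. It only shows why bumping pendingConfIndex on the auto-leave proposal prevents repeated attempts when the conf change is not the last element in the log.

```go
package main

import "fmt"

// entry is a hypothetical, simplified log entry (not pb.Entry).
type entry struct {
	index      uint64
	leaveJoint bool // true for the empty "leave joint config" proposal
}

// leader is a toy stand-in for leader state; it is not etcd's raft struct.
type leader struct {
	log              []entry
	pendingConfIndex uint64 // index of the latest unapplied conf change
	joint            bool
	autoLeave        bool
}

// propose appends an entry to the log and returns its index.
func (l *leader) propose(e entry) uint64 {
	e.index = uint64(len(l.log)) + 1
	l.log = append(l.log, e)
	return e.index
}

// advance is called whenever the applied index moves forward. It proposes
// the auto-leave entry once the pending conf change has been applied, and
// bumps pendingConfIndex so later calls do not propose it again.
func (l *leader) advance(applied uint64) {
	if l.joint && l.autoLeave && applied >= l.pendingConfIndex {
		l.pendingConfIndex = l.propose(entry{leaveJoint: true})
	}
}

// countLeaves returns how many leave-joint proposals are in the log.
func countLeaves(log []entry) int {
	n := 0
	for _, e := range log {
		if e.leaveJoint {
			n++
		}
	}
	return n
}

func main() {
	l := &leader{joint: true, autoLeave: true}
	// Entry 1 is the conf change that entered the joint config; entries 2
	// and 3 are ordinary proposals, so the conf change is not last in the log.
	l.pendingConfIndex = l.propose(entry{})
	l.propose(entry{})
	l.propose(entry{})
	for i := uint64(1); i <= 3; i++ {
		l.advance(i)
	}
	fmt.Println("leave-joint proposals:", countLeaves(l.log)) // prints 1
}
```

If `advance` did not assign the new index to `pendingConfIndex`, the guard in this toy would stay true for applied indexes 2 and 3 as well, and the log would end up with three leave-joint proposals instead of one.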

tbg added a commit to tbg/cockroach that referenced this pull request Aug 23, 2019
When in a joint configuration, passing an empty conf change to
ApplyConfChange doesn't do the right thing any more: it tells
Raft that we're leaving the joint config. It's not a good idea
to try to tell Raft anything about a ConfChange that got rejected.
Raft internally knows that we handled it because it knows the applied
index.

This also adds a case match for ConfChangeV2 which is necessary to
route atomic replication changes (ConfChangeV2).

See etcd-io/etcd#11046

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this pull request Aug 26, 2019
39936: storage: add (default-off) atomic replication changes r=nvanbenschoten a=tbg

This PR contains a series of commits that first pave the way for, and ultimately
allow, carrying out atomic replication changes via Raft joint consensus.

Atomic replication changes are required to avoid entering unsafe configurations
during lateral data movement. See #12768 for details; this is a problem we want
to address in 19.2.

Before merging this we'll need to sort out an upstream change in Raft which
has made a bug in our code related to learner snapshots much more likely; the
offending upstream commit is patched out of the vendored etcd bump in this PR
at the time of writing.

An antichronological listing of the individual commits follows. They should be
reviewed individually, though it may be helpful to look at the overall diff for
context. A modest amount of churn may exist between the commits, though
a good deal of effort went into avoiding this.

    storage: allow atomic replication changes in ChangeReplicas

    They default to OFF.

    This needs a lot more tests which will be added separately in the course of
    switching the default to ON and will focus on the interactions of joint
    states with everything else in the system.

    We'll also need another audit of consumers of the replica descriptors to
    make sure nothing was missed in the first pass.

    Release note: None

    storage: fix replicaGCQueue addition on removal trigger

    Once we enter joint changes, the replica to be removed will show up in
    `crt.Removed()` when the joint state is entered, but it only becomes
    eligible for actual removal when we leave the joint state later. The new
    code triggers at the right time, namely when the replica is no longer in
    the descriptor.

    Release note: None

    storage: let execChangeReplicasTxn construct the descriptor

    Prior to this commit, the method took both an old and a new desc *plus*
    slices of added and removed replicas. This had grown organically, wasn't an
    easily understood interface, led to repetitive and tricky code at the
    callers, and most importantly isn't adequate any more in a world with
    atomic replication changes, where execChangeReplicasTxn in constructing the
    ChangeReplicasTrigger is essentially deciding whether a joint configuration
    needs to be entered (which in turn determines what the descriptor needs to
    look like in the first place). To start solving this, let
    execChangeReplicasTxn create (and on success return) the new descriptor.
    Callers instead pass in what they want to be done, which is accomplished
    via an []internalReplicationChange slice.

    Release note: None

    roachpb: auto-assign ReplicaID during AddReplica

    This is a cleanup leading up to a larger refactor of the contract around
    `execChangeReplicasTxn`.

    Release note: None

    storage: emit ConfChangeV2 from ChangeReplicasTrigger where appropriate

    This prepares the trigger -> raft translation code to properly handle
    atomic replication changes.

    This carries out a lot of validation to give us confidence that any unusual
    transitions would be caught quickly.

    This change also establishes more clearly which added and removed replicas
    are to be passed into the trigger when transitioning into a joint
    configuration. For example, when adding a voter, one technically replaces a
    Learner with a VoterIncoming and so the question is which type the replica
    in the `added` slice should have.  Picking the Learner would give the
    trigger the most power to validate the input, but it's annoying to have
    divergent descriptors floating around, so by convention we say that it is
    always the updated version of the descriptor (i.e. for fully removed
    replicas, just whatever it was before it disappeared). I spent more time on
    this than I'm willing to admit, in particular looking at removing the
    redundancy here, but it made things more awkward than was worth it.

    Release note: None

    storage: push replication change unrolling into ChangeReplicas

    There are various callers to ChangeReplicas, so it makes more sense to
    unroll at that level. The code was updated to - in principle - do the right
    thing when atomic replication changes are requested, except that they are
    still unimplemented and a fatal error will serve as a reminder of that. Of
    course nothing issues them yet.

    Release note: None

    storage: skip ApplyConfChange on rejected entry

    When in a joint configuration, passing an empty conf change to
    ApplyConfChange doesn't do the right thing any more: it tells Raft that
    we're leaving the joint config. It's not a good idea to try to tell Raft
    anything about a ConfChange that got rejected. Raft internally knows that
    we handled it because it knows the applied index.

    This also adds a case match for ConfChangeV2 which is necessary to route
    atomic replication changes (ConfChangeV2).

    See etcd-io/etcd#11046

    Release note: None

    storage: un-embed decodedConfChange

    I ate a number of NPEs during development because nullable embedded fields
    are tricky; they hide the pointer derefs that often need a nil check. We'll
    embed the fields of decodedConfChange instead which works out better. This
    commit also adds the unmarshaling code necessary for ConfChangeV2 needed
    once we issue atomic replication changes.

    Release note: None

    storage: add learners one by one

    Doing more than one change at once is going to force us into an atomic
    replication change. This isn't crazy, but seems unnecessary at this point,
    so just add the learners one by one.

    Release note: None

    storage: add fatals where atomic conf changes are unsupported

    These will be upgraded with proper handling when atomic replication changes
    are actually introduced, but for now it's convenient to stub out some code
    that will need to handle them and to make sure we won't forget to do so
    later.

    Release note: None

    storage: add atomic replication changes cluster setting

    This defaults to false, and won't have an effect unless the newly
    introduced cluster version is also active.

    Release note: None

    roachpb: support zero-change ChangeReplicasTrigger

    We will use a ChangeReplicasTrigger without additions and removals when
    transitioning out of a joint configuration, so make sure it supports this
    properly.

    Release note: None

    roachpb: return "desired" voters from ReplicaDescriptors.Voters

    Previous commits introduced (yet unused) voter types to encode joint
    consensus configurations which occur during atomic replication changes.

    Access to the slice of replicas is unfortunately common, though at least
    it's compartmentalized via the getters Voters() and Learners().

    The main problem solved in this commit is figuring out what should be
    returned from Voters(): is it all VoterX types, or only voters in one of
    the two majority configs part of a joint quorum?

    The useful answer is returning the set of voters corresponding to what the
    config will be once the joint state is exited; this happens to be what most
    callers care about. Incoming and full voters are really the same thing in
    our code; we just need to distinguish them from outgoing voters to
    correctly maintain the quorum sizes.

    Of course there are some callers that do care about quorum sizes, and a
    number of cleanups were made for them.

    This commit also adds a ReplicaDescriptors.ConfState helper which is then
    used in all of the places that were previously cobbling together a
    ConfState manually.

    Release note: None

    roachpb: add ReplicaType_Voter{Incoming,Outgoing}

    These are required for atomic replication changes to describe joint
    configurations, i.e. configurations consisting of two sets of replicas which
    both need to reach quorum to make replication decisions.

    An audit of existing consumers of this enum will follow.

    Release note: None

    roachpb: rename ReplicaType variants

    The current naming is idiomatic for proto enums, but atypical for its usage
    in Go code. There is no `(gogoproto.customname)` that can fix this, and
    we're about to add more replica types that would require awkward names such
    as `roachpb.ReplicaType_VOTER_OUTGOING`.

    Switch to a Go-friendly naming scheme instead.

    Release note: None

    batcheval: generalize checkNotLearnerReplica

    This now errors out whenever the replica is not a voter, which is more
    robust as new replica types are introduced (which generally should not
    automatically become eligible to receive leases).

    Release note: None

    roachpb: improve RangeDescriptor.Validate

    Make sure there isn't more than one replica per store.

    Release note: None

    roachpb: generalize ReplicaDescriptor.String()

    The new code will generalize to new replica types.

    Release note: None

    [dnm] vendor: bump raft

    This picks up upstream fixes related to atomic membership changes.

    I had to smuggle in a small hack because we're picking up
    etcd-io/etcd#11037 which makes a race between the
    snapshot queue and the proactive learner snapshot much more likely, and
    this in turn makes tests quite flaky because it turns out that if the
    learner snap loses, it can actually error out.

    Release note: None

    storage: avoid fatal error from splitPostApply

    This is the next band-aid on top of #39658 and #39571. The descriptor
    lookup I added sometimes fails because replicas can process a split trigger
    in which they're not a member of the range:

    > F190821 15:14:28.241623 312191 storage/store.go:2172
    > [n2,s2,r21/3:/{Table/54-Max}] replica descriptor of local store not
    > found in right hand side of split

    I saw this randomly in `make test PKG=./pkg/ccl/partitionccl`.

    Release note: None

40221: cli: Add default locality settings for multi node demo clusters r=jordanlewis a=rohany

Addresses part of #39938.

Release note (cli change): Default cluster locality topologies for
multi-node cockroach demo clusters.

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Rohan Yadav <rohany@alumni.cmu.edu>
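
The commit listing above describes joint configurations as two sets of replicas that must both reach quorum, and says that Voters() returns the voters the config will have once the joint state is exited. A minimal sketch of both ideas follows. The type and function names here are hypothetical and are not the roachpb API; the assumption that the incoming config is full plus incoming voters and the outgoing config is full plus outgoing voters is inferred from the commit messages.

```go
package main

import "fmt"

// ReplicaType is a hypothetical stand-in for the replica type enum.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	VoterIncoming
	VoterOutgoing
	Learner
)

// Replica is a simplified replica descriptor.
type Replica struct {
	ID   int
	Type ReplicaType
}

// idsOf returns the IDs of all replicas whose type is one of types.
func idsOf(rs []Replica, types ...ReplicaType) []int {
	var ids []int
	for _, r := range rs {
		for _, t := range types {
			if r.Type == t {
				ids = append(ids, r.ID)
			}
		}
	}
	return ids
}

// DesiredVoters returns the voters of the config that will be in effect
// after leaving the joint state: full and incoming voters.
func DesiredVoters(rs []Replica) []int { return idsOf(rs, VoterFull, VoterIncoming) }

// OutgoingVoters returns the voters of the config being left behind.
func OutgoingVoters(rs []Replica) []int { return idsOf(rs, VoterFull, VoterOutgoing) }

// majority reports whether more than half of ids have acknowledged.
func majority(ids []int, acks map[int]bool) bool {
	n := 0
	for _, id := range ids {
		if acks[id] {
			n++
		}
	}
	return n > len(ids)/2
}

// JointQuorum requires a majority in both the incoming and outgoing configs.
func JointQuorum(rs []Replica, acks map[int]bool) bool {
	return majority(DesiredVoters(rs), acks) && majority(OutgoingVoters(rs), acks)
}

func main() {
	// Replica 3 is being replaced by replica 4.
	rs := []Replica{{1, VoterFull}, {2, VoterFull}, {3, VoterOutgoing}, {4, VoterIncoming}}
	fmt.Println(DesiredVoters(rs))                                // [1 2 4]
	fmt.Println(JointQuorum(rs, map[int]bool{1: true, 4: true})) // false
	fmt.Println(JointQuorum(rs, map[int]bool{1: true, 2: true})) // true
}
```

Acknowledgements from replicas 1 and 4 form a majority of the incoming config {1, 2, 4} but not of the outgoing config {1, 2, 3}, which is why callers that care about quorum sizes cannot rely on the desired voters alone.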
raft/raft.go Outdated
@@ -1607,16 +1606,21 @@ func (r *raft) abortLeaderTransfer() {
// If the new entries would exceed the limit, the method returns false. If not,
// the increase in uncommitted entry size is recorded and the method returns
// true.
// Configuration changes are never refused.

Instead of concerning ourselves with the size of the payload and trying to enforce this by allowing zero-byte payloads, could we just check the entry's type directly? We may need to change this condition to "Configuration changes that are alone are never refused".


Or is it the case that we also want to allow the empty entry that new leaders propose and this seems like the cleanest way to do it? If that's the case then "Configuration changes are never refused" isn't accurate, right? What we actually want to say is that "entries without a payload are never refused".


Yes, in both cases it's the empty payload. I clarified the comment.
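
The rule settled on in this thread (entries without a payload are never refused) can be sketched as follows. This is a simplified model under stated assumptions, not the method in raft/raft.go: the `entry` and `limiter` types and the field names are hypothetical, and payload size is taken to be the length of the data slice.

```go
package main

import "fmt"

// entry is a hypothetical log entry; only the payload matters here.
type entry struct {
	data []byte
}

// limiter tracks the total payload size of uncommitted entries.
type limiter struct {
	uncommitted uint64
	max         uint64
}

// increase reports whether the entries may be appended. Empty payloads
// (size zero) are never refused, which covers both the empty entry a new
// leader proposes and the auto-leave conf change. The first non-empty
// proposal is also accepted when nothing is uncommitted yet.
func (l *limiter) increase(ents []entry) bool {
	var s uint64
	for _, e := range ents {
		s += uint64(len(e.data))
	}
	if l.uncommitted > 0 && s > 0 && l.uncommitted+s > l.max {
		return false
	}
	l.uncommitted += s
	return true
}

func main() {
	l := &limiter{max: 10}
	fmt.Println(l.increase([]entry{{data: make([]byte, 8)}})) // true: nothing uncommitted yet
	fmt.Println(l.increase([]entry{{data: make([]byte, 8)}})) // false: 8+8 exceeds 10
	fmt.Println(l.increase([]entry{{}}))                      // true: empty payload
}
```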

raft/testdata/confchange_v2_add_double_auto.txt Outdated

xiang90 commented Jan 15, 2020

@tbg @nvanbenschoten

Shall we push this PR forward? The issue mentioned in the PR needs to be fixed.


tbg commented Feb 6, 2020

@xiang90 yep and sorry for the radio silence. It will happen.

The test was supposed to end earlier, but some old copy pasta
survived.
The code doing so was undertested and buggy: it would launch multiple
attempts to transition out when the conf change was not the last element
in the log.

This commit fixes the problem and adds a regression test. It also
reworks the code to handle a formerly untested edge case, in which the
auto-transition append is refused. This can't happen any more with the
current version of the code because this proposal has size zero and is
special cased in increaseUncommittedSize. Last but not least, the
auto-leave proposal now also bumps pendingConfIndex, which was not done
previously due to an oversight.
Apps typically maintain the raft configuration as part of the state
machine. As a result, they want to be able to reject configuration change
entries at apply time based on the state on which the entry is supposed
to be applied. When this happens, the app should not call
ApplyConfChange, but the comments did not make this clear.

As a result, it was tempting to pass an empty pb.ConfChange or its V2
version instead of not calling ApplyConfChange.

However, empty V1 and V2 protos aren't no-ops when the configuration is
joint: an empty V1 change is treated internally as a single
configuration change for NodeID zero and will cause a panic when applied
in a joint state. An empty V2 proto is treated as a signal to leave a
joint state, which means that the app's config and raft's would diverge.

The comments updated in this commit now ask users to not call
ApplyConfChange when they reject a conf change. Apps that never use joint
consensus can keep their old behavior since the distinction only matters
when in a joint state, but we don't want to encourage that.
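
The divergence described above can be sketched with a toy model. Everything here is hypothetical: `toyRaft`, `app`, and `confChangeV2` are illustrative types, not the raft package API, and the rule that more than one change enters a joint config is a simplification. The only behavior taken from the commit message is that an empty V2 change is treated as a signal to leave the joint state.

```go
package main

import "fmt"

// confChangeV2 is a hypothetical stand-in for a V2 conf change.
type confChangeV2 struct {
	changes []string
}

// toyRaft models only whether raft believes the config is joint.
type toyRaft struct {
	joint bool
}

// applyConfChange mimics the behavior described in the commit message:
// an empty V2 change means "leave the joint config".
func (r *toyRaft) applyConfChange(cc confChangeV2) {
	if len(cc.changes) == 0 {
		r.joint = false
		return
	}
	if len(cc.changes) > 1 {
		r.joint = true // simplification: multiple changes enter a joint config
	}
}

// app keeps its own view of the configuration in its state machine.
type app struct {
	joint bool
	raft  *toyRaft
}

// apply handles a committed conf change entry. accept is the result of the
// app's apply-time validation. On rejection the app does not call
// applyConfChange at all; raft learns the entry was handled from the
// applied index.
func (a *app) apply(cc confChangeV2, accept bool) {
	if !accept {
		return
	}
	a.raft.applyConfChange(cc)
	a.joint = a.raft.joint
}

func main() {
	r := &toyRaft{joint: true}
	a := &app{joint: true, raft: r}

	// Recommended: reject by skipping the call. Both views stay joint.
	a.apply(confChangeV2{changes: []string{"add n4"}}, false)
	fmt.Println("skipped call, in sync:", a.joint == r.joint) // true

	// Tempting but wrong: passing an empty change makes raft leave the
	// joint config while the app still thinks it is joint.
	r.applyConfChange(confChangeV2{})
	fmt.Println("empty change, in sync:", a.joint == r.joint) // false
}
```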

Codecov Report

Merging #11046 into master will decrease coverage by 0.86%.
The diff coverage is 90%.


@@            Coverage Diff             @@
##           master   #11046      +/-   ##
==========================================
- Coverage   66.65%   65.78%   -0.87%     
==========================================
  Files         401      401              
  Lines       36629    36627       -2     
==========================================
- Hits        24414    24096     -318     
- Misses      10715    11030     +315     
- Partials     1500     1501       +1
| Impacted Files | Coverage Δ |
|---|---|
| raft/rawnode.go | 73.56% <ø> (ø) ⬆️ |
| raft/node.go | 89.47% <ø> (ø) ⬆️ |
| raft/raft.go | 90.9% <90%> (+0.99%) ⬆️ |
| clientv3/balancer/utils.go | 53.84% <0%> (-46.16%) ⬇️ |
| client/client.go | 53.26% <0%> (-30.72%) ⬇️ |
| auth/store.go | 51.84% <0%> (-23.57%) ⬇️ |
| pkg/transport/timeout_conn.go | 60% <0%> (-20%) ⬇️ |
| proxy/grpcproxy/register.go | 72.5% <0%> (-10%) ⬇️ |
| auth/simple_token.go | 79.83% <0%> (-9.25%) ⬇️ |
| client/keys.go | 83.41% <0%> (-8.05%) ⬇️ |
| ... and 26 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5d7d8c0...0544f33.

@tbg tbg merged commit 52fba43 into etcd-io:master Feb 25, 2020
@tbg tbg deleted the joint-auto-leave branch February 25, 2020 12:18