storage: Handle raft.ErrProposalDropped #21849

Open
bdarnell opened this issue Jan 28, 2018 · 4 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

bdarnell (Contributor) commented Jan 28, 2018

etcd-io/etcd#9067 introduced a new error ErrProposalDropped. Properly handling this error will allow us to reduce the occurrence of ambiguous failures (Replica.executeWriteBatch doesn't need to return an AmbiguousResultError if it has not successfully proposed), and may allow us to be more intelligent in our raft-level retries.

This error alone does not allow us to eliminate time-based reproposals because raft's MsgProp forwarding is fire-and-forget. However, if we disabled raft-level forwarding and did our own forwarding, I think we could make the retry logic more deterministic (or at least hide the timing elements in the RPC layer).

Note that in typical usage of raft, you'd respond to this error by passing it up the stack to a layer that can try again on a different replica. We can't do that because of our leases: until the lease expires, no other node could successfully handle the request, so we have to just wait and retry on the leaseholder. (We might be able to use this to make lease requests themselves fail-fast, though.)
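
For context, here is a minimal Go sketch of the distinction being drawn: a proposal rejected with raft.ErrProposalDropped definitely never entered the log, so the caller can retry rather than report an ambiguous result. This is not CockroachDB's actual code; the package, the function name proposeLocally, and the go.etcd.io/etcd/raft/v3 import path are assumptions (the issue predates that path).

    package kvsketch

    import (
        "errors"

        "go.etcd.io/etcd/raft/v3"
    )

    // proposeLocally submits a command to the local raft group. The boolean
    // result reports whether a failure is unambiguous: raft.ErrProposalDropped
    // means the command never entered the log, so the caller may simply retry
    // instead of returning an AmbiguousResultError.
    func proposeLocally(rn *raft.RawNode, data []byte) (droppedBeforeLog bool, err error) {
        if err := rn.Propose(data); err != nil {
            return errors.Is(err, raft.ErrProposalDropped), err
        }
        return false, nil
    }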

Jira issue: CRDB-5872

Epic: CRDB-39898

@bdarnell bdarnell added this to the 2.1 milestone Jan 28, 2018
@bdarnell bdarnell self-assigned this Jan 28, 2018
@bdarnell bdarnell added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 26, 2018
@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@bdarnell bdarnell removed their assignment Jul 31, 2019
tbg (Member) commented Dec 4, 2019

Proposal forwarding has become a general problem recently. We see this come up both in #37906 and #42821.

In the former, we want to make sure that a follower that's behind can't propose (and get) a lease. Allowing only the raft leader to add proposals to its log would be a way to get that (mod "command runs on old raft leader while there's already another one").

In the latter, we had to add tricky code below raft to re-add commands under a new lease proposed index if we could detect that they had not applied and were no longer applicable. This solution is basically technical debt and it has cost us numerous subtle bugs over time. We want to get rid of it, which means making sure that lease applied indexes don't generally reorder. Again, this can be achieved with good enough accuracy by only ever adding to the local log, i.e. never forwarding proposals.

It's not trivial to set this up, but ultimately I think we'll be better off.

The rules would be something like this:

  1. don't propose to raft unless the local raft instance believes it is the leader (i.e. block on a state change, perhaps wake up/campaign eagerly when appropriate). Due to our existing heuristics for campaigning actively when appropriate, the situation in which there's no leader should be transient (and rare). Note that we only ever want to propose with a reasonable max lease index, but that is guaranteed because we wouldn't be proposing before seeing our own lease, at which point we're the only one proposing and know the current LAI with perfect accuracy.
  2. disable proposal forwarding. On ErrProposalDropped, go back up the stack into step 1, because the leader has likely changed (see the sketch below). If there's a raft leader and no lease is held, direct the command to that replica instead. (This last item also makes sure that when there's no lease, the raft leader will be the node that gets to request a lease.)
  3. lease transfers might need some smoothing out. The transferer initially holds both the lease and raft leadership. Once it has proposed the lease transfer, it immediately transfers raft leadership as well. (I don't know whether the process is reactive enough today; there's a chance it leaves the raft leadership transfer to the tick-based mechanism.)

Since only the leaseholder proposes to Raft, the leaseholder should be able to hold or regain raft leadership just fine. We just have to make sure that a follower won't ever campaign against an active leaseholder.
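
A rough sketch of what rules 1 and 2 could look like at the proposal site, assuming the etcd raft RawNode API (go.etcd.io/etcd/raft/v3). The names proposeAsLeader and errNotLeader are hypothetical, and the sketch ignores leases, locking, and redirection plumbing; it only illustrates "never forward, only append locally while we believe we are leader":

    package kvsketch

    import (
        "errors"

        "go.etcd.io/etcd/raft/v3"
    )

    // errNotLeader is a hypothetical sentinel telling the caller to redirect
    // the command to the current raft leader (which, when no lease is held,
    // is also the replica that should end up requesting the lease).
    var errNotLeader = errors.New("not raft leader; redirect to leader")

    // proposeAsLeader follows rules 1 and 2: never forward proposals, only
    // append to the local log while the local raft instance believes it is
    // the leader.
    func proposeAsLeader(rn *raft.RawNode, data []byte) error {
        st := rn.Status()
        if st.RaftState != raft.StateLeader {
            if st.Lead == raft.None {
                // No known leader: campaign eagerly instead of waiting for
                // the election timeout (rule 1).
                _ = rn.Campaign()
            }
            return errNotLeader
        }
        if err := rn.Propose(data); err != nil {
            if errors.Is(err, raft.ErrProposalDropped) {
                // Leadership was likely lost between the check and the
                // proposal; surface that so the caller goes back to rule 1.
                return errNotLeader
            }
            return err
        }
        return nil
    }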

tbg (Member) commented Dec 4, 2019

(cc @nvanbenschoten and @ajwerner since we both talked about this)

tbg added a commit to tbg/cockroach that referenced this issue Mar 12, 2020
There was a bug in range quiescence due to which commands would hang in
raft for minutes before actually getting replicated. This would occur
whenever a range was quiesced but a follower replica which didn't know
the (Raft) leader would receive a request.  This request would be
evaluated and put into the Raft proposal buffer, and a ready check would
be enqueued. However, no ready would be produced (since the proposal got
dropped by raft; leader unknown) and so the replica would not unquiesce.

This commit prevents this by always waking up the group if the proposal
buffer was initially nonempty, even if an empty Ready is produced.

It goes further than that by trying to ensure that a leader is always
known while quiesced. Previously, on an incoming request to quiesce, we
did not verify that the raft group had learned the leader's identity.

One shortcoming here is that in the situation in which the proposal
would originally hang "forever", it will now hang for one heartbeat
timeout where ideally it would be proposed more reactively. Since
this is so rare I didn't try to address this. Instead, refer to
the ideas in

cockroachdb#37906 (comment)

and

cockroachdb#21849

for future changes that could mitigate this.

Without this PR, the test would fail around 10% of the time. With this
change, it passed 40 iterations in a row without a hitch, via:

    ./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9

Release justification: bug fix
Release note (bug fix): a rare case in which requests to a quiesced
range could hang in the KV replication layer was fixed. This would
manifest as a message saying "have been waiting ... for proposing" even
though no loss of quorum occurred.
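
A hedged sketch of the shape of the fix described in this commit message, assuming the etcd raft RawNode API. proposalBuf, replicaSketch, and handleRaftReady are hypothetical stand-ins for kvserver internals, not the actual change; the point illustrated is that the group is unquiesced whenever the proposal buffer was nonempty, even if raft then produces no Ready:

    package kvsketch

    import "go.etcd.io/etcd/raft/v3"

    // proposalBuf and replicaSketch are hypothetical stand-ins for kvserver's
    // proposal buffer and Replica.
    type proposalBuf struct{ pending [][]byte }

    // flushInto hands all buffered commands to raft and reports whether there
    // was anything to flush. ErrProposalDropped is tolerated here: the point
    // of the fix is that the group wakes up regardless.
    func (b *proposalBuf) flushInto(rn *raft.RawNode) (hadProposals bool) {
        hadProposals = len(b.pending) > 0
        for _, cmd := range b.pending {
            _ = rn.Propose(cmd)
        }
        b.pending = nil
        return hadProposals
    }

    type replicaSketch struct {
        quiescent bool
        buf       proposalBuf
        rn        *raft.RawNode
    }

    // handleRaftReady unquiesces whenever the proposal buffer was nonempty,
    // even if raft then produces no Ready (e.g. because the proposal was
    // dropped for lack of a known leader).
    func (r *replicaSketch) handleRaftReady() {
        if r.buf.flushInto(r.rn) {
            r.quiescent = false
        }
        if r.rn.HasReady() {
            rd := r.rn.Ready()
            // ... persist entries, send messages, apply committed entries ...
            r.rn.Advance(rd)
        }
    }
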
tbg added a commit to tbg/cockroach that referenced this issue Mar 16, 2020
craig bot pushed a commit that referenced this issue Mar 17, 2020
46045: kvserver: avoid hanging proposal after leader goes down  r=nvanbenschoten,petermattis a=tbg

Deflakes gossip/chaos/nodes=9, i.e.

Fixes #38829.

Co-authored-by: Peter Mattis <petermattis@gmail.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
github-actions bot commented Jun 8, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels May 31, 2022
@tbg tbg removed their assignment May 31, 2022
tbg (Member) commented May 10, 2023

Some related musings: #102956 (comment)

@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024