kvserver: avoid hanging proposal after leader goes down #46045

Merged 4 commits on Mar 17, 2020

Commits on Mar 12, 2020

  1. cmd/roachtest: deflake gossip/chaos roachtest

    Deflake `gossip/chaos` by adding a missing
    `waitForFullReplication`. This test loops, killing a node and then
    verifying that the remaining nodes in the cluster stabilize on the same
    view of gossip connectivity. The test failed intermittently because
    gossip wasn't stabilizing. The root cause was that the SQL query that
    retrieves the gossip connectivity from one node was hanging, because
    a range was unavailable. Logs show that the
    leaseholder for that range was on a down node and that the range only
    seemed to contain a single replica. This could happen near the start of
    the test if we started killing nodes before full replication was
    achieved.
    
    Fixes cockroachdb#38829
    
    Release note: None
    petermattis authored and tbg committed Mar 12, 2020
    2783f1a
  2. roachtest: improve status duration display

    Release justification: testing change
    Release note: None
    tbg committed Mar 12, 2020
    499cbb4
  3. kvserver: comment on propBuf locking

    Release justification: comment-only change
    Release note: None
    tbg committed Mar 12, 2020
    b539a0e

Commits on Mar 16, 2020

  1. kvserver: avoid hanging proposal after leader goes down

    There was a bug in range quiescence due to which commands would hang in
    raft for minutes before actually getting replicated. This would occur
    whenever a range was quiesced but a follower replica which didn't know
    the (Raft) leader would receive a request.  This request would be
    evaluated and put into the Raft proposal buffer, and a ready check would
    be enqueued. However, no ready would be produced (since the proposal got
    dropped by raft; leader unknown) and so the replica would not unquiesce.
    
    This commit prevents this by always waking up the group if the proposal
    buffer was initially nonempty, even if an empty Ready is produced.
    
    It goes further than that by trying to ensure that a leader is always
    known while quiesced. Previously, on an incoming request to quiesce, we
    did not verify that the raft group had learned the leader's identity.
    
    One shortcoming here is that in the situation in which the proposal
    would originally hang "forever", it will now hang for one heartbeat
    timeout where ideally it would be proposed more reactively. Since
    this is so rare I didn't try to address this. Instead, refer to
    the ideas in
    
    cockroachdb#37906 (comment)
    
    and
    
    cockroachdb#21849
    
    for future changes that could mitigate this.
    
    Without this PR, the test would fail around 10% of the time. With this
    change, it passed 40 iterations in a row without a hitch, via:
    
        ./bin/roachtest run -u tobias --count 40 --parallelism 10 --cpu-quota 1280 gossip/chaos/nodes=9
    
    Release justification: bug fix
    Release note (bug fix): a rare case in which requests to a quiesced
    range could hang in the KV replication layer was fixed. This would
    manifest as a message saying "have been waiting ... for proposing" even
    though no loss of quorum occurred.
    tbg committed Mar 16, 2020
    1f95860
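
The two behaviors described in the last commit message can be illustrated with a small Go model. This is a sketch under stated assumptions, not the kvserver code: `toyReplica`, `maybeQuiesce`, `handleReady`, and the `wakeOnProposals` flag are hypothetical names, and "raft" is reduced to a rule that drops proposals when the leader is unknown.

```go
package main

import "fmt"

// toyReplica is a hypothetical stand-in for a replica. It models only the
// state relevant to the bug and is not the real kvserver.Replica.
type toyReplica struct {
	quiescent   bool
	leaderKnown bool
	propBuf     []string // evaluated commands waiting to be proposed
}

// maybeQuiesce models a follower handling a request to quiesce. It refuses
// unless the leader's identity is known.
func (r *toyReplica) maybeQuiesce() bool {
	if !r.leaderKnown {
		return false
	}
	r.quiescent = true
	return true
}

// handleReady flushes the proposal buffer and reports whether a non-empty
// Ready resulted. Proposals are dropped when the leader is unknown, so no
// Ready is produced in that case. With wakeOnProposals set, the replica
// unquiesces whenever the buffer was non-empty, even if the Ready is empty.
func (r *toyReplica) handleReady(wakeOnProposals bool) (hasReady bool) {
	hadProposals := len(r.propBuf) > 0
	hasReady = hadProposals && r.leaderKnown
	r.propBuf = nil
	if hasReady || (wakeOnProposals && hadProposals) {
		r.quiescent = false
	}
	return hasReady
}

func main() {
	for _, wake := range []bool{false, true} {
		r := &toyReplica{quiescent: true, propBuf: []string{"put k=v"}}
		hasReady := r.handleReady(wake)
		fmt.Printf("wakeOnProposals=%t hasReady=%t stillQuiescent=%t\n",
			wake, hasReady, r.quiescent)
	}
	r := &toyReplica{}
	fmt.Println("quiesce accepted without known leader:", r.maybeQuiesce())
}
```

With `wakeOnProposals` false, the model reproduces the hang: the proposal is dropped and the replica stays quiescent. With it true, the replica wakes up despite the empty Ready. The model does not cover what happens next; per the commit message, the proposal still waits for one heartbeat timeout.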