
raft: Use TransferLeadership to make leader demotion safer #1939

Merged
merged 1 commit into moby:master on Feb 17, 2017

Conversation

aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Feb 8, 2017

When we demote the leader, we currently wait for all queued messages to be sent, as a best-effort approach to making sure the other nodes find out that the node removal has been committed and stop treating the current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is trying to remove itself. The new leader's reconciliation loop will kick in and remove the old leader.

cc @LK4D4 @cyli
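
For readers following along, here is a minimal sketch of the flow described above (not the actual swarmkit code; the raftHandle interface and its method names are placeholders invented for illustration):

package demo

import "context"

// raftHandle abstracts the few raft operations the sketch needs; in
// swarmkit these live on the raft Node (names here are placeholders).
type raftHandle interface {
	ID() uint64
	LongestActiveMember() (uint64, error) // choose a transferee
	TransferLeadership(ctx context.Context, from, to uint64)
	WaitForNewLeader(ctx context.Context) error // blocks until another node leads
}

// removeSelf sketches the demotion flow: transfer leadership, wait for a
// new leader, and rely on that leader's reconciliation loop to remove
// this node from the member list.
func removeSelf(ctx context.Context, r raftHandle) error {
	transferee, err := r.LongestActiveMember()
	if err != nil {
		return err
	}
	r.TransferLeadership(ctx, r.ID(), transferee)
	return r.WaitForNewLeader(ctx)
}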

@codecov-io

codecov-io commented Feb 8, 2017

Codecov Report

Merging #1939 into master will decrease coverage by 0.31%.
The diff coverage is 20.37%.

@@            Coverage Diff             @@
##           master    #1939      +/-   ##
==========================================
- Coverage   54.29%   53.99%   -0.31%     
==========================================
  Files         108      108              
  Lines       18586    18588       +2     
==========================================
- Hits        10092    10036      -56     
- Misses       7257     7324      +67     
+ Partials     1237     1228       -9


@aaronlehmann
Collaborator Author

Note to self: need to confirm this won't cause any problems in mixed-version swarms. I don't think it will, but it's important to be sure.

defer ticker.Stop()
for {
	leader := n.leader()
	if leader != 0 && leader != n.Config.ID {
Contributor

Replace 0 with raft.None.
Also, shouldn't we wait for the transferee to become leader? Or is it possible that another node becomes leader?

Collaborator Author

I was thinking that if a different node starts an election at the same time, that one might win.

Contributor

Sounds good.

	}
	select {
	case <-ctx.Done():
		return errors.Wrap(err, "timed out waiting for leadership change")
Contributor

It might not be a timeout, I think; maybe better to just use errors.Wrap(err, "failed to transfer leadership").

start := time.Now()
log.G(ctx).Infof("raft: transfer leadership %x -> %x", n.Config.ID, transferee)
n.raftNode.TransferLeadership(ctx, n.Config.ID, transferee)
ticker := time.NewTicker(100 * time.Millisecond)
Contributor

I see that etcd uses TickInterval here. Do you think it makes sense in this case?

Collaborator Author

Changed to TickInterval / 10 so we don't have to wait a full tick to recognize the leader has changed.
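
Combined with the earlier suggestion to use raft.None, the wait loop ends up looking roughly like this (a sketch, not the exact diff; it assumes the pkg/errors and etcd raft packages swarmkit already vendors, and that the tick interval is passed in by the caller):

package demo

import (
	"context"
	"time"

	"github.com/coreos/etcd/raft"
	"github.com/pkg/errors"
)

// waitForLeadershipChange polls until some node other than selfID is
// leader. Checking at a fraction of the raft tick interval means we
// notice the change without waiting a full tick.
func waitForLeadershipChange(ctx context.Context, selfID uint64,
	tickInterval time.Duration, leader func() uint64) error {

	ticker := time.NewTicker(tickInterval / 10)
	defer ticker.Stop()
	for {
		// raft.None means "no leader yet"; any other ID except our own
		// means the transfer (or a concurrent election) succeeded.
		if l := leader(); l != raft.None && l != selfID {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.Wrap(ctx.Err(), "failed to transfer leadership")
		case <-ticker.C:
		}
	}
}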

}

func (n *Node) removeSelfGracefully(ctx context.Context) error {
	transferCtx, cancelTransfer := context.WithTimeout(ctx, 10*time.Second)
Contributor

It looks a little too hardcoded. Maybe we should inherit it from ElectionTick as etcd does?

Collaborator Author

etcd seems to use this:

// ReqTimeout returns timeout for request to finish.
func (c *ServerConfig) ReqTimeout() time.Duration {
        // 5s for queue waiting, computation and disk IO delay
        // + 2 * election timeout for possible leader election
        return 5*time.Second + 2*time.Duration(c.ElectionTicks)*time.Duration(c.TickMs)*time.Millisecond
}

That seems reasonable, but I'm not sure it's that much better than a simple fixed timeout (it hardcodes 5 seconds, for example). I guess I'll switch to it anyway.
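
Expressed in this codebase's terms, that timeout would be computed roughly like this (a sketch; the 5-second constant is etcd's own allowance for queueing, computation, and disk I/O):

package demo

import "time"

// reqTimeout mirrors etcd's ServerConfig.ReqTimeout: a fixed allowance
// plus two election timeouts to cover a possible leader election.
func reqTimeout(electionTicks int, tickInterval time.Duration) time.Duration {
	return 5*time.Second + 2*time.Duration(electionTicks)*tickInterval
}

For example, with hypothetical values of 3 election ticks and a 1-second tick interval, this comes out to 11 seconds.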


return ErrCannotRemoveMember
// transferLeadership attempts to transfer leadership to a different node,
Contributor

@cyli cyli Feb 9, 2017

Non-blocking: would it make sense to duplicate the stopMu part of this comment up in removeSelfGracefully as well, since removeSelfGracefully doesn't hold stopMu either?

case <-ticker.C:
}
}
log.G(ctx).Infof("raft: transfer leadership %x -> %x finished in %v", n.Config.ID, transferee, time.Since(start))
Contributor

Should this be leader instead of transferee, in case some other node started leader election and then won?

@aaronlehmann
Collaborator Author

Addressed comments, PTAL

@LK4D4
Contributor

LK4D4 commented Feb 9, 2017

@aaronlehmann I've tried TestDemoteToSingleManager and not a single transfer finished with success:

ERRO[0001] failed to leave raft cluster gracefully       error="failed to transfer leadership: context canceled"

return errors.Wrap(err, "failed to get longest-active member")
}
start := time.Now()
log.G(ctx).Infof("raft: transfer leadership %x -> %x", n.Config.ID, transferee)
Contributor

This message is duplicated in etcd/raft; not sure if we care.

@cyli
Contributor

cyli commented Feb 9, 2017

TestDemoteToSingleManager passed 10 times in a row for me when running that test alone. I am often seeing this when running the integration test suite with -race, though (although my machine may be haunted again and it may be time to restart):

--- FAIL: TestDemotePromote (75.30s)
	Error Trace:	integration_test.go:254
			integration_test.go:278
	Error:		Received unexpected error error stop worker z2dj61jsjefe8dbunf0twdlo1: context deadline exceeded

@LK4D4
Contributor

LK4D4 commented Feb 9, 2017

Tests are passing okay, but transfer leadership doesn't work.

@cyli
Contributor

cyli commented Feb 10, 2017

@LK4D4 Ah sorry, yes, thanks for clarifying. I'm seeing the same.

@aaronlehmann
Collaborator Author

Thanks for pointing out that TransferLeadership was not actually succeeding in the tests.

The first problem here was that the context would be cancelled when the old leader lost the leadership. This caused the function to return an error.

I fixed this, but afterwards it turned out that the node would get removed twice: once by the call to Leave, and again by the new leader's reconciliation loop kicking in and trying to remove the same node.

So I'm trying a new, simpler, approach. RemoveNode no longer supports removing the local node. If this is attempted, an error gets returned, and the node role reconciliation loop can invoke TransferLeadership. Once there is a new leader, that leader can remove the old leader without incident.

PTAL
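
A sketch of the revised control flow (names follow the diff hunks quoted elsewhere in this PR; the wiring around the role-reconciliation loop is simplified):

package demo

import (
	"context"
	"errors"
)

// errCantRemoveSelf stands in for raft.ErrCantRemoveSelf from the diff.
var errCantRemoveSelf = errors.New("raft: can't remove self")

type raftAPI interface {
	RemoveMember(ctx context.Context, id uint64) error
	TransferLeadership(ctx context.Context) error
}

// demote tries to remove the member; if the member turns out to be the
// local (leader) node, it transfers leadership instead and lets the next
// leader's reconciliation loop perform the removal.
func demote(ctx context.Context, r raftAPI, id uint64) error {
	err := r.RemoveMember(ctx, id)
	if err == errCantRemoveSelf {
		return r.TransferLeadership(ctx)
	}
	return err
}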

@aaronlehmann
Collaborator Author

...of course the tests consistently pass on my machine :/

case <-ticker.C:
}
}
return nil
Contributor

Transfer finish is not logged by raft library :(

Collaborator Author

It's not, but raft does log when a new leader is elected, so I thought that would be enough.

@@ -57,6 +57,9 @@ var (
// ErrMemberUnknown is sent in response to a message from an
// unrecognized peer.
ErrMemberUnknown = errors.New("raft: member unknown")
// ErrCantRemoveSelf is returned if RemoveMember is called with the
// local node as the argument.
ErrCantRemoveSelf = errors.New("raft: can't remove self")
Contributor

The error above uses the word Cannot; maybe better to keep it consistent.

if err == raft.ErrCantRemoveSelf {
	// Don't use rmCtx, because we expect to lose
	// leadership, which will cancel this context.
	log.L.Info("demoted; ceding leadership")
Contributor

I've lived in the USA for almost three years, and I only vaguely know the meaning of the word ceding :) Let's use transferring or something more common.

@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

I have minor comments.
Looks good overall.

@aaronlehmann
Collaborator Author

I've made the cosmetic changes. I'm happy to add back the "transferred leadership" log message if you think it makes sense.

I had some test failures in CI with an earlier version of this, but I can't reproduce them locally. Can you try running the tests a few times to see if they are stable?

@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

@aaronlehmann sure, will run tests now.
Yeah, I liked the transfer-finished message with timings; it's kinda cool to see how fast raft is :)

@aaronlehmann
Collaborator Author

Added back the transfer timing.

@aaronlehmann
Collaborator Author

Adding back the n.asyncTasks.Wait() in applyRemoveNode seems to have stabilized CI. I'm not sure why, because I expected this would only be necessary when the leader removes itself. Otherwise, this case should only be reached when a node finds out its removal has been committed, and then it shouldn't be necessary to communicate anymore.
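
For reference, the shape of that change is roughly the following (a sketch; it assumes asyncTasks is a sync.WaitGroup tracking queued outgoing messages, which is not spelled out in this thread):

package demo

import (
	"context"
	"sync"
)

type node struct {
	id         uint64
	asyncTasks sync.WaitGroup // assumed: tracks in-flight message sends
}

// applyRemoveNode sketches the behaviour discussed above: when the node
// being removed is ourselves, drain queued outgoing messages first so
// peers are as likely as possible to have seen the committed removal.
func (n *node) applyRemoveNode(ctx context.Context, id uint64) error {
	if id == n.id {
		n.asyncTasks.Wait()
		// ... then stop raft and leave the cluster (omitted)
	}
	// ... apply the membership change to the local member list (omitted)
	return nil
}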

@LK4D4
Contributor

LK4D4 commented Feb 10, 2017

Tests are quite stable for me, no leaks, no races.

Contributor

@LK4D4 LK4D4 left a comment

LGTM

@aaronlehmann
Collaborator Author

I'm rebuilding it a few more times in CI.

@aaronlehmann
Collaborator Author

I can't get this to fail in CI anymore. Either the initial failures were a fluke, or adding back n.asyncTasks.Wait() made a difference for reasons I don't understand (maybe a timing issue?).

@@ -133,12 +133,19 @@ func (rm *roleManager) reconcileRole(node *api.Node) {
return
}

-	rmCtx, rmCancel := context.WithTimeout(rm.ctx, 5*time.Second)
+	rmCtx, rmCancel := context.WithTimeout(context.Background(), 5*time.Second)
Contributor

@cyli cyli Feb 10, 2017

Non-blocking question: Does this need to be context.Background() if rm.raft.TransferLeadership uses a new context?

@@ -327,74 +327,6 @@ func TestRaftFollowerLeave(t *testing.T) {
assert.Len(t, nodes[4].GetMemberlist(), 4)
}

func TestRaftLeaderLeave(t *testing.T) {
Contributor

Should this test be modified to just fail with the expected error rather than removed entirely?

Contributor

@cyli cyli Feb 10, 2017

Actually even dumber question: does anything besides the tests call Leave on a raft member anymore? (I assume even if not, we'd have to leave it in for backwards compatibility, but it might be nice to comment)

Collaborator Author

Actually even dumber question: does anything besides the tests call Leave on a raft member anymore?

Nope. Maybe we should remove it.

@cyli
Contributor

cyli commented Feb 10, 2017

Nice, I'm not getting any intermittent context exceeded errors in other integration tests anymore. Just have a few questions above, but other than that LGTM.

When we demote the leader, we currently wait for all queued messages to
be sent, as a best-effort approach to making sure the other nodes find
out that the node removal has been committed, and stop treating the
current leader as a cluster member. This doesn't work perfectly.

To make this more robust, use TransferLeadership when the leader is
trying to remove itself. The new leader's reconciliation loop will kick
in and remove the old leader.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann
Collaborator Author

I've updated this to work slightly differently. If the leadership transfer times out, we fall back to self-demoting the old way. I've tested this with a 1.14-dev leader in a cluster with a 1.12.6 follower, and demoting the leader works properly (after the timeout elapses).

PTAL. Hoping to merge this soon if it looks good.
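
The final behaviour can be summarized with a sketch like this (placeholder names; RemoveSelfLegacy stands for the pre-existing best-effort demotion path, and the 10-second timeout is illustrative):

package demo

import (
	"context"
	"time"
)

type leaver interface {
	TransferLeadership(ctx context.Context) error
	RemoveSelfLegacy(ctx context.Context) error // old best-effort path
}

// removeSelfGracefully tries the TransferLeadership path first and falls
// back to the old self-removal if the transfer does not finish in time,
// e.g. in a mixed-version cluster whose other managers predate this change.
func removeSelfGracefully(ctx context.Context, r leaver) error {
	transferCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	if err := r.TransferLeadership(transferCtx); err == nil {
		// The new leader's reconciliation loop will remove us.
		return nil
	}
	return r.RemoveSelfLegacy(ctx)
}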

@cyli
Contributor

cyli commented Feb 17, 2017

LGTM! Thank you for tracking down a different unrelated error!

@LK4D4 LK4D4 merged commit a82deb6 into moby:master Feb 17, 2017