Add Member During Network Partition #63

Merged
merged 7 commits into from
Mar 25, 2020

@JackyChiu JackyChiu commented Mar 23, 2020

closes #54

What

This PR covers the research and features needed to safely add members during a network partition. The tests demonstrate the behavior.

How

The root problem is that when strict-reconfig-check is enabled, all etcd nodes must be connected before a member can be added, even though the cluster should only need quorum to do so. The check exists to guard against an edge case: if a newly introduced node is misconfigured, the cluster can lose quorum, because the new node counts as a voting node right away.
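To make the edge case concrete, here is a minimal sketch of the quorum arithmetic. It is not code from this PR; the helper names `quorum` and `hasQuorum` are made up for illustration, and it assumes the usual majority rule (more than half of the voting members).

```go
package main

import "fmt"

// quorum returns how many voting members must be reachable for a cluster
// with the given number of voters to make progress (a strict majority).
// Hypothetical helper for illustration only.
func quorum(voters int) int {
	return voters/2 + 1
}

// hasQuorum reports whether the healthy voters form a majority.
func hasQuorum(voters, healthy int) bool {
	return healthy >= quorum(voters)
}

func main() {
	// 3-node cluster with one node cut off by a partition:
	// 2 healthy voters out of 3, quorum is 2.
	fmt.Println(hasQuorum(3, 2)) // true

	// Add a misconfigured node as a voter right away:
	// still 2 healthy voters, but now out of 4, quorum is 3.
	fmt.Println(hasQuorum(4, 2)) // false
}
```

So a cluster that is already down one node can be pushed below quorum by a single bad member add, which is the situation the check is trying to prevent.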

This PR leaves it up to developers to disable strict-reconfig-check themselves. However, I did introduce a feature that reduces the risk of adding a new node.

etcd v3.4 introduced a learner feature. A learner is an intermediate state in which a node has been added as a member but doesn't count as a voter. Once it has caught up with the leader's log, it can be promoted to a voting node. Promotion is where the cluster checks that the node is healthy and configured to join, something that isn't done currently when adding a new node.

In this PR we add new members as learners and auto-promote them once they have caught up. We had to build the auto-promotion ourselves since it's not available yet.
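A rough sketch of what an auto-promote loop can look like, not the implementation in this PR. The `promoter` interface, `errNotReady`, `fakeCluster`, and `runDemo` are all hypothetical names for illustration; in a real setup the promote call would go through the etcd v3.4 client (`MemberAddAsLearner` followed by `MemberPromote`), and how a "not caught up yet" error is detected is an assumption here.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotReady stands in for the error returned while a learner is still
// behind the leader's log. Hypothetical; the real client error differs.
var errNotReady = errors.New("learner not in sync with leader")

// promoter is a hypothetical, narrowed view of a cluster client.
type promoter interface {
	Promote(ctx context.Context, id uint64) error
}

// autoPromote retries Promote every interval until it succeeds, returns a
// non-retryable error, or ctx is done.
func autoPromote(ctx context.Context, p promoter, id uint64, interval time.Duration) error {
	for {
		err := p.Promote(ctx, id)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errNotReady) {
			return err // anything else is treated as fatal
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
		}
	}
}

// fakeCluster rejects the first lag promote attempts, simulating a learner
// that needs time to catch up.
type fakeCluster struct{ lag, calls int }

func (f *fakeCluster) Promote(ctx context.Context, id uint64) error {
	f.calls++
	if f.calls <= f.lag {
		return errNotReady
	}
	return nil
}

// runDemo promotes a learner that is lag attempts behind and returns how
// many promote calls were made.
func runDemo(lag int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	c := &fakeCluster{lag: lag}
	err := autoPromote(ctx, c, 42, time.Millisecond)
	return c.calls, err
}

func main() {
	calls, err := runDemo(3)
	fmt.Println(calls, err) // 4 <nil>
}
```

The node never counts toward quorum while the loop is retrying, so a learner that never catches up only costs a timeout rather than a vote.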

Review

Review by commit for an easier time.

tldr

  • disable strict-reconfig-check to add a node during a network partition
  • adding members as learners reduces the risk of losing quorum

@JackyChiu JackyChiu changed the title from "Add Member During Network Parition" to "Add Member During Network Partition" Mar 23, 2020

@dipshit dipshit left a comment


thanks for the detailed PR description & good test

@JackyChiu JackyChiu merged commit 1ed9130 into master Mar 25, 2020
@JackyChiu JackyChiu deleted the learners branch March 25, 2020 21:12
Successfully merging this pull request may close these issues.

Do we need to remove node on failure?