panic and dueling candidate livelock during cluster reconfiguration on v2.0.11 #2904
I was a bit annoyed that my memory wasn't better when recalling the story up to this crash, so I made a fuzzer based on what I did know. It's totally buggy and bad in general, but what do you know! A different crash!
In case 2, you misconfigured your cluster. You should never restart a server without its previous log. etcd tries its best to detect this before starting (by asking the other peers for its previous status), but in your case all the peers were down, so it could not detect this bad case. When you restarted the member, the log should have printed this out.
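To illustrate the maintainer's point, here is a sketch of the wrong and right ways to restart a member. The flag names are etcd's; the data-dir paths and URLs are hypothetical placeholders, not taken from this cluster:

```shell
# Wrong: restarting with the bootstrap flags and an empty data directory
# makes the member rejoin with no history, which its peers cannot always
# detect (especially when, as here, the other peers are down).
# etcd --name etcd-2 --data-dir /tmp/fresh-dir --initial-cluster ...

# Right: restart the member pointing at its original data directory, so it
# comes back with its previous Raft log intact.
etcd --name etcd-2 \
  --data-dir /var/lib/etcd/etcd-2 \
  --listen-peer-urls http://localhost:4002 \
  --listen-client-urls http://localhost:2379
```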
@spacejam For case 1, it should be a bug. We are fixing it.
Crash #3 happened on the initial configuration of a brand-new cluster, not one that was intentionally misconfigured.
@spacejam Well... this is the same thing as case 1, and I have confirmed it is a bug. We are fixing it. Thanks for your report. etcd cannot survive misconfiguration, so we will not be able to fix case 2.
@spacejam Basically:
If you follow these rules, anything else should be a bug.
Ok, I'll have my tool follow those constraints and see if anything else pops out.
@spacejam Thank you!
@spacejam Any update?
The bug has been fixed. Feel free to open a new issue if anything else pops out.
Story:
etcd-4 was started with
--initial-cluster="etcd-1=http://localhost:4001,etcd-2=http://localhost:4002,etcd-3=http://localhost:4003,etcd-4=http://localhost:4004"
I killed etcd-1 before using the member remove command, and before bringing up etcd-4. I no longer have the mapping between nodes and hashes, but I think etcd-1 was 5d814446642db07a. I'm not sure if I started etcd-4 before or after removing etcd-1, but etcd-1 was down and a new leader had been elected. I have a memory of a blank name for one of the nodes when doing an etcdctl member list, but that may have been a previous cluster I had set up.
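For reference, a sketch of the reconfiguration order that avoids this state, using the same localhost URLs as the `--initial-cluster` flag above (the member ID is the one guessed for etcd-1 in the story; in practice it comes from `etcdctl member list`):

```shell
# 1. Remove the dead etcd-1 from the membership while a quorum of the
#    remaining members is still up.
etcdctl member remove 5d814446642db07a

# 2. Only then register the new member and start it with cluster state
#    "existing", with an --initial-cluster that no longer lists etcd-1.
etcdctl member add etcd-4 http://localhost:4004
etcd --name etcd-4 \
  --initial-cluster-state existing \
  --initial-cluster "etcd-2=http://localhost:4002,etcd-3=http://localhost:4003,etcd-4=http://localhost:4004"
```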
Our dearly departed etcd-2 had these dying words for us:
The cluster then went into an infinite dueling-candidate loop, livelocking the poor fella. I'm not sure what specifically is to blame for that, but Raft is particularly vulnerable to it when certain kinds of partitions arise.