
panic and dueling candidate livelock during cluster reconfiguration on v2.0.11 #2904

Closed

spacejam opened this issue Jun 2, 2015 · 10 comments

spacejam commented Jun 2, 2015

Story:

λ bin/etcdctl -C http://localhost:5001 member add etcd-2 http://localhost:4002
  Added member named etcd-2 with ID c98fe2142d42273e to cluster

  ETCD_NAME="etcd-2"
  ETCD_INITIAL_CLUSTER="etcd-1=http://localhost:4001,etcd-2=http://localhost:4002"
  ETCD_INITIAL_CLUSTER_STATE="existing"
λ bin/etcdctl -C http://localhost:5001 member add etcd-3 http://localhost:4003
  Added member named etcd-3 with ID 82874780665ea2b4 to cluster

  ETCD_NAME="etcd-3"
  ETCD_INITIAL_CLUSTER="etcd-1=http://localhost:4001,etcd-3=http://localhost:4003,etcd-2=http://localhost:4002"
  ETCD_INITIAL_CLUSTER_STATE="existing"
λ bin/etcdctl -C http://localhost:5001 member add etcd-4 http://localhost:4004
  dial tcp 127.0.0.1:5001: connection refused
λ bin/etcdctl -C http://localhost:5002 member add etcd-4 http://localhost:4004
  Added member named etcd-4 with ID 9a0e534afd267794 to cluster

  ETCD_NAME="etcd-4"
  ETCD_INITIAL_CLUSTER="etcd-1=http://localhost:4001,etcd-3=http://localhost:4003,etcd-4=http://localhost:4004,etcd-2=http://localhost:4002"
  ETCD_INITIAL_CLUSTER_STATE="existing"
λ bin/etcdctl -C http://localhost:5002 member remove etcd-1 http://localhost:4001
  Provide a single member ID
λ bin/etcdctl -C http://localhost:5002 member remove etcd-1
  Found a member named etcd-1; if this is correct, please use its ID, eg:
          etcdctl member remove 5d814446642db07a
  For more details, read the documentation at https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#remove-a-member

  Couldn't find a member in the cluster with an ID of etcd-1.
λ bin/etcdctl -C http://localhost:5002 member remove http://localhost:4001
  Couldn't find a member in the cluster with an ID of http://localhost:4001.
λ bin/etcdctl -C http://localhost:5002 member remove 9a0e534afd267794
  Recieved an error trying to remove member 9a0e534afd267794: context deadline exceeded
λ bin/etcdctl -C http://localhost:5002 member remove 9a0e534afd267794
  Recieved an error trying to remove member 9a0e534afd267794: context deadline exceeded
λ bin/etcdctl -C http://localhost:5003 member remove 9a0e534afd267794
  Recieved an error trying to remove member 9a0e534afd267794: context deadline exceeded
λ bin/etcdctl -C http://localhost:5003 member remove 5d814446642db07a
  !!!! BANG !!!!

etcd-4 was started with --initial-cluster="etcd-1=http://localhost:4001,etcd-2=http://localhost:4002,etcd-3=http://localhost:4003,etcd-4=http://localhost:4004"

I killed etcd-1 before using the member remove command, and before bringing up etcd-4. I no longer have the mapping between nodes and hashes, but I think etcd-1 was 5d814446642db07a. I'm not sure if I started etcd-4 before or after removing etcd-1, but etcd-1 was down and a new leader had been elected. I have a memory of a blank name for one of the nodes when doing an etcdctl member list, but that may have been a previous cluster I had set up.

Our dearly departed etcd-2 had these dying words for us:

2015/06/01 16:12:21 sender: the connection with 9a0e534afd267794 became active
2015/06/01 16:12:21 raft: c98fe2142d42273e received msgApp rejection(lastindex: 0) from 9a0e534afd267794 for index 361
2015/06/01 16:12:21 raft: c98fe2142d42273e decreased progress of 9a0e534afd267794 to [next = 1, match = 0, wait = 0]
2015/06/01 16:12:21 rafthttp: server streaming to 9a0e534afd267794 at term 3 has been stopped
2015/06/01 16:12:21 etcdserver: removed member 9a0e534afd267794 from cluster 40ff005f5abc8125
2015/06/01 16:12:21 rafthttp: server streaming to 82874780665ea2b4 at term 3 has been stopped
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x85025]

goroutine 20 [running]:
github.com/coreos/etcd/etcdserver.(*Cluster).UpdateAttributes(0x208743100, 0x9a0e534afd267794, 0x208815b50, 0x6, 0x2088c5dc0, 0x1, 0x4)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/cluster.go:332 +0xe5
github.com/coreos/etcd/etcdserver.(*EtcdServer).applyRequest(0x208780200, 0x944db165b0460001, 0x208815b34, 0x3, 0x20874b8f0, 0x26, 0x2088c5d40, 0x38, 0x0, 0x67c2b8, ...)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/server.go:745 +0xb05
github.com/coreos/etcd/etcdserver.(*EtcdServer).apply(0x208780200, 0x2088d65d0, 0x155, 0x3eb, 0x20874bd40, 0x0, 0x0)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/server.go:698 +0x154
github.com/coreos/etcd/etcdserver.(*EtcdServer).run(0x208780200)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/server.go:433 +0xf28
created by github.com/coreos/etcd/etcdserver.(*EtcdServer).start
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/server.go:305 +0x27c

The cluster then went into an infinite dueling-candidate loop, livelocking the poor fella. I'm not sure what specifically is to blame for that, but Raft is particularly vulnerable to it when certain kinds of partitions arise.
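
For illustration, here is a minimal Go sketch (not etcd's actual code; the types and the member ID are only stand-ins) of the failure pattern the stack trace suggests: the config change removing the member is applied first, and a still-pending attribute-update entry for that same ID is then applied against a cluster map that no longer contains it.

package main

import "fmt"

// Hypothetical, simplified types standing in for etcdserver's cluster state.
type Attributes struct {
    Name       string
    ClientURLs []string
}

type Member struct {
    ID uint64
    Attributes
}

type Cluster struct {
    members map[uint64]*Member
}

func (c *Cluster) RemoveMember(id uint64) {
    delete(c.members, id)
}

// UpdateAttributes dereferences the member without checking whether it is
// still present, mirroring the nil pointer panic at cluster.go:332.
func (c *Cluster) UpdateAttributes(id uint64, attr Attributes) {
    c.members[id].Attributes = attr // nil dereference if id was just removed
}

func main() {
    c := &Cluster{members: map[uint64]*Member{
        0x9a0e534afd267794: {ID: 0x9a0e534afd267794},
    }}
    // The remove config change is applied first ...
    c.RemoveMember(0x9a0e534afd267794)
    // ... then the stale attribute-update entry is applied and panics.
    c.UpdateAttributes(0x9a0e534afd267794, Attributes{Name: "etcd-4"})
    fmt.Println("unreachable")
}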

spacejam changed the title from "panic during cluster reconfiguration on v2.0.11" to "panic and dueling candidate livelock during cluster reconfiguration on v2.0.11" on Jun 2, 2015

spacejam commented Jun 2, 2015

I was a bit annoyed that my memory wasn't better when recalling the story up to this crash, so I made a fuzzer based on what I did know. It's totally buggy and bad in general, but what do you know! A different crash!
fuzzer code:
https://github.com/spacejam/etcd_conf_fuzz/blob/master/etcfuzz.rb
new crash log from fuzzer:
https://github.com/spacejam/etcd_conf_fuzz/blob/master/crashlog

invoking
bin/etcd --data-dir="d3"
        --name="etcd-3"
        --initial-cluster-state="existing"
        --listen-peer-urls="http://localhost:4003"
        --initial-advertise-peer-urls="http://localhost:4003"
        --listen-client-urls="http://localhost:5003"
        --advertise-client-urls="http://localhost:5003"
        --initial-cluster="etcd-1=http://localhost:4001,etcd-2=http://localhost:4002,etcd-3=http://localhost:4003,etcd-4=http://localhost:4004"
2015/06/01 20:06:02 etcd: listening for peers on http://localhost:4003
2015/06/01 20:06:02 etcd: listening for client requests on http://localhost:5003
2015/06/01 20:06:02 etcdserver: datadir is valid for the 2.0.1 format
2015/06/01 20:06:02 etcdserver: could not get cluster response from http://localhost:4001: Get http://localhost:4001/members: dial tcp 127.0.0.1:4001: connection refused
2015/06/01 20:06:02 etcdserver: could not get cluster response from http://localhost:4002: Get http://localhost:4002/members: dial tcp 127.0.0.1:4002: connection refused
2015/06/01 20:06:02 etcdserver: name = etcd-3
2015/06/01 20:06:02 etcdserver: data dir = d3
2015/06/01 20:06:02 etcdserver: member dir = d3/member
2015/06/01 20:06:02 etcdserver: heartbeat = 100ms
2015/06/01 20:06:02 etcdserver: election = 1000ms
2015/06/01 20:06:02 etcdserver: snapshot count = 10000
2015/06/01 20:06:02 etcdserver: advertise client URLs = http://localhost:5003
2015/06/01 20:06:02 etcdserver: start member 44a37f7441568b8b in cluster 40ff005f5abc8125
2015/06/01 20:06:02 raft: 44a37f7441568b8b became follower at term 0
2015/06/01 20:06:02 raft: newRaft 44a37f7441568b8b [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2015/06/01 20:06:02 raft: 44a37f7441568b8b became follower at term 1
2015/06/01 20:06:02 raft: 44a37f7441568b8b [term: 1] received a MsgHeartbeat message with higher term from 136b13cbfb2c4e73 [term: 18]
2015/06/01 20:06:02 raft: 44a37f7441568b8b became follower at term 18
2015/06/01 20:06:02 sender: the connection with 44a37f7441568b8b became active
2015/06/01 20:06:02 tocommit(98) is out of range [lastIndex(0)]
panic: tocommit(98) is out of range [lastIndex(0)]

goroutine 15 [running]:
log.Panicf(0x4e2ad0, 0x2c, 0x2087f4510, 0x2, 0x2)
        /usr/local/go/src/log/log.go:314 +0xd0
github.com/coreos/etcd/raft.(*raftLog).commitTo(0x208746b90, 0x62)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/log.go:177 +0x179
github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0x208748420, 0x8, 0x44a37f7441568b8b, 0x136b13cbfb2c4e73, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/raft.go:593 +0x47
github.com/coreos/etcd/raft.stepFollower(0x208748420, 0x8, 0x44a37f7441568b8b, 0x136b13cbfb2c4e73, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/raft.go:563 +0xfdd
github.com/coreos/etcd/raft.(*raft).Step(0x208748420, 0x8, 0x44a37f7441568b8b, 0x136b13cbfb2c4e73, 0x12, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/raft.go:461 +0x23f
github.com/coreos/etcd/raft.(*node).run(0x208746b40, 0x208748420)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/node.go:275 +0x689
created by github.com/coreos/etcd/raft.StartNode
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/raft/node.go:168 +0x50a

xiang90 commented Jun 2, 2015

@spacejam

In case 2, you misconfigured your cluster. You should never restart a server without its previous log.

etcd tries its best to detect this before starting (by asking the other members for its previous status). But in your case, all the peers were down, so it could not detect this bad case.

When you restart the member, the log should print out restart instead of start. The raft panic log also indicates that the member was not restarted properly and lost its previous state.
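
For illustration, a minimal Go sketch (not the real raft package, just the invariant behind the panic message) of why a member that comes back with an empty log has to halt here: the leader's heartbeat says index 98 is committed, but a correct node can only advance its commit index over entries it has actually persisted, so an empty log means its previous state was lost.

package main

import "log"

// Hypothetical, stripped-down raft log.
type raftLog struct {
    entries   []uint64 // one entry per persisted index (1-based)
    committed uint64
}

func (l *raftLog) lastIndex() uint64 { return uint64(len(l.entries)) }

func (l *raftLog) commitTo(tocommit uint64) {
    if l.committed < tocommit {
        if l.lastIndex() < tocommit {
            // The leader claims index 98 is committed, but nothing is
            // persisted locally: the node's previous state is gone.
            log.Panicf("tocommit(%d) is out of range [lastIndex(%d)]",
                tocommit, l.lastIndex())
        }
        l.committed = tocommit
    }
}

func main() {
    // A member restarted with an empty data dir: lastIndex == 0.
    l := &raftLog{}
    // The leader's heartbeat carries commit index 98.
    l.commitTo(98) // panics, matching the crash above
}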

xiang90 commented Jun 2, 2015

@spacejam For case 1, it should be a bug. We are fixing it.

spacejam commented Jun 2, 2015

Crash #3, on the initial configuration of a brand-new cluster (not intentionally misconfigured):
https://github.com/spacejam/etcd_conf_fuzz/blob/master/crashlog2

configuring 3
2015/06/01 21:50:44 etcdserver: added member bf86d3c2d1afaed3 [http://localhost:4003] to cluster 40ff005f5abc8125
got Added member named etcd-3 with ID bf86d3c2d1afaed3 to cluster

ETCD_NAME="etcd-3"
ETCD_INITIAL_CLUSTER="etcd-1=http://localhost:4001,etcd-3=http://localhost:4003,etcd-2=http://localhost:4002"
ETCD_INITIAL_CLUSTER_STATE="existing"
 of len 230
configuring bf86d3c2d1afaed3
invoking
bin/etcd --data-dir="d3"
        --name="etcd-3"
        --initial-cluster-state="existing"
        --listen-peer-urls="http://localhost:4003"
        --initial-advertise-peer-urls="http://localhost:4003"
        --listen-client-urls="http://localhost:5003"
        --advertise-client-urls="http://localhost:5003"
        --initial-cluster="etcd-1=http://localhost:4001,etcd-2=http://localhost:4002,etcd-3=http://localhost:4003"
2015/06/01 21:50:44 sender: error posting to bf86d3c2d1afaed3: dial tcp 127.0.0.1:4003: connection refused
2015/06/01 21:50:44 sender: the connection with bf86d3c2d1afaed3 became inactive
2015/06/01 21:50:44 etcd: listening for peers on http://localhost:4003
2015/06/01 21:50:44 etcd: listening for client requests on http://localhost:5003
2015/06/01 21:50:44 etcdserver: datadir is valid for the 2.0.1 format
2015/06/01 21:50:44 etcdserver: name = etcd-3
2015/06/01 21:50:44 etcdserver: data dir = d3
2015/06/01 21:50:44 etcdserver: member dir = d3/member
2015/06/01 21:50:44 etcdserver: heartbeat = 100ms
2015/06/01 21:50:44 etcdserver: election = 1000ms
2015/06/01 21:50:44 etcdserver: snapshot count = 10000
2015/06/01 21:50:44 etcdserver: advertise client URLs = http://localhost:5003
2015/06/01 21:50:44 etcdserver: start member bf86d3c2d1afaed3 in cluster 40ff005f5abc8125
2015/06/01 21:50:44 raft: bf86d3c2d1afaed3 became follower at term 0
2015/06/01 21:50:44 raft: newRaft bf86d3c2d1afaed3 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
2015/06/01 21:50:44 raft: bf86d3c2d1afaed3 became follower at term 1
2015/06/01 21:50:44 etcdserver: added member bf86d3c2d1afaed3 [http://localhost:4003] to cluster 40ff005f5abc8125
2015/06/01 21:50:44 raft: bf86d3c2d1afaed3 [term: 1] received a MsgHeartbeat message with higher term from 5d814446642db07a [term: 2]
2015/06/01 21:50:44 raft: bf86d3c2d1afaed3 became follower at term 2
2015/06/01 21:50:44 raft.node: bf86d3c2d1afaed3 elected leader 5d814446642db07a at term 2
2015/06/01 21:50:44 sender: the connection with bf86d3c2d1afaed3 became active
2015/06/01 21:50:44 raft: bf86d3c2d1afaed3 [logterm: 0, index: 18] rejected msgApp [logterm: 2, index: 18] from 5d814446642db07a
2015/06/01 21:50:44 raft: 5d814446642db07a received msgApp rejection(lastindex: 0) from bf86d3c2d1afaed3 for index 18
2015/06/01 21:50:44 raft: 5d814446642db07a decreased progress of bf86d3c2d1afaed3 to [next = 1, match = 0, wait = 0]
2015/06/01 21:50:44 etcdserver: removed member bf86d3c2d1afaed3 from cluster 40ff005f5abc8125
2015/06/01 21:50:44 rafthttp: server streaming to c10464c025debefb at term 2 has been stopped
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x85025]


goroutine 8 [running]:
github.com/coreos/etcd/etcdserver.(*Cluster).UpdateAttributes(0x208743100, 0xbf86d3c2d1afaed3, 0x20883cce0, 0x6, 0x208742b80, 0x1, 0x4)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/cluster.go:332 +0xe5
github.com/coreos/etcd/etcdserver.(*EtcdServer).applyRequest(0x208784200, 0xd34db29b7cf00001, 0x20883ccc0, 0x3, 0x20882d8f0, 0x26, 0x208742b00, 0x38, 0x0, 0x67c2b8, ...)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/gopath/src/github.com/coreos/etcd/etcdserver/server.go:745 +0xb05
github.com/coreos/etcd/etcdserver.(*EtcdServer).apply(0x208784200, 0x208721758, 0x2, 0xd, 0x20874b050, 0x0, 0x0)
        /Users/unihorn/gopath/src/github.com/coreos/etcd/scripts/release/etcd/g2015/06/01 21:50:44 rafthttp: client streaming to 5d814446642db07a at term 2 hgot Removed member bf86d3c2d1afaed3 from cluster

xiang90 commented Jun 2, 2015

@spacejam Well... this is the same thing as case 1, and I have confirmed it is a bug. We are fixing it. Thanks for reporting.

etcd cannot survive misconfiguration, so we will not be able to fix case 2.

xiang90 commented Jun 2, 2015

@spacejam Basically:

  1. When you remove a member, you have to remove its data dir before restarting it.
  2. When you restart an existing member, you have to use its previous data dir.

If you follow these rules, anything else should be a bug.
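
For illustration, a small hypothetical Go helper (not part of etcd or the fuzzer; all names are made up) showing how a test harness could encode those two rules:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// dataDir is a hypothetical per-member data directory used by the harness.
func dataDir(name string) string { return filepath.Join("fuzz-dirs", name) }

// removeMember encodes rule 1: once a member has been removed from the
// cluster, wipe its old data dir before starting anything under that
// identity again.
func removeMember(name string) error {
    return os.RemoveAll(dataDir(name))
}

// restartMember encodes rule 2: an existing member must come back with its
// previous data dir; starting it with an empty dir discards its raft log.
func restartMember(name string) error {
    if _, err := os.Stat(dataDir(name)); err != nil {
        return fmt.Errorf("refusing to restart %s without its data dir: %v", name, err)
    }
    // exec etcd --name=<name> --data-dir=<dataDir(name)> ... (omitted)
    return nil
}

func main() {
    fmt.Println(removeMember("etcd-1"))
    fmt.Println(restartMember("etcd-3"))
}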

spacejam commented Jun 2, 2015

Ok, I'll have my tool follow those constraints and see if anything else pops out.

xiang90 commented Jun 2, 2015

@spacejam Thank you!

xiang90 commented Jun 8, 2015

@spacejam Any update?

@yichengq

The bug has been fixed. Feel free to open a new issue if anything else pops out.
