
Can't start second etcd cluster #12623

Closed

vladloskut opened this issue Jan 14, 2021 · 11 comments
@vladloskut

Hello, I am trying to create a second etcd cluster. Here is my infrastructure:

ETCD cluster 1:

Node #1 PG + patroni + etcd
Node #2 PG + patroni + etcd
Node #3 etcd only

ETCD cluster 2:

Node #4 PG + patroni + etcd
Node #5 PG + patroni + etcd
Node #3 etcd only

So, as you can see, the second cluster uses node #3 to form a quorum.

ETCD cluster 1 runs with no problems, but when I try to launch the second cluster I get the following error on nodes #4 and #5:

request cluster ID mismatch (got A want B)

I did a Google search but couldn't find anything on how to run two ETCD clusters.

For ETCD cluster 2 I changed the ports on node #3 and also created a separate systemd service.
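For reference, the separate service is just a second systemd unit pointing at its own environment file; a minimal sketch (the unit name and file paths here are illustrative, not my exact ones):

[Unit]
Description=etcd instance for cluster 2
After=network-online.target

[Service]
Type=notify
# ETCD_* variables for cluster 2 (its own ports, data dir, and cluster token)
EnvironmentFile=/etc/etcd/etcd2.conf
ExecStart=/usr/local/bin/etcd
Restart=on-failure

[Install]
WantedBy=multi-user.target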

Please help me

@ptabor (Contributor) commented Jan 14, 2021

Please paste here the exact (possibly obfuscated, but representative) command lines you use to run the etcd instances, and the exact error message.

@vladloskut (Author)

ETCD_NAME="pg_node_1"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.135:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.135:2379"
ETCD_LISTEN_PEER_URLS="http://10.105.241.135:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.135:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_1"
ETCD_INITIAL_CLUSTER="pg_node_1=http://10.105.241.135:2380,pg_node_2=http://10.105.241.137:2380,etcd_node_only=http://10.105.241.142:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

ETCD_NAME="pg_node_2"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.137:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.137:2379"
ETCD_LISTEN_PEER_URLS="http://10.105.241.137:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.137:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_1"
ETCD_INITIAL_CLUSTER="pg_node_1=http://10.105.241.135:2380,pg_node_2=http://10.105.241.137:2380,etcd_node_only=http://10.105.241.142:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

Service #1 on node #3:

ETCD_NAME="etcd_node_only"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.142:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.142:2379"
ETCD_LISTEN_PEER_URLS="http://10.105.241.142:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.142:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_1"
ETCD_INITIAL_CLUSTER="pg_node_1=http://10.105.241.135:2380,pg_node_2=http://10.105.241.137:2380,etcd_node_only=http://10.105.241.142:2380"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

Service #2 on node #3:

ETCD_NAME="etcd_node_only_2"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.142:2378,http://127.0.0.1:2378"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.142:2378"
ETCD_LISTEN_PEER_URLS="http://10.105.241.142:2381"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.142:2381"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_2"
ETCD_INITIAL_CLUSTER="pg_node_1=http://10.105.241.135:2380,pg_node_2=http://10.105.241.137:2380,etcd_node_only_2=http://10.105.241.142:2381"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/etcd_2"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

ETCD_NAME="db_node_3"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.119:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.119:2379"
ETCD_LISTEN_PEER_URLS="http://10.105.241.119:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.119:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_2"
ETCD_INITIAL_CLUSTER="db_node_3=http://10.105.241.119:2380,db_node_4=http://10.105.241.120:2380,etcd_node_only_2=http://10.105.241.142:2381"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/another_etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

ETCD_NAME="db_node_4"
ETCD_LISTEN_CLIENT_URLS="http://10.105.241.119:2379,http://127.0.0.1:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://10.105.241.119:2379"
ETCD_LISTEN_PEER_URLS="http://10.105.241.119:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://10.105.241.119:2380"
ETCD_INITIAL_CLUSTER_TOKEN="cluster_2"
ETCD_INITIAL_CLUSTER="db_node_3=http://10.105.241.119:2380,db_node_4=http://10.105.241.120:2380,etcd_node_only_2=http://10.105.241.142:2381"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_DATA_DIR="/var/lib/another_etcd"
ETCD_ELECTION_TIMEOUT="5000"
ETCD_HEARTBEAT_INTERVAL="1000"

@vladloskut (Author)

> Please paste here the exact (possibly obfuscated, but representative) command lines you use to run the etcd instances, and the exact error message.

Posted above. I tried to build it again. Every time I use a clean ETCD_DATA_DIR and a unique ETCD_INITIAL_CLUSTER_TOKEN.

@ptabor (Contributor) commented Jan 19, 2021

Why does node "http://10.105.241.135:2380" span both services?

It needs to belong to one cluster or the other, but not to both.

@vladloskut (Author)

> Why does node "http://10.105.241.135:2380" span both services?
>
> It needs to belong to one cluster or the other, but not to both.

My bad.

Here is the service #2 config:

ETCD_INITIAL_CLUSTER="db_node_3=http://10.105.241.119:2380,db_node_4=http://10.105.241.120:2380,etcd_node=http://10.105.241.142:2381"

@ptabor (Contributor) commented Jan 19, 2021

ETCD_LISTEN_PEER_URLS="http://10.105.241.119:2380" seems to be shared between nodes #4 and #5.

Either way, the warning is printed when one Raft node receives a Raft message targeted at another node.
So most likely you still have some cross-cluster IP:port mismatch in the cluster configuration.
You need to fully separate the address spaces of the two clusters.
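For example, a fully disjoint layout could look like this (the port pair for cluster 2 is only an example; strictly speaking only node #3, which hosts a member of each cluster, needs a second port pair, but using a distinct pair everywhere rules out cross-cluster mixups):

# Cluster 1 (client :2379, peer :2380)
#   pg_node_1         10.105.241.135
#   pg_node_2         10.105.241.137
#   etcd_node_only    10.105.241.142
# Cluster 2 (client :2378, peer :2381)
#   db_node_3         10.105.241.119
#   db_node_4         10.105.241.120
#   etcd_node_only_2  10.105.241.142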

@vladloskut (Author)

ETCD_INITIAL_CLUSTER is the same for nodes #1, #2 and service #1 on node #3.
ETCD_INITIAL_CLUSTER is the same for nodes #4, #5 and service #2 on node #3.

The data dirs were empty before start.

ETCD_INITIAL_CLUSTER_TOKEN is shared by nodes #1, #2 and service #1 on node #3, and unique to that cluster.
ETCD_INITIAL_CLUSTER_TOKEN is shared by nodes #4, #5 and service #2 on node #3, and unique to that cluster.

> ETCD_LISTEN_PEER_URLS="http://10.105.241.119:2380" seems to be shared between nodes #4 and #5.

My bad too, sorry, I'm a bit tired of this thing...

@vladloskut (Author)

/usr/local/bin/another_etcdctl member list --write-out=table

+------------------+---------+----------------+----------------------------+----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+----------------+----------------------------+----------------------------+------------+
| 1a4ab720ca2c1494 | started | etcd_node_only | http://10.105.241.142:2380 | http://10.105.241.142:2379 | false |
| 52ae9c81e8329fa7 | started | pg_node_2 | http://10.105.241.137:2380 | http://10.105.241.137:2379 | false |
| f8ed7525a14a4fb4 | started | pg_node_1 | http://10.105.241.135:2380 | http://10.105.241.135:2379 | false |
+------------------+---------+----------------+----------------------------+----------------------------+------------+

This is the output from node #3.

How do I query the other cluster?
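(Answering my own question: passing the second cluster's client endpoints explicitly seems to work; a sketch, with endpoints taken from the configs above:)

ETCDCTL_API=3 /usr/local/bin/another_etcdctl member list --write-out=table \
  --endpoints=http://10.105.241.119:2379,http://10.105.241.120:2379,http://10.105.241.142:2378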

/usr/local/bin/another_etcdctl endpoint status --write-out=table --endpoints=10.105.241.119:2379,10.105.241.120:2379,10.105.241.142:2378

+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 10.105.241.120:2379 | 923207833d9ad5e1 | 3.4.14 | 20 kB | false | false | 123 | 2973 | 2973 | |
| 10.105.241.142:2378 | 53f80f7fb22dbc34 | 3.4.14 | 20 kB | true | false | 123 | 2973 | 2973 | |
+---------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

@vladloskut (Author)

Hmm... I think I found a way to resolve this issue.

I used ports 2378 and 2381 on nodes #4 and #5 plus service #2 on node #3, and all nodes connected immediately.

One more question: how do I check that the two ETCD clusters are really healthy? Help please :)

@ptabor (Contributor) commented Jan 19, 2021

If you can write, and all nodes have the same RAFT APPLIED INDEX afterwards, that is a good indicator that all of them are connected.
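A minimal way to run that check, assuming the cluster-2 client port 2378 from your final setup (the key name is arbitrary):

# write through one member of cluster 2
ETCDCTL_API=3 etcdctl --endpoints=http://10.105.241.119:2378 put healthcheck ok
# then compare RAFT APPLIED INDEX across all members
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
  --endpoints=http://10.105.241.119:2378,http://10.105.241.120:2378,http://10.105.241.142:2378
# endpoint health additionally reports per-endpoint reachability
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=http://10.105.241.119:2378,http://10.105.241.120:2378,http://10.105.241.142:2378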

@ptabor (Contributor) commented Jan 19, 2021

Assuming the issue is solved.

ptabor closed this as completed on Jan 19, 2021.