
TiDB cluster TLS does not work on EKS #1685

Closed
tennix opened this issue Feb 12, 2020 · 14 comments

tennix (Member) commented Feb 12, 2020

Bug Report

The EKS-generated certificate's Subject contains only O=PingCAP, OU=TiDB Operator, CN=tls-cluster-pd and no SANs (e.g. tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc), so the cluster fails to bootstrap. The detailed PD log is as follows:

[2020/02/12 06:13:00.332 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43400] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:00.336 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.23.255:36960] [server-name=tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:00.336 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\". Reconnecting..."]
[2020/02/12 06:13:00.676 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43424] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:01.343 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.23.255:36962] [server-name=tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:01.343 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\". Reconnecting..."]
[2020/02/12 06:13:01.348 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43432] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:02.737 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.30.0:48058] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:02.891 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\". Reconnecting..."]
[2020/02/12 06:13:02.891 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.23.255:36964] [server-name=tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:04.692 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43450] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:05.365 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43456] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:05.380 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43458] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:05.686 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\". Reconnecting..."]
[2020/02/12 06:13:05.686 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.23.255:36978] [server-name=tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:05.709 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43464] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:07.398 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43478] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:09.724 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43492] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:09.739 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.26.61:43496] [server-name=tls-cluster-pd.tls-cluster] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:09.901 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.23.255:36980] [server-name=tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc] [error="remote error: tls: bad certificate"]
[2020/02/12 06:13:09.901 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\". Reconnecting..."]
{"level":"warn","ts":"2020-02-12T06:13:10.325Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-47ec7372-2d94-436a-aeca-8c491d8525a0/tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tls-cluster-pd, not tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc\""}
[2020/02/12 06:13:10.325 +00:00] [FATAL] [main.go:117] ["run server failed"] [error="context deadline exceeded"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:117\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]

There is a similar issue here: awslabs/amazon-eks-ami#341. The cf-operator uses a workaround: cloudfoundry-incubator/quarks-operator@cbab593

This bug can be reproduced on versions 1.12 through 1.14 (the current latest). We should consider adopting a workaround similar to cf-operator's.
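The certificate here is signed by the cluster CA through the Kubernetes certificates API (the Issuer is CN=kubernetes, as shown below). A rough sketch of such a request is shown here; the resource name and the SAN list are only illustrative, taken from the hostnames in the log above, and the point is that on EKS the signed certificate comes back without the requested SANs, which is what breaks hostname verification between peers.

# Sketch of a CertificateSigningRequest for one PD peer certificate
# (resource name and SAN list are illustrative, based on the log above).
apiVersion: certificates.k8s.io/v1beta1
kind: CertificateSigningRequest
metadata:
  name: tls-cluster-pd-1   # hypothetical resource name
spec:
  # spec.request holds a base64-encoded PEM CSR; its SAN extension should list e.g.:
  #   DNS: tls-cluster-pd
  #   DNS: tls-cluster-pd.tls-cluster
  #   DNS: tls-cluster-pd-peer.tls-cluster.svc
  #   DNS: tls-cluster-pd-1.tls-cluster-pd-peer.tls-cluster.svc
  request: <base64-encoded CSR>
  usages:
    - digital signature
    - key encipherment
    - server auth
    - client auth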

tennix added the type/bug (Something isn't working) label on Feb 12, 2020
kolbe (Contributor) commented Feb 12, 2020

A workaround for this could be to bake authentication/identity information into the certificate's Subject instead of its SAN.

weekface (Contributor) commented Feb 20, 2020

I created a cluster with the following manifest, with enableTLSCluster set to true:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
 name: demo
 namespace: demo
spec:
 version: v3.0.8
 timezone: UTC
 pvReclaimPolicy: Delete
 enableTLSCluster: true
 pd:
   baseImage: pingcap/pd
   replicas: 3
   requests:
     storage: "1Gi"
   config: {}
 tikv:
   baseImage: pingcap/tikv
   replicas: 1
   requests:
     storage: "1Gi"
   config: {}
 tidb:
   baseImage: pingcap/tidb
   replicas: 1
   service:
     type: ClusterIP
   config: {}

The PDs failed to start:

[2020/02/20 10:58:41.484 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2020/02/20 10:58:41.484 +00:00] [INFO] [util.go:60] [PD] [release-version=v3.0.8]
[2020/02/20 10:58:41.484 +00:00] [INFO] [util.go:61] [PD] [git-hash=456c42b8b0955b33426b58054e43b771801a74d0]
[2020/02/20 10:58:41.484 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2020/02/20 10:58:41.484 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-12-31 11:13:16"]
[2020/02/20 10:58:41.484 +00:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2020/02/20 10:58:41.484 +00:00] [INFO] [server.go:110] ["PD Config"] [config="{\"client-urls\":\"https://0.0.0.0:2379\",\"peer-urls\":\"https://0.0.0.0:2380\",\"advertise-client-urls\":\"https://demo-pd-1.demo-pd-peer.demo.svc:2379\",\"advertise-peer-urls\":\"https://demo-pd-1.demo-pd-peer.demo.svc:2380\",\"name\":\"demo-pd-1\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"enable-grpc-gateway\":true,\"initial-cluster\":\"demo-pd-1=https://demo-pd-1.demo-pd-peer.demo.svc:2380\",\"initial-cluster-state\":\"new\",\"join\":\"\",\"lease\":3,\"log\":{\"level\":\"\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"log-rotate\":true,\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null},\"log-file\":\"\",\"log-level\":\"\",\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"demo-pd-1\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":20,\"max-merge-region-keys\":200000,\"split-merge-interval\":\"1h0m0s\",\"enable-one-way-merge\":\"false\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"region-schedule-limit\":64,\"replica-schedule-limit\":64,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":4,\"hot-region-cache-hits-threshold\":3,\"store-balance-rate\":15,\"tolerant-size-ratio\":0,\"low-space-ratio\":0.8,\"high-space-ratio\":0.6,\"scheduler-max-waiting-operator\":3,\"disable-raft-learner\":\"false\",\"disable-remove-down-replica\":\"false\",\"disable-replace-offline-replica\":\"false\",\"disable-make-up-replica\":\"false\",\"disable-remove-extra-replica\":\"false\",\"disable-location-replacement\":\"false\",\"disable-namespace-relocation\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false},{\"type\":\"hot-region\",\"args\":null,\"disable\":false},{\"type\":\"label\",\"args\":null,\"disable\":false}]},\"replication\":{\"max-replicas\":3,\"location-labels\":\"\",\"strictly-match-label\":\"false\"},\"namespace\":{},\"pd-server\":{\"use-region-storage\":\"true\"},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"0B\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"\",\"cert-path\":\"\",\"key-path\":\"\"},\"label-property\":null,\"WarningMsgs\":null,\"namespace-classifier\":\"table\",\"LeaderPriorityCheckInterval\":\"1m0s\"}"]
[2020/02/20 10:58:41.489 +00:00] [INFO] [server.go:145] ["start embed etcd"]
[2020/02/20 10:58:41.489 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[https://0.0.0.0:2380]"]
[2020/02/20 10:58:41.490 +00:00] [INFO] [etcd.go:360] ["closing etcd server"] [name=demo-pd-1] [data-dir=/var/lib/pd] [advertise-peer-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2380]"] [advertise-client-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2379]"]
[2020/02/20 10:58:41.490 +00:00] [INFO] [etcd.go:364] ["closed etcd server"] [name=demo-pd-1] [data-dir=/var/lib/pd] [advertise-peer-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2380]"] [advertise-client-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2379]"]
[2020/02/20 10:58:41.490 +00:00] [FATAL] [main.go:117] ["run server failed"] [error="cannot listen on TLS for [::]:2380: KeyFile and CertFile are not presented"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:117\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]

The configuration file is empty:

/var/lib/pd # cat /etc/pd/pd.toml
/var/lib/pd #

weekface (Contributor) commented:

I followed this commit (cloudfoundry-incubator/quarks-operator@cbab593) and changed the Common Name to the ClusterIP:

# openssl x509 -in cert  -noout -text | grep CN
        Issuer: CN=kubernetes
        Subject: O=PingCAP, OU=TiDB Operator, CN=172.20.185.55

But that doesn't work either.

[2020/02/20 12:47:33.165 +00:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2020/02/20 12:47:33.165 +00:00] [INFO] [util.go:60] [PD] [release-version=v3.0.8]
[2020/02/20 12:47:33.165 +00:00] [INFO] [util.go:61] [PD] [git-hash=456c42b8b0955b33426b58054e43b771801a74d0]
[2020/02/20 12:47:33.165 +00:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
[2020/02/20 12:47:33.166 +00:00] [INFO] [util.go:63] [PD] [utc-build-time="2019-12-31 11:13:16"]
[2020/02/20 12:47:33.166 +00:00] [INFO] [metricutil.go:81] ["disable Prometheus push client"]
[2020/02/20 12:47:33.166 +00:00] [INFO] [server.go:110] ["PD Config"] [config="{\"client-urls\":\"https://0.0.0.0:2379\",\"peer-urls\":\"https://0.0.0.0:2380\",\"advertise-client-urls\":\"https://demo-pd-1.demo-pd-peer.demo.svc:2379\",\"advertise-peer-urls\":\"https://demo-pd-1.demo-pd-peer.demo.svc:2380\",\"name\":\"demo-pd-1\",\"data-dir\":\"/var/lib/pd\",\"force-new-cluster\":false,\"enable-grpc-gateway\":true,\"initial-cluster\":\"demo-pd-1=https://demo-pd-1.demo-pd-peer.demo.svc:2380\",\"initial-cluster-state\":\"new\",\"join\":\"\",\"lease\":3,\"log\":{\"level\":\"\",\"format\":\"text\",\"disable-timestamp\":false,\"file\":{\"filename\":\"\",\"log-rotate\":true,\"max-size\":0,\"max-days\":0,\"max-backups\":0},\"development\":false,\"disable-caller\":false,\"disable-stacktrace\":false,\"disable-error-verbose\":true,\"sampling\":null},\"log-file\":\"\",\"log-level\":\"\",\"tso-save-interval\":\"3s\",\"metric\":{\"job\":\"demo-pd-1\",\"address\":\"\",\"interval\":\"15s\"},\"schedule\":{\"max-snapshot-count\":3,\"max-pending-peer-count\":16,\"max-merge-region-size\":20,\"max-merge-region-keys\":200000,\"split-merge-interval\":\"1h0m0s\",\"enable-one-way-merge\":\"false\",\"patrol-region-interval\":\"100ms\",\"max-store-down-time\":\"30m0s\",\"leader-schedule-limit\":4,\"region-schedule-limit\":64,\"replica-schedule-limit\":64,\"merge-schedule-limit\":8,\"hot-region-schedule-limit\":4,\"hot-region-cache-hits-threshold\":3,\"store-balance-rate\":15,\"tolerant-size-ratio\":0,\"low-space-ratio\":0.8,\"high-space-ratio\":0.6,\"scheduler-max-waiting-operator\":3,\"disable-raft-learner\":\"false\",\"disable-remove-down-replica\":\"false\",\"disable-replace-offline-replica\":\"false\",\"disable-make-up-replica\":\"false\",\"disable-remove-extra-replica\":\"false\",\"disable-location-replacement\":\"false\",\"disable-namespace-relocation\":\"false\",\"schedulers-v2\":[{\"type\":\"balance-region\",\"args\":null,\"disable\":false},{\"type\":\"balance-leader\",\"args\":null,\"disable\":false},{\"type\":\"hot-region\",\"args\":null,\"disable\":false},{\"type\":\"label\",\"args\":null,\"disable\":false}]},\"replication\":{\"max-replicas\":3,\"location-labels\":\"\",\"strictly-match-label\":\"false\"},\"namespace\":{},\"pd-server\":{\"use-region-storage\":\"true\"},\"cluster-version\":\"0.0.0\",\"quota-backend-bytes\":\"0B\",\"auto-compaction-mode\":\"periodic\",\"auto-compaction-retention-v2\":\"1h\",\"TickInterval\":\"500ms\",\"ElectionInterval\":\"3s\",\"PreVote\":true,\"security\":{\"cacert-path\":\"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt\",\"cert-path\":\"/var/lib/pd-tls/cert\",\"key-path\":\"/var/lib/pd-tls/key\"},\"label-property\":null,\"WarningMsgs\":null,\"namespace-classifier\":\"table\",\"LeaderPriorityCheckInterval\":\"1m0s\"}"]
[2020/02/20 12:47:33.168 +00:00] [INFO] [server.go:145] ["start embed etcd"]
[2020/02/20 12:47:33.168 +00:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[https://0.0.0.0:2380]"]
[2020/02/20 12:47:33.168 +00:00] [INFO] [etcd.go:465] ["starting with peer TLS"] [tls-info="cert = /var/lib/pd-tls/cert, key = /var/lib/pd-tls/key, trusted-ca = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, client-cert-auth = false, crl-file = "] [cipher-suites="[]"]
[2020/02/20 12:47:33.169 +00:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[https://0.0.0.0:2379]"]
[2020/02/20 12:47:33.169 +00:00] [INFO] [etcd.go:602] ["pprof is enabled"] [path=/debug/pprof]
[2020/02/20 12:47:33.169 +00:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2020/02/20 12:47:33.169 +00:00] [INFO] [etcd.go:299] ["starting an etcd server"] [etcd-version=3.4.3] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.12] [go-os=linux] [go-arch=amd64] [max-cpu-set=2] [max-cpu-available=2] [member-initialized=true] [name=demo-pd-1] [data-dir=/var/lib/pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/var/lib/pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2380]"] [listen-peer-urls="[https://0.0.0.0:2380]"] [advertise-client-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2379]"] [listen-client-urls="[https://0.0.0.0:2379]"] [listen-metrics-urls="[]"] [cors="[*]"] [host-whitelist="[*]"] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-size-bytes=2147483648] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2020/02/20 12:47:33.170 +00:00] [INFO] [backend.go:79] ["opened backend db"] [path=/var/lib/pd/member/snap/db] [took=203.421µs]
[2020/02/20 12:47:33.171 +00:00] [INFO] [raft.go:506] ["restarting local member"] [cluster-id=de0073c6f2fdb35f] [local-member-id=be1ed4468ffd0b6f] [commit-index=28]
[2020/02/20 12:47:33.171 +00:00] [INFO] [raft.go:1530] ["be1ed4468ffd0b6f switched to configuration voters=()"]
[2020/02/20 12:47:33.171 +00:00] [INFO] [raft.go:700] ["be1ed4468ffd0b6f became follower at term 14"]
[2020/02/20 12:47:33.171 +00:00] [INFO] [raft.go:383] ["newRaft be1ed4468ffd0b6f [peers: [], term: 14, commit: 28, applied: 0, lastindex: 28, lastterm: 14]"]
[2020/02/20 12:47:33.174 +00:00] [WARN] [store.go:1317] ["simple token is not cryptographically signed"]
[2020/02/20 12:47:33.174 +00:00] [INFO] [quota.go:98] ["enabled backend quota with default value"] [quota-name=v3-applier] [quota-size-bytes=2147483648] [quota-size="2.1 GB"]
[2020/02/20 12:47:33.175 +00:00] [INFO] [server.go:792] ["starting etcd server"] [local-member-id=be1ed4468ffd0b6f] [local-server-version=3.4.3] [cluster-version=to_be_decided]
[2020/02/20 12:47:33.176 +00:00] [INFO] [raft.go:1530] ["be1ed4468ffd0b6f switched to configuration voters=(13699620516036152175)"]
[2020/02/20 12:47:33.177 +00:00] [INFO] [cluster.go:392] ["added member"] [cluster-id=de0073c6f2fdb35f] [local-member-id=be1ed4468ffd0b6f] [added-peer-id=be1ed4468ffd0b6f] [added-peer-peer-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2380]"]
[2020/02/20 12:47:33.177 +00:00] [INFO] [cluster.go:558] ["set initial cluster version"] [cluster-id=de0073c6f2fdb35f] [local-member-id=be1ed4468ffd0b6f] [cluster-version=3.4]
[2020/02/20 12:47:33.177 +00:00] [INFO] [capability.go:76] ["enabled capabilities for version"] [cluster-version=3.4]
[2020/02/20 12:47:33.179 +00:00] [INFO] [server.go:658] ["started as single-node; fast-forwarding election ticks"] [local-member-id=be1ed4468ffd0b6f] [forward-ticks=5] [forward-duration=2.5s] [election-ticks=6] [election-timeout=3s]
[2020/02/20 12:47:33.180 +00:00] [INFO] [etcd.go:708] ["starting with client TLS"] [tls-info="cert = /var/lib/pd-tls/cert, key = /var/lib/pd-tls/key, trusted-ca = /var/run/secrets/kubernetes.io/serviceaccount/ca.crt, client-cert-auth = true, crl-file = "] [cipher-suites="[]"]
[2020/02/20 12:47:33.181 +00:00] [INFO] [etcd.go:241] ["now serving peer/client/metrics"] [local-member-id=be1ed4468ffd0b6f] [initial-advertise-peer-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2380]"] [listen-peer-urls="[https://0.0.0.0:2380]"] [advertise-client-urls="[https://demo-pd-1.demo-pd-peer.demo.svc:2379]"] [listen-client-urls="[https://0.0.0.0:2379]"] [listen-metrics-urls="[]"]
[2020/02/20 12:47:33.181 +00:00] [INFO] [etcd.go:576] ["serving peer traffic"] [address="[::]:2380"]
[2020/02/20 12:47:33.186 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:37834] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
2020/02/20 12:47:33.186 log.go:86: [warning] etcdserver: [could not get cluster response from https://demo-pd-1.demo-pd-peer.demo.svc:2380: Get https://demo-pd-1.demo-pd-peer.demo.svc:2380/members: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc]
[2020/02/20 12:47:33.187 +00:00] [ERROR] [etcdutil.go:63] ["failed to get cluster from remote"] [error="could not retrieve cluster information from the given URLs"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:923] ["be1ed4468ffd0b6f is starting a new election at term 14"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:729] ["be1ed4468ffd0b6f became pre-candidate at term 14"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:824] ["be1ed4468ffd0b6f received MsgPreVoteResp from be1ed4468ffd0b6f at term 14"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:713] ["be1ed4468ffd0b6f became candidate at term 15"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:824] ["be1ed4468ffd0b6f received MsgVoteResp from be1ed4468ffd0b6f at term 15"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [raft.go:765] ["be1ed4468ffd0b6f became leader at term 15"]
[2020/02/20 12:47:33.672 +00:00] [INFO] [node.go:325] ["raft.node: be1ed4468ffd0b6f elected leader be1ed4468ffd0b6f at term 15"]
[2020/02/20 12:47:33.675 +00:00] [INFO] [server.go:2016] ["published local member to cluster through raft"] [local-member-id=be1ed4468ffd0b6f] [local-member-attributes="{Name:demo-pd-1 ClientURLs:[https://demo-pd-1.demo-pd-peer.demo.svc:2379]}"] [request-path=/0/members/be1ed4468ffd0b6f/attributes] [cluster-id=de0073c6f2fdb35f] [publish-timeout=11s]
[2020/02/20 12:47:33.675 +00:00] [INFO] [server.go:175] ["create etcd v3 client"] [endpoints="[https://demo-pd-1.demo-pd-peer.demo.svc:2379]"]
[2020/02/20 12:47:33.681 +00:00] [INFO] [serve.go:191] ["serving client traffic securely"] [address="[::]:2379"]
[2020/02/20 12:47:33.696 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:54670] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:33.696 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://demo-pd-1.demo-pd-peer.demo.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\". Reconnecting..."]
[2020/02/20 12:47:34.702 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:54676] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:34.702 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://demo-pd-1.demo-pd-peer.demo.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\". Reconnecting..."]
[2020/02/20 12:47:35.141 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51422] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:35.162 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51426] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:36.288 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:54686] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:36.288 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://demo-pd-1.demo-pd-peer.demo.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\". Reconnecting..."]
[2020/02/20 12:47:38.180 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51434] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:39.113 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51450] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:39.227 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://demo-pd-1.demo-pd-peer.demo.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\". Reconnecting..."]
[2020/02/20 12:47:39.227 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:54706] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:41.199 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51454] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:42.132 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.46.16:51458] [server-name=demo-pd.demo] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:43.193 +00:00] [WARN] [config_logging.go:279] ["rejected connection"] [remote-addr=10.0.59.185:54742] [server-name=demo-pd-1.demo-pd-peer.demo.svc] [error="remote error: tls: bad certificate"]
[2020/02/20 12:47:43.193 +00:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {https://demo-pd-1.demo-pd-peer.demo.svc:2379 0  <nil>}. Err :connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\". Reconnecting..."]
[2020/02/20 12:47:43.684 +00:00] [FATAL] [main.go:117] ["run server failed"] [error="context deadline exceeded"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59\nmain.main\n\t/home/jenkins/agent/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:117\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200"]
{"level":"warn","ts":"2020-02-20T12:47:43.684Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-9ceb6a36-78a9-42e1-892c-1352bf095581/demo-pd-1.demo-pd-peer.demo.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for 172.20.185.55, not demo-pd-1.demo-pd-peer.demo.svc\""}

mightyguava (Contributor) commented Feb 20, 2020

I didn't understand how their workaround worked either... Maybe the cf-operator has a fallback where, if DNS validation fails, it retries with the Host header set to the IP? Could be worth asking them!

weekface (Contributor) commented:

TLS authentication involves two scenarios:

  • Mutual authentication between TiDB components (this doesn't work on EKS);
  • One-way and mutual authentication between the TiDB server and the MySQL client.

We already support the second scenario with user-defined certificates: #1714

To solve the first scenario on EKS, we could also support mutual authentication between TiDB components using user-defined certificates, as #1714 does.

What's your opinion? @tennix @mightyguava @cofyc

cofyc (Contributor) commented Feb 21, 2020

I think it is good to support user-specified certificates.

tennix (Member, Author) commented Feb 21, 2020

Yes, I suggest using cert-manager for automatic certificate management, since it supports different kinds of CAs. See #1669.
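For illustration only, a cert-manager Certificate for the PD cluster certs could look roughly like this; the issuer, secret name and dnsNames are assumptions based on the demo cluster above, not a final design. cert-manager would then renew the secret automatically before expiry.

# Rough sketch of a cert-manager Certificate for PD cluster TLS
# (issuer, secret name and dnsNames are illustrative, not a final design).
apiVersion: cert-manager.io/v1alpha2
kind: Certificate
metadata:
  name: demo-pd-cluster-cert
  namespace: demo
spec:
  secretName: demo-pd-cluster-secret    # secret the PD pods would mount
  duration: 8760h                       # 1 year
  issuerRef:
    name: tidb-cluster-issuer           # a CA Issuer created beforehand
    kind: Issuer
  commonName: demo-pd
  dnsNames:
    - demo-pd
    - demo-pd.demo
    - demo-pd.demo.svc
    - demo-pd-peer
    - demo-pd-peer.demo
    - demo-pd-peer.demo.svc
    - "*.demo-pd-peer"
    - "*.demo-pd-peer.demo"
    - "*.demo-pd-peer.demo.svc"
  usages:
    - server auth
    - client auth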

weekface (Contributor) commented Feb 21, 2020

User-defined certificates are different from cert-manager-issued certificates.

If we use cert-manager, the CA private key must be provided. User-defined certificates don't need the CA private key, just the certificates themselves.

For this EKS problem, user-defined certificates are enough to resolve it.

cert-manager would need a lot of refactoring, and it still couldn't use user-defined certificates.

cofyc (Contributor) commented Feb 21, 2020

We can provide a manual way to configure certificates first, then use cert-manager to provision certificates and configure them in the TidbCluster CRD automatically.

tennix (Member, Author) commented Feb 21, 2020

Yes, as a first step, we can let users generate certificates manually for the internal mutual TLS. Though the operation is verbose, it works.

Then I think we need to support cert-manager anyway, because on some platforms users need to automate the procedure and auto-rotate certificates before they expire.

mightyguava (Contributor) commented:

We definitely prefer having the operator manage the certs for us rather than managing user-defined certificates manually. cert-manager seems like a good option too, but in that case we would want to use AWS Private CA rather than providing cert-manager with the private key. There's an open PR here: cert-manager/cert-manager#2302

In the meantime, is it still possible to have a workaround on EKS that lets us use the Kubernetes CA?

weekface (Contributor) commented:

First, we need to support user-defined certificates so that EKS works properly. At the same time, we should try to find other possible workarounds on EKS.

In the long run, we still need to support cert-manager for issuing certificates; once the AWS Private CA feature is ready, we can use it too.

gregwebs (Contributor) commented:

EKS might fix this: aws/containers-roadmap#750

tennix (Member, Author) commented Mar 19, 2020

We've fixed this by allowing users to provide a TLS secret themselves. The TLS secret can be created by cert-manager or from certificates signed by a public CA.
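For example, the user-provided secret can simply be a standard kubernetes.io/tls Secret; the name below is illustrative, and the cert/key pair must be generated outside the cluster with the peer DNS SANs discussed above.

# Illustrative shape of a user-provided TLS secret (the name is hypothetical; the
# certificate must carry the per-pod peer DNS names as SANs).
apiVersion: v1
kind: Secret
metadata:
  name: demo-pd-cluster-secret
  namespace: demo
type: kubernetes.io/tls
data:
  ca.crt: <base64-encoded CA certificate>
  tls.crt: <base64-encoded certificate with the required SANs>
  tls.key: <base64-encoded private key>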
