domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774

winkyao · 2018-09-25T03:18:23Z

Fix an issue where, when the PD leader is down, the etcd session becomes stale and TiDB may in some cases take several minutes to create a new etcd session.

I have done a manual test using ansible, with inventory.ini:

## TiDB Cluster Part
[tidb_servers]
192.168.59.105

[tikv_servers]
192.168.59.105

[pd_servers]
192.168.59.105
192.168.59.103

[spark_master]

[spark_slaves]

## Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
192.168.59.105

[grafana_servers]
192.168.59.105

# node_exporter and blackbox_exporter servers
[monitored_servers]
192.168.59.105

[alertmanager_servers]
192.168.59.105

[kafka_exporter_servers]

## Binlog Part
[pump_servers:children]
tidb_servers

[drainer_servers]

## Group variables
[pd_servers:vars]
# location_labels = ["zone","rack","host"]

## Global variables
[all:vars]
deploy_dir = /home/tidb/deploy

## Connection
# ssh via normal user
ansible_user = wink

cluster_name = test-cluster

tidb_version = v2.0.1

# process supervision, [systemd, supervise]
process_supervision = systemd

# timezone of deployment region
timezone = Asia/Shanghai
set_timezone = True

enable_firewalld = False
# check NTP service
enable_ntpd = True
set_hostname = False
# CPU, memory and disk performance will not be checked when dev_mode = True
dev_mode = False

## binlog trigger
enable_binlog = False
# zookeeper address of kafka cluster for binlog, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = ""
# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = ""

# store slow query log into separate file
enable_slow_query_log = False

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# Optional: Set if you already have an alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "admin"


### Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

The test cluster has 2 PD, 1 TiDB, and 1 TiKV. I started a client that continuously inserts data into TiDB, then ran kill -9 on the PD leader. TiDB then reports:

2018/09/28 11:53:58.800 domain.go:417: [info] [ddl] reload schema in loop, server info syncer need restart

After PD recovers (2018/09/28 11:54:59.562 leader.go:269: [info] PD cluster leader pd1 is ready to serve), the etcd session is recovered:

2018/09/28 11:54:59.545 domain.go:419: [info] [ddl] server info syncer restarted.

Recovery takes only several milliseconds.

crazycs520 · 2018-09-26T02:25:29Z

ddl/syncer.go

@@ -88,7 +90,7 @@ type SchemaSyncer interface {
 type schemaVersionSyncer struct {
 	selfSchemaVerPath string
 	etcdCli           *clientv3.Client
-	session           *concurrency.Session
+	session           unsafe.Pointer


Why change this?

session is accessed from multiple goroutines.

Can you explain in more detail?

loadSnapshotInfoSchemaIfNeeded uses the session, and so does loadSchemaLoop.
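To make the reasoning in this thread concrete, here is a minimal sketch of sharing a pointer between goroutines through an `unsafe.Pointer` field and `sync/atomic`. It only mirrors the shape of the changed field: `fakeSession`, `syncer`, `loadSession`, and `storeSession` are hypothetical stand-ins, not the names used in ddl/syncer.go.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"unsafe"
)

// fakeSession is a hypothetical stand-in for *concurrency.Session.
type fakeSession struct {
	id int
}

// syncer holds the session behind an unsafe.Pointer so that it can be
// read and replaced atomically from different goroutines.
type syncer struct {
	session unsafe.Pointer // always stores a *fakeSession
}

// loadSession atomically reads the current session (nil if none is set).
func (s *syncer) loadSession() *fakeSession {
	return (*fakeSession)(atomic.LoadPointer(&s.session))
}

// storeSession atomically publishes a new session.
func (s *syncer) storeSession(sess *fakeSession) {
	atomic.StorePointer(&s.session, unsafe.Pointer(sess))
}

func main() {
	s := &syncer{}
	s.storeSession(&fakeSession{id: 1})

	var wg sync.WaitGroup
	wg.Add(2)
	// One goroutine keeps reading the session.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			_ = s.loadSession().id
		}
	}()
	// Another goroutine replaces it, as a restart would.
	go func() {
		defer wg.Done()
		for i := 2; i <= 100; i++ {
			s.storeSession(&fakeSession{id: i})
		}
	}()
	wg.Wait()

	fmt.Println("final session id:", s.loadSession().id) // prints "final session id: 100"
}
```

A mutex around a typed `*concurrency.Session` field would also work; the atomic pointer just keeps readers lock-free at the cost of a cast on every access.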

winkyao · 2018-09-28T03:34:49Z

@zimulala @crazycs520 @shenli PTAL

crazycs520

LGTM

ciscoxll

LGTM

ciscoxll · 2018-09-28T07:27:40Z

@zimulala PTAL.

zimulala · 2018-09-28T08:36:01Z

domain/domain.go

 			err := do.mustRestartSyncer()
 			if err != nil {
 				log.Errorf("[ddl] reload schema in loop, schema syncer restart err %v", errors.ErrorStack(err))
 				break
 			}
-			do.SchemaValidator.Restart()


Do we need to add metrics here?

We already have metrics.NewSessionHistogram and metrics.DeploySyncerHistogram, so I don't think we need to add any here.

winkyao · 2018-09-28T08:45:39Z

/run-all-tests

zimulala

LGTM

zimulala · 2018-09-28T09:35:35Z

ddl/syncer.go

@@ -88,7 +90,7 @@ type SchemaSyncer interface {
 type schemaVersionSyncer struct {
 	selfSchemaVerPath string
 	etcdCli           *clientv3.Client
-	session           *concurrency.Session
+	session           unsafe.Pointer


domain: fast new a etcd session when the session is stale in the schemaVersionSyncer

58fcff5

winkyao added the status/DNM label Sep 25, 2018

crazycs520 reviewed Sep 26, 2018

remove new session timeout

deaf6f7

winkyao removed the status/DNM label Sep 28, 2018

Merge remote-tracking branch 'upstream/master' into fast_failover_etcd

ed072c6

add log when info syncer is recover

0d13ac6

crazycs520 reviewed Sep 28, 2018

winkyao added the priority/release-blocker and component/DDL-need-LGT3 labels Sep 28, 2018

ciscoxll reviewed Sep 28, 2018

winkyao added the status/LGT2 label Sep 28, 2018

zimulala reviewed Sep 28, 2018

winkyao added the status/all tests passed label Sep 28, 2018

zimulala approved these changes Sep 28, 2018

winkyao merged commit 6a1e94f into pingcap:master Sep 28, 2018

winkyao deleted the fast_failover_etcd branch September 28, 2018 09:36

winkyao mentioned this pull request Sep 28, 2018

domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774 #7810

Merged

winkyao added a commit that referenced this pull request Oct 8, 2018

domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774 (#7810)

0c8f98e

you06 added the sig/sql-infra label Mar 4, 2020
