Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774

Merged
merged 4 commits into from
Sep 28, 2018
Merged

domain: fast new a etcd session when the session is stale in the schemaVersionSyncer #7774

merged 4 commits into from
Sep 28, 2018

Conversation

winkyao
Copy link
Contributor

@winkyao winkyao commented Sep 25, 2018

Fix issue that when the pd leader is down, the etcd session will be stale, and tidb may costs several minutes to create a new session to the etcd in some case.

I have done a manual test using ansible, with inventory.ini:

## TiDB Cluster Part
[tidb_servers]
192.168.59.105

[tikv_servers]
192.168.59.105

[pd_servers]
192.168.59.105
192.168.59.103

[spark_master]

[spark_slaves]

## Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
192.168.59.105

[grafana_servers]
192.168.59.105

# node_exporter and blackbox_exporter servers
[monitored_servers]
192.168.59.105

[alertmanager_servers]
192.168.59.105

[kafka_exporter_servers]

## Binlog Part
[pump_servers:children]
tidb_servers

[drainer_servers]

## Group variables
[pd_servers:vars]
# location_labels = ["zone","rack","host"]

## Global variables
[all:vars]
deploy_dir = /home/tidb/deploy

## Connection
# ssh via normal user
ansible_user = wink

cluster_name = test-cluster

tidb_version = v2.0.1

# process supervision, [systemd, supervise]
process_supervision = systemd

# timezone of deployment region
timezone = Asia/Shanghai
set_timezone = True

enable_firewalld = False
# check NTP service
enable_ntpd = True
set_hostname = False
# CPU, memory and disk performance will not be checked when dev_mode = True
dev_mode = False

## binlog trigger
enable_binlog = False
# zookeeper address of kafka cluster for binlog, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = ""
# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = ""

# store slow query log into seperate file
enable_slow_query_log = False

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# Optional: Set if you already have a alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "admin"


### Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

There are 2 pd, 1 tidb, 1 tikv in the test cluster, I start a client which continue inserting data into tidb, and I kill -9 the pd leader, we can see the tidb report errors:

2018/09/28 11:53:58.800 domain.go:417: [info] [ddl] reload schema in loop, server info syncer need restart

After the pd recover: 2018/09/28 11:54:59.562 leader.go:269: [info] PD cluster leader pd1 is ready to serve, the etcd session is recovered:

2018/09/28 11:54:59.545 domain.go:419: [info] [ddl] server info syncer restarted.

It costs several millseconds to recover.

@@ -88,7 +90,7 @@ type SchemaSyncer interface {
type schemaVersionSyncer struct {
selfSchemaVerPath string
etcdCli *clientv3.Client
session *concurrency.Session
session unsafe.Pointer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

session is accessed in multi-thread.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you talk about it in detail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loadSnapshotInfoSchemaIfNeeded will use the session, and loadSchemaLoop will too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.

@winkyao
Copy link
Contributor Author

winkyao commented Sep 28, 2018

@zimulala @crazycs520 @shenli PTAL

Copy link
Contributor

@crazycs520 crazycs520 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@winkyao winkyao added priority/release-blocker This issue blocks a release. Please solve it ASAP. component/DDL-need-LGT3 labels Sep 28, 2018
Copy link
Contributor

@ciscoxll ciscoxll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ciscoxll
Copy link
Contributor

@zimulala PTAL.

@winkyao winkyao added the status/LGT2 Indicates that a PR has LGTM 2. label Sep 28, 2018
err := do.mustRestartSyncer()
if err != nil {
log.Errorf("[ddl] reload schema in loop, schema syncer restart err %v", errors.ErrorStack(err))
break
}
do.SchemaValidator.Restart()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to add metrics here?

Copy link
Contributor Author

@winkyao winkyao Sep 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already have metrics.NewSessionHistogram and metrics.DeploySyncerHistogram, I think there is no need to add here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

@winkyao
Copy link
Contributor Author

winkyao commented Sep 28, 2018

/run-all-tests

Copy link
Contributor

@zimulala zimulala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -88,7 +90,7 @@ type SchemaSyncer interface {
type schemaVersionSyncer struct {
selfSchemaVerPath string
etcdCli *clientv3.Client
session *concurrency.Session
session unsafe.Pointer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha.

@winkyao winkyao merged commit 6a1e94f into pingcap:master Sep 28, 2018
@winkyao winkyao deleted the fast_failover_etcd branch September 28, 2018 09:36
winkyao added a commit that referenced this pull request Oct 8, 2018
@you06 you06 added the sig/sql-infra SIG: SQL Infra label Mar 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority/release-blocker This issue blocks a release. Please solve it ASAP. sig/sql-infra SIG: SQL Infra status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants