
20x performance regression going from v6.5.2 to v6.5.3 on K8s #44715

Open
emchristiansen opened this issue Jun 15, 2023 · 5 comments
Labels
type/question The issue belongs to a question.

Comments

@emchristiansen

Bug Report

I'm using TiDB, installed on K8s using the v1.4.4 operator, without much customization (I basically followed the guides).

When I upgraded to v6.5.3 today I immediately noticed a 20x slowdown in my DB-heavy workloads.
Downgrading to v6.5.2 fixed the issue.

Peculiarities of my setup:

  1. I'm running on top of a Tailscale virtual network.
  2. I created the K8s cluster using K0s, with Calico for networking.
  3. I have one PD, KV, and DB per region, and I force reads to be local with `set global tidb_replica_read = 'closest-replicas';` (see the sketch after this list).
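
For reference, a minimal sketch of the read-locality setting from item 3; the verification query is illustrative rather than copied from my actual setup:

```sql
-- Route reads to the replica closest to the TiDB node handling the query.
SET GLOBAL tidb_replica_read = 'closest-replicas';

-- Verify the setting (a global change applies to new sessions).
SHOW VARIABLES LIKE 'tidb_replica_read';
```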

1. Minimal reproduce step (Required)

I don't have a minimal case.

2. What did you expect to see? (Required)

My particular workload normally sustains roughly 7,500 QPS per region per worker; after I upgraded to v6.5.3 it dropped to roughly 300 QPS.

@emchristiansen emchristiansen added the type/bug The issue is confirmed as a bug. label Jun 15, 2023
@seiya-annie seiya-annie added the sig/execution SIG execution label Jun 18, 2023
@Yui-Song
Contributor

Yui-Song commented Jun 19, 2023

@emchristiansen, could you please collect the necessary diagnostic data, upload it to Clinic, and post the download URL here for us? It should cover two time periods:

  1. Run your workload with v6.5.2
  2. Run your workload with v6.5.3

@zhangjinpeng87
Contributor

@emchristiansen do you use stale read in your case?

@emchristiansen
Author

emchristiansen commented Jun 21, 2023 via email

@cfzjywxk cfzjywxk added type/question The issue belongs to a question. and removed type/bug The issue is confirmed as a bug. sig/execution SIG execution labels Jun 27, 2023
@cfzjywxk
Contributor

cfzjywxk commented Jun 27, 2023

@emchristiansen

> I have one PD, KV, and DB per region

Is the cross-region latency high in your setup? How does it compare with the local-region latency?
One possible reason is that, starting from v6.5.3, a stale read is retried on the leader directly if the `dataIsNotReady` error is returned to the tidb-server. This error would be hit here because the default value of `advance-ts-interval` is 20s, so almost all of the requests would be retried on the leaders.

To resolve this, `advance-ts-interval` needs to be configured to a value smaller than your `tidb_read_staleness` so the retry is avoided. For example, if `tidb_read_staleness` is set to 5s, `advance-ts-interval` needs to be set to a smaller value such as 2s or 1s.
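
For reference, a minimal sketch of that tuning, assuming the staleness is set through `tidb_read_staleness` and that `resolved-ts.advance-ts-interval` supports online change in your TiKV version (the 5s / 1s values are only illustrative):

```sql
-- Illustrative values; pick them to fit your workload.

-- Read data that is at most 5 seconds stale (the value is a negative number of seconds).
SET SESSION tidb_read_staleness = -5;

-- Advance the resolved ts on TiKV more frequently than the read staleness,
-- so the closest replicas can serve stale reads without returning
-- dataIsNotReady. If this item cannot be changed online in your version,
-- set resolved-ts.advance-ts-interval in the TiKV configuration instead.
SET CONFIG tikv `resolved-ts.advance-ts-interval` = '1s';

-- Confirm the value on every TiKV instance.
SHOW CONFIG WHERE type = 'tikv' AND name = 'resolved-ts.advance-ts-interval';
```

Since the cluster runs on K8s with the operator, the change can also be made persistent under `spec.tikv.config` in the TidbCluster CR so the operator rolls it out.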

@you06
Contributor

you06 commented Jun 27, 2023

@emchristiansen

Can you check the TiDB / KV Request / Stale Read OPS panel in Grafana? From the hit/miss counts you can calculate the stale read hit rate; usually, setting `advance-ts-interval` to half of your staleness achieves a good hit rate.

BTW can you share the staleness of your workload with us?
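
In case it helps, one quick way to check the currently configured staleness (a sketch; it assumes the workload sets staleness through the session variable rather than explicit AS OF TIMESTAMP queries):

```sql
-- Shows the configured read staleness in seconds (a negative value; 0 means it is not set).
SHOW VARIABLES LIKE 'tidb_read_staleness';
```

The hit rate from the panel above is simply hit / (hit + miss).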
