Configurable Replica Read Timeout with Retry #44771

Open
Tema opened this issue Jun 19, 2023 · 2 comments
Labels
type/feature-request Categorizes issue or PR as related to a new feature.

Comments

Tema (Contributor) commented Jun 19, 2023

Configurable Replica Read Timeout with Retry Feature Request

Is your feature request related to a problem? Please describe:
One of the common problems when running TiDB in the cloud on network-attached disks (Amazon EBS, Google Persistent Disk, or Azure managed disks) is temporarily elevated disk I/O latency. This can happen when a cloud provider storage node fails and goes through a repair procedure. During the repair phase, a network-attached disk can exhibit 100 ms or even single-digit-second latency, versus single-digit-millisecond latency under normal conditions.

Describe the feature you'd like:
If a TiDB customer uses the Follower Read or Stale Read feature, it is possible to retry a request that initially landed on a TiKV node whose network-attached disk exhibits elevated latency on another TiKV replica. While a retry policy already exists in the tikv go-client, the default network timeout is tens of seconds.

OLTP workloads on TiDB could benefit from the introduction of a system variable, tidb_tikv_read_timeout, which would be passed as a context timeout on TiKV requests made by the TiDB layer, relying on the existing replica selector logic to retry requests on other replicas. The implementation of this feature also needs to take care of the following:

Describe alternatives you've considered:
TiDB already has a max_execution_time system variable, but it is not used as a context deadline in the go-client for network calls from TiDB to TiKV. Moreover, if a TiKV request takes longer than max_execution_time, the session is marked as killed and a retry won't happen.

Teachability, Documentation, Adoption, Migration Strategy:
The feature would be fully controlled by the session variable tidb_tikv_read_timeout.

Tema added the type/feature-request label on Jun 19, 2023
ti-chi-bot bot pushed a commit that referenced this issue Jul 10, 2023
easonn7 commented Aug 4, 2023

Controlling the timeout behavior of the tikv-client is reasonable and requires such a parameter. However, the newly added variable overlaps with the existing variable tidb_load_based_replica_read_threshold. I personally suggest keeping only tidb_tikv_read_timeout and gradually deprecating tidb_load_based_replica_read_threshold in the future.
