From b0406712632618fe8bce98513eb327fe7e81a192 Mon Sep 17 00:00:00 2001
From: cfzjywxk
Date: Mon, 10 Jul 2023 10:27:41 +0800
Subject: [PATCH] doc: rfc for configurable kv timeout (#45093)

ref pingcap/tidb#44771
---
 .../2023-06-30-configurable-kv-timeout.md | 253 ++++++++++++++++++
 1 file changed, 253 insertions(+)
 create mode 100644 docs/design/2023-06-30-configurable-kv-timeout.md

diff --git a/docs/design/2023-06-30-configurable-kv-timeout.md b/docs/design/2023-06-30-configurable-kv-timeout.md
new file mode 100644
index 0000000000000..c21700be6c5dc
--- /dev/null
+++ b/docs/design/2023-06-30-configurable-kv-timeout.md
@@ -0,0 +1,253 @@

# Proposal: Configurable KV Timeout

* Authors: [cfzjywxk](https://github.com/cfzjywxk)
* Tracking issue: TODO

## Motivation & Background

TiKV requests are typically processed quickly, taking only a few milliseconds. However, if
there is disk IO or network latency jitter on a specific TiKV node, the duration of these requests may spike
to several seconds or more. This usually happens when there is an EBS issue, such as the EBS volume
repairing its replica.

Currently, the timeout limits for TiKV requests are fixed and cannot be adjusted. For example:
- `ReadTimeoutShort`: [30 seconds](https://github.com/tikv/client-go/blob/ce9203ef66e99dcc8b14f68777c520830ba99099/internal/client/client.go#L82), which is used for the Get RPC request.
- `ReadTimeoutMedium`: [60 seconds](https://github.com/tikv/client-go/blob/ce9203ef66e99dcc8b14f68777c520830ba99099/internal/client/client.go#L83), which is used for the BatchGet and Cop RPC requests.

The kv-client or TiDB may wait for responses from TiKV nodes until the timeout is reached; it cannot
retry or redirect the requests to other nodes quickly because the timeout configuration is usually
tens of seconds. As a result, the application may experience a query latency spike when TiKV node jitter is
encountered.

For latency-sensitive applications, providing predictable sub-second read latency by retrying quickly
is valuable. For queries that are usually very fast (such as point-select queries), setting
`tidb_kv_read_timeout` to a short value like `500ms` makes the TiDB cluster more tolerant of
network or IO latency jitter on a single storage node, because retries happen sooner.

This is especially helpful when a request can be served by more than one candidate target, for example
follower reads or stale reads that can be handled by multiple available peers.

## Improvement Proposal

A possible improvement suggested by [#44771](https://github.com/pingcap/tidb/issues/44771) is to make the
timeout values of specific KV requests configurable. For example:
- Adding a session variable `tidb_kv_read_timeout`, which controls the timeout for a single
TiKV read RPC request. When the user sets this variable, all read RPC requests in the session use this
timeout. The default value is 0, which means TiKV read RPC requests keep the original
`ReadTimeoutShort` and `ReadTimeoutMedium` values (a precedence sketch follows the usage example below).
- Adding a statement-level hint like `SELECT /*+ tidb_kv_read_timeout(500ms) */ * FROM t WHERE id = ?;` to
set the timeout of the KV requests issued by this single query.

### Example Usage

Consider the stale read usage case:

```SQL
set @@tidb_read_staleness=-5;
# The session variable usage. The unit is milliseconds.
set @@tidb_kv_read_timeout=500;
select * from t where id = 1;
# The query hint usage.
select /*+ tidb_kv_read_timeout(500ms) */ * from t where id = 1;
```
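
To make the intended precedence between the hint, the session variable, and the built-in defaults concrete,
here is a minimal Go sketch. It is illustrative only: the helper name and the zero-means-default convention
are assumptions based on the description above, not the actual TiDB implementation.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveKVReadTimeout is a hypothetical helper illustrating the intended
// precedence: a statement-level hint overrides the session variable, and a
// zero value falls back to the built-in default (ReadTimeoutShort or
// ReadTimeoutMedium, depending on the request type).
func effectiveKVReadTimeout(hint, sessionVar, builtinDefault time.Duration) time.Duration {
	if hint > 0 {
		return hint
	}
	if sessionVar > 0 {
		return sessionVar
	}
	return builtinDefault
}

func main() {
	readTimeoutShort := 30 * time.Second
	// A 500ms hint wins over everything else.
	fmt.Println(effectiveKVReadTimeout(500*time.Millisecond, 0, readTimeoutShort))
	// Neither hint nor session variable set: keep the original default.
	fmt.Println(effectiveKVReadTimeout(0, 0, readTimeoutShort))
}
```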

### Problems

- Setting the session variable `tidb_kv_read_timeout` may not be easy, because it affects the timeout for all
TiKV read requests in the session, such as Get, BatchGet, and Cop. A timeout of 1 second may be sufficient for
Get requests, but may be too small for Cop requests. Some large Cop requests may keep timing out and never be
processed successfully.
- If the value of the variable or query hint `tidb_kv_read_timeout` is set too small, more retries will occur,
increasing the load on the TiDB cluster. In the worst case, if `tidb_kv_read_timeout` is set to a value within
which none of the replicas can finish processing, the query does not return until the maximum backoff timeout
is reached.


## Detailed Design

The query hint usage would be more flexible and safer, as its impact is limited to a single query.

### Add Hint Support For `tidb_kv_read_timeout`

- Add a related field to the `StmtHints` struct in the statement context:
```go
type StmtHints struct {
	// ...
	KVReadTimeout time.Duration
}
```
- Support `tidb_kv_read_timeout` processing in the `ExtractTableHintsFromStmtNode` and `handleStmtHints` functions,
converting the user-supplied hint value into the corresponding field of the statement context so it can be used later.

### Support Timeout Configuration For Get And Batch-Get

- Add a related field to the `KVSnapshot` struct:

```go
type KVSnapshot struct {
	// ...
	KVReadTimeout time.Duration
}
```

This change is to be done in the `client-go` repository.

- Add a set interface `SetKVReadTimeout` and a get interface `GetKVReadTimeout` for `KVSnapshot`:
```go
func (s *KVSnapshot) SetKVReadTimeout(timeout time.Duration) {
	s.KVReadTimeout = timeout
}

func (s *KVSnapshot) GetKVReadTimeout() time.Duration {
	return s.KVReadTimeout
}
```

This change is to be done in both the `client-go` and `tidb` repositories.

- Call the set interface to pass the timeout value from the statement context to the `KVSnapshot` structures when
the `PointGet` and `BatchPointGet` executors are built (see the wiring sketch at the end of this section).
- Pass the timeout value into the `Context.max_execution_duration_ms` field of the RPC `Context`:
```protobuf
message Context {
    // ...
    uint64 max_execution_duration_ms = 14;
    // ...
}
```
Note that currently this field is used only by write requests such as `Prewrite` and `Commit`.

- Support timeout checks during request handling in TiKV. When new point-get and batch-point-get
requests are received, the `kv_read_timeout` value should be read from the `GetRequest` and `BatchGetRequest`
messages and passed to the `PointGetter`. Currently there is no canceling mechanism once a task is scheduled to the
read pool in TiKV, so a simpler way is to check the deadline before the next processing step and return
directly if the deadline is exceeded; once the read task is spawned to the read pool, it cannot be
canceled anymore.
```rust
pub fn get(
    &self,
    mut ctx: Context,
    key: Key,
    start_ts: TimeStamp,
) -> impl Future<Output = Result<(Option<Value>, KvGetStatistics)>> {
    // ...
    let res = self.read_pool.spawn_handle(
        async move {
            // TODO: Add timeout check here.
            let snapshot =
                Self::with_tls_engine(|engine| Self::snapshot(engine, snap_ctx)).await?;
            {
                // TODO: Add timeout check here.
                // ...
                let result = Self::with_perf_context(CMD, || {
                    let _guard = sample.observe_cpu();
                    let snap_store = SnapshotStore::new(
                        snapshot,
                        start_ts,
                        ctx.get_isolation_level(),
                        !ctx.get_not_fill_cache(),
                        bypass_locks,
                        access_locks,
                        false,
                    );
                    // TODO: Add timeout check here.
                    snap_store
                        .get(&key, &mut statistics)
                });
                // ...
            }
        },
        // ...
    );
    // ...
}
```
These changes need to be done in the `tikv` repository.
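
As a rough illustration of the wiring described in this section, the following self-contained Go sketch uses
stand-in types for `StmtHints` and `KVSnapshot` (the real types live in the `tidb` and `client-go`
repositories); the `buildPointGet` helper is hypothetical and only shows how the hint value could be
forwarded into the snapshot before any Get/BatchGet RPC is sent.

```go
package main

import (
	"fmt"
	"time"
)

// Minimal stand-ins for the structures named in this section.
type StmtHints struct {
	KVReadTimeout time.Duration
}

type KVSnapshot struct {
	KVReadTimeout time.Duration
}

func (s *KVSnapshot) SetKVReadTimeout(d time.Duration) { s.KVReadTimeout = d }
func (s *KVSnapshot) GetKVReadTimeout() time.Duration  { return s.KVReadTimeout }

// buildPointGet sketches how the executor builder could forward the hint
// value into the snapshot; a zero value keeps the built-in defaults.
func buildPointGet(hints StmtHints, snap *KVSnapshot) {
	if hints.KVReadTimeout > 0 {
		snap.SetKVReadTimeout(hints.KVReadTimeout)
	}
}

func main() {
	snap := &KVSnapshot{}
	buildPointGet(StmtHints{KVReadTimeout: 500 * time.Millisecond}, snap)
	fmt.Println(snap.GetKVReadTimeout()) // 500ms
}
```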

### Support Timeout Configuration For Coprocessor Tasks

- Add a related field to the `copTask` struct:
```go
type copTask struct {
	// ...
	KVReadTimeout time.Duration
}
```
- Pass the timeout value into the `Context.max_execution_duration_ms` field:
```protobuf
message Context {
    // ...
    uint64 max_execution_duration_ms = 14;
    // ...
}
```

- Use the `kv_read_timeout` value passed in to calculate the `deadline` result in `parse_and_handle_unary_request`:
```rust
fn parse_request_and_check_memory_locks(
    &self,
    mut req: coppb::Request,
    peer: Option<String>,
    is_streaming: bool,
) -> Result<(RequestHandlerBuilder<E::Snap>, ReqContext)> {
    // ...
    let req_ctx = ReqContext::new(
        tag,
        context,
        ranges,
        self.max_handle_duration, // Use the specified kv_read_timeout value here instead.
        peer,
        Some(is_desc_scan),
        start_ts.into(),
        cache_match_version,
        self.perf_level,
    );
    // ...
}
```
This change needs to be done in the `tikv` repository.

### Timeout Retry

- When a timeout happens, retry the next available peer for read requests such as stale reads and `non-leader-only`
snapshot reads. The strategy is (a sketch follows this list):
  1. If the first request times out, immediately try the next peer.
  2. If the responses from all peers are `timeout` errors, return a `timeout` error to the upper layer.
  3. Back off on the `timeout` error and retry the requests with **the original default** `ReadTimeoutShort`
  and `ReadTimeoutMedium` values.
  4. Continue processing as before.
- The timeout retry details should be recorded and handled properly in the request tracker, and so should the
related metrics.
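
The following Go sketch summarizes the retry flow above in simplified form; the function and error names are
hypothetical, and real code would integrate with the kv-client's replica selector, backoff, and request
tracker rather than a plain loop.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout stands in for the RPC timeout error returned by a single peer.
var errTimeout = errors.New("kv request timed out")

// sendReadRequest is a stand-in for one RPC attempt against a single peer.
type sendReadRequest func(peer string, timeout time.Duration) error

// readWithTimeoutRetry sketches the strategy above: try each candidate peer
// with the configured tidb_kv_read_timeout; if every peer times out, back off
// (omitted here) and retry with the original default timeout.
func readWithTimeoutRetry(peers []string, send sendReadRequest,
	kvReadTimeout, defaultTimeout time.Duration) error {
	for _, peer := range peers {
		err := send(peer, kvReadTimeout)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errTimeout) {
			return err // non-timeout errors keep their existing handling
		}
		// Timeout: immediately try the next available peer.
	}
	// All peers timed out: report the timeout, back off, then retry with the
	// original default ReadTimeoutShort / ReadTimeoutMedium value.
	for _, peer := range peers {
		if err := send(peer, defaultTimeout); err == nil {
			return nil
		}
	}
	return errTimeout
}

func main() {
	slowPeers := func(peer string, timeout time.Duration) error {
		if timeout < time.Second {
			return errTimeout // pretend every peer needs more than 500ms
		}
		return nil
	}
	err := readWithTimeoutRetry([]string{"tikv-1", "tikv-2", "tikv-3"},
		slowPeers, 500*time.Millisecond, 30*time.Second)
	fmt.Println(err) // <nil>: succeeded after falling back to the default timeout
}
```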

### Timeout Task Canceling

- If the timeout is configured to a small value like hundreds of milliseconds, there could be more
timed-out tasks. Once a task is spawned to the read pool in TiKV, it cannot be canceled immediately, for
several reasons:
  - When the read task is polled and executed, the task scheduler currently has no timeout-checking
  mechanism.
  - The read operations are synchronous, so if the read task is blocked by slow IO, it cannot be
  rescheduled or canceled.
- Some simple timeout checks could be added in certain places, such as after getting the snapshot, but a
complete timeout-checking or scheduling design is still needed in the future. The preconditions are complex
and require a lot of work, including making read processing asynchronous in TiKV.


### Compatibility

The configurable timeout values take effect only on newer versions. When the cluster is downgraded, these
fields are expected to be ignored.

## Alternative Solutions

### Derived Timeouts

There is already a query-level timeout configuration, `max_execution_time`, which can be configured using
variables and query hints.

If the timeouts of RPC or TiKV requests could be derived from `max_execution_time` in an intelligent way,
there would be no need to expose another variable or hint like `tidb_kv_read_timeout`.

For example, consider this strategy (sketched after this section):
- `max_execution_time` is configured to `1s` at the query level.
- The first RPC timeout is set to some proportion of `max_execution_time`, such as `500ms`.
- When the first RPC times out, the retry RPC timeout is set to the remaining budget of `max_execution_time`,
such as `400ms`.

From the application's perspective, TiDB tries to finish the request as soon as possible given the
input `max_execution_time`, and when TiDB cannot finish in time, a timeout error is returned
to the application and the query fails fast.

Though this is easier to use and would not introduce more configurations or variables, it is difficult to make
the timeout calculation intelligent or automatic, considering factors such as different request types,
node pressure, and latencies. A lot of experiments would be needed to figure out an appropriate strategy.
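
A minimal Go sketch of this derived-timeout idea, assuming a simple fixed proportion for the first attempt
and the remaining budget for retries; the 50% ratio and the helper name are illustrative assumptions only,
not a proposed implementation.

```go
package main

import (
	"fmt"
	"time"
)

// deriveRPCTimeout sketches how per-RPC timeouts could be derived from the
// query-level max_execution_time: the first attempt gets a fixed proportion
// of the budget, and each retry gets whatever budget is left.
func deriveRPCTimeout(maxExecutionTime, elapsed time.Duration, firstAttempt bool) time.Duration {
	remaining := maxExecutionTime - elapsed
	if remaining <= 0 {
		return 0 // budget exhausted: fail the query fast
	}
	if firstAttempt {
		return remaining / 2
	}
	return remaining
}

func main() {
	maxExec := time.Second
	// First RPC: half of the 1s budget, i.e. 500ms.
	fmt.Println(deriveRPCTimeout(maxExec, 0, true))
	// Retry after 600ms has already elapsed: the remaining 400ms.
	fmt.Println(deriveRPCTimeout(maxExec, 600*time.Millisecond, false))
}
```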