Gracefully stop tidb pod #2597

Closed
tennix opened this issue Jun 1, 2020 · 5 comments · Fixed by #2810
Labels: enhancement (New feature or request), priority:P2, status/help-wanted (Extra attention is needed)
tennix (Member) commented Jun 1, 2020

Feature Request

When a tidb pod is stopped, kubelet first sends a TERM signal to tidb-server. tidb-server waits 15s, then closes all SQL connections and performs the rest of its cleanup. If the cleanup does not finish within the grace period, kubelet sends a KILL signal and tidb-server exits immediately.

During that window, tidb-server fails to respond to clients, so clients time out or fail, which increases latency.

We might consider using a readiness probe to let the load balancer (internal or external) remove the backend before the server stops responding to clients. This would help reduce latency during tidb pod shutdown.

The key point is letting the readiness probe fail while the server is still able to receive requests.

ref: https://github.com/pingcap/tidb/blob/v4.0.1/server/server.go#L553-L557
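
A minimal sketch of what such a probe could look like, assuming it targets tidb-server's status port (10080) and that tidb-server starts failing the probe during graceful shutdown while the SQL port keeps serving in-flight queries (both assumptions, not confirmed behavior):

readinessProbe:
  httpGet:
    path: /status        # assumed endpoint; a dedicated health path may be better
    port: 10080
  periodSeconds: 2       # probe often so the LB reacts quickly
  failureThreshold: 1    # drop the backend after one failed probe

With settings like these, the backend would be removed from the Service within a few seconds of the probe failing, before connections start being refused.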

tennix added the enhancement label Jun 1, 2020
zjj2wry (Contributor) commented Jun 1, 2020

Maybe we can add a preStop lifecycle hook to guarantee tidb-server has enough time to close all SQL connections gracefully:

lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "sleep 15"]

cofyc added the status/help-wanted label Jun 8, 2020
DanielZhangQD added this to the v1.1.2 milestone Jun 16, 2020
cofyc (Contributor) commented Jun 16, 2020

Maybe we can add a preStop lifecycle hook to guarantee tidb-server has enough time to close all SQL connections gracefully:

lifecycle:
  preStop:
    exec:
      command: ["/bin/bash", "-c", "sleep 15"]

This seems like a good solution. In addition to the built-in 15-second wait time, users can configure extra wait time in the preStop hook.
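
For context, a sketch of where such a hook would sit in a plain Kubernetes container spec; the 30-second sleep and the image tag are illustrative only:

spec:
  # Must cover the preStop sleep plus tidb-server's own shutdown time.
  terminationGracePeriodSeconds: 60
  containers:
  - name: tidb
    image: pingcap/tidb:v4.0.1
    lifecycle:
      preStop:
        exec:
          command: ["/bin/bash", "-c", "sleep 30"]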

The key point is letting the readiness probe fail while the server is still able to receive requests.

When a pod is about to be deleted (DeletionTimestamp != nil), the endpoints controller ignores its IP and removes it from the endpoints entirely.
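
This can be checked directly; a sketch assuming a Service (and thus an Endpoints object) named basic-tidb, which is an illustrative name:

# Before deletion: the pod IP is listed as a ready address.
$ kubectl get endpoints basic-tidb
NAME         ENDPOINTS             AGE
basic-tidb   10.233.123.121:4000   1d

# After the pod's DeletionTimestamp is set (e.g. by kubectl delete pod),
# the IP disappears from ENDPOINTS even while the container is still
# running out its termination grace period.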

weekface (Contributor) commented Jun 22, 2020

When we start to update TiDB, the StatefulSet controller issues a command to delete the Pod. The Pod is removed from the Service's endpoints list and is no longer considered part of the set of running Pods, so the LB (internal or external) can remove the backend.

Next, we should wait for the number of connections to reach 0 and then stop the pod:

  • Set the TiDB Pod's spec.terminationGracePeriodSeconds large enough (for example, one hour, or make it configurable);
  • Configure the preStop hook script:
    1. Send a SIGQUIT signal to the TiDB process; it then shuts down gracefully, waiting for all clients to close their connections;
    2. Poll the current connection count from TiDB's /status API and wait for it to reach 0 (a sketch of such a script follows the /status example below).

TiDB's /status API reports the current connection count:

$ curl 10.233.123.121:10080/status
{"connections":1,"version":"5.7.25-TiDB-v4.0.0-rc-141-g7267747ae","git_hash":"7267747ae0ec624dffc3fdedb00f1ed36e10284b"}

cofyc (Contributor) commented Jun 22, 2020

Maybe we can just allow users to specify container lifecycle hooks for our components.

We can provide an example (and docs, etc.) for this scenario, but we shouldn't limit the possibilities: if an application tolerates connection failures, it may not need extra wait time, or may not need to wait for all connections to be closed.
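
If the operator simply passed a user-specified hook through to the generated container spec, a TidbCluster manifest might look like this; the lifecycle field under spec.tidb is hypothetical here, not an existing API, and /scripts/graceful-stop.sh is a placeholder:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  tidb:
    replicas: 3
    lifecycle:           # hypothetical passthrough to the tidb container
      preStop:
        exec:
          command: ["/bin/bash", "-c", "/scripts/graceful-stop.sh"]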

weekface self-assigned this Jun 22, 2020
DanielZhangQD (Contributor) commented

container lifecycle hooks

Agree.

As for the SIGQUIT signal, TiDB should handle SIGTERM the same way; pingcap/tidb#13369 was created a long time ago, but there seems to have been no action from TiDB.

BTW, we cannot set spec.terminationGracePeriodSeconds too long, otherwise it may block the upgrade for too much time.
