
ebs br: restore could hang if some tikv nodes are killed or restarted #45206

Closed

BornChanger opened this issue Jul 6, 2023 · 4 comments · Fixed by #45361
Assignees
Labels
affects-6.5 affects-7.1 component/br This issue is related to BR of TiDB. severity/critical type/bug The issue is confirmed as a bug.

Comments

@BornChanger
Contributor

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

Kill some TiKV nodes during the EBS BR restore phase.

2. What did you expect to see? (Required)

The EBS BR restore continues and succeeds.

3. What did you see instead? (Required)

The EBS BR restore hangs.

4. What is your TiDB version? (Required)

TiDB 6.5 and above

@BornChanger BornChanger added the type/bug The issue is confirmed as a bug. label Jul 6, 2023
@BornChanger
Contributor Author

/assign @YuJuncen

@jebter jebter added severity/critical component/br This issue is related to BR of TiDB. labels Jul 7, 2023
@ti-chi-bot ti-chi-bot bot added may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Jul 7, 2023
@jebter jebter added affects-6.5 affects-7.1 and removed may-affects-5.2 This bug maybe affects 5.2.x versions. may-affects-5.3 This bug maybe affects 5.3.x versions. may-affects-5.4 This bug maybe affects 5.4.x versions. may-affects-6.1 may-affects-6.5 may-affects-7.1 labels Jul 7, 2023
@YuJuncen
Contributor

This is because, while we are in recovery mode, all elections are suspended until BR chooses the leaders. The problem is that AFTER BR has chosen the leaders, a store goes down. Once it reboots, the leaders it was hosting are dropped, but we are still in recovery mode, so no new leaders can be elected.
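
A minimal sketch of the hang, assuming a hypothetical `regionsWithoutLeader` helper (this is not BR's actual leader-waiting code): once the leaders on a restarted store are gone and elections stay suppressed, the set of leaderless regions never shrinks, so any wait loop like the one below spins forever.

```go
package main

import (
	"fmt"
	"time"
)

// Region is a simplified stand-in for a TiKV region descriptor.
type Region struct {
	ID        uint64
	HasLeader bool
}

// regionsWithoutLeader is a hypothetical helper standing in for whatever
// PD query BR uses; it returns the regions that currently have no leader.
func regionsWithoutLeader(all []Region) []Region {
	var pending []Region
	for _, r := range all {
		if !r.HasLeader {
			pending = append(pending, r)
		}
	}
	return pending
}

func main() {
	// Suppose the store hosting region 2's leader restarted after BR picked
	// leaders: that region lost its leader, and elections are suspended.
	regions := []Region{{ID: 1, HasLeader: true}, {ID: 2, HasLeader: false}}

	for {
		pending := regionsWithoutLeader(regions)
		if len(pending) == 0 {
			break // every region has a leader, restore can proceed
		}
		// Nothing re-elects a leader for region 2 while recovery mode is on,
		// so this loop never terminates.
		fmt.Printf("still waiting for %d region(s) to get a leader\n", len(pending))
		time.Sleep(time.Second)
	}
}
```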

@YuJuncen
Contributor

A solution might be to extend recovery mode so that it has 3 stages (sketched in the Go snippet after this list):

  • on: the initial stage, which stops Raft elections and optimizes TiKV for flashback.
  • for_flashback: once BR finishes the wait_apply RPC, it issues an RPC to PD that updates the recovery-mode state to for_flashback, without rebooting the TiKVs. That means the config of most TiKVs will still be in the on stage, but rebooted stores can now start elections.
  • off or unset: the default, using the unchanged config.
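
A rough sketch of the proposed staging, using hypothetical names (`RecoveryModeState`, `electionAllowed`) rather than any existing PD/TiKV API: recovery mode becomes a three-valued state, and election suppression keys off that state instead of a single on/off flag.

```go
package main

import "fmt"

// RecoveryModeState is a hypothetical three-stage replacement for the
// current on/off recovery-mode flag (names are illustrative only).
type RecoveryModeState int

const (
	// Off (or unset): the default, using the unchanged config.
	Off RecoveryModeState = iota
	// On: the initial stage; Raft elections are stopped and TiKV is
	// tuned for flashback.
	On
	// ForFlashback: entered once BR finishes the wait_apply RPC;
	// rebooted stores may start elections again.
	ForFlashback
)

// electionAllowed captures the behavioural difference between the stages:
// only the initial "on" stage keeps elections suspended.
func electionAllowed(s RecoveryModeState) bool {
	return s != On
}

func main() {
	for _, s := range []RecoveryModeState{On, ForFlashback, Off} {
		fmt.Printf("state=%d electionAllowed=%v\n", s, electionAllowed(s))
	}
}
```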

Once BR detects a TiKV outage (maybe by creating a no-op TCP connection to the gRPC port of each TiKV), BR will (see the sketch after this list):

  • If the current recovery-mode state is on, retry the whole procedure. (This might be implemented by exiting and letting the operator restart it.)
  • If the current recovery-mode state is for_flashback, retry from flashback. The operator should reboot all stores so that they can elect new leaders; if not all stores are rebooted, the remaining stores will reject votes because they believe the old leader's lease hasn't expired. (Perhaps we also need to resume balance-leader-scheduler at this stage.)
  • If the current recovery-mode state is off, do nothing and exit. (We have already succeeded!)
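
A sketch of the outage detection and per-state reaction, again with hypothetical names: the probe is just a plain TCP dial to each store's gRPC port as suggested above, and the returned errors only describe what BR or the operator would need to do next.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// RecoveryModeState is the same hypothetical three-stage state as in the
// earlier sketch.
type RecoveryModeState int

const (
	Off RecoveryModeState = iota
	On
	ForFlashback
)

// storeReachable is the "no-op TCP connection" probe: dial the store's
// gRPC address and close the connection immediately.
func storeReachable(grpcAddr string) bool {
	conn, err := net.DialTimeout("tcp", grpcAddr, 3*time.Second)
	if err != nil {
		return false
	}
	_ = conn.Close()
	return true
}

// onStoreOutage sketches the per-state reaction described in the list above.
func onStoreOutage(state RecoveryModeState) error {
	switch state {
	case On:
		// Retry the whole procedure, e.g. exit and let the operator
		// restart the restore.
		return errors.New("store outage while recovery mode is on: restart the restore")
	case ForFlashback:
		// Retry from flashback; the operator should reboot all stores so
		// new leaders can be elected, and resume balance-leader-scheduler.
		return errors.New("store outage during flashback: reboot all stores and retry flashback")
	default:
		// Off: the restore has already finished, nothing to do.
		return nil
	}
}

func main() {
	// Example: probe one store (address is illustrative) and react
	// according to the current state.
	if !storeReachable("tikv-0.tikv-peer:20160") {
		fmt.Println(onStoreOutage(ForFlashback))
	}
}
```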

@YuJuncen
Contributor

cc @hicqu, do you have any good ideas?
