ebs_br: allow temporary TiKV unreachable during starting snapshot backup #49154

Merged
merged 33 commits into from
Jan 15, 2024

Conversation

YuJuncen
Contributor

@YuJuncen YuJuncen commented Dec 4, 2023

What problem does this PR solve?

Issue Number: close #49152, close #49153

Problem Summary:
See the issue.
For #49152, we didn't retry when starting to suspend lightning.
For #49153, we simply broke out of the loop when the keeper encountered an error. This may let the final consistency check pass because of the extend-lease request.

What changed and how does it work?

Fixed the problems above by retrying and failing fast.
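
The "fail fast" half of the fix can be illustrated with a minimal sketch. This is not the code from this PR: `keepLease`, `demoKeeper`, and the fake `extend` callback are hypothetical names chosen for illustration. The assumption is that a keeper goroutine periodically extends a lease, and that the first renewal error should be returned to the caller rather than silently ending the loop.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// keepLease renews a lease every interval until ctx is cancelled.
// It fails fast: the first renewal error is returned to the caller instead of
// silently ending the loop, so the caller can abort the backup.
// (Hypothetical helper, not the PR's implementation.)
func keepLease(ctx context.Context, interval time.Duration, extend func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil // normal shutdown requested by the caller
		case <-ticker.C:
			if err := extend(ctx); err != nil {
				return fmt.Errorf("lease keeper failed: %w", err)
			}
		}
	}
}

// demoKeeper runs keepLease with a fake extend call that fails on call number
// failAt (0 means it never fails). It returns how many times extend was called
// and the keeper's result.
func demoKeeper(failAt int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	calls := 0
	err := keepLease(ctx, time.Millisecond, func(context.Context) error {
		calls++
		if calls == failAt {
			return errors.New("store unreachable")
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demoKeeper(3)
	fmt.Printf("extend calls: %d, keeper error: %v\n", calls, err)
}
```

Returning the error, instead of using a bare `break`, lets the caller distinguish a keeper that stopped on request from one that stopped because a store was unreachable.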

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Fixed a bug that may cause EBS snapshot backup to not work properly during a TiKV outage.

Signed-off-by: hillium <yujuncen@pingcap.com>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 4, 2023

tiprow bot commented Dec 4, 2023

Hi @YuJuncen. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. and removed do-not-merge/needs-triage-completed labels Dec 5, 2023
Contributor

nkg- commented Dec 6, 2023

With this change, what is the maximum time that the evict-leader scheduler (and other schedulers) can remain paused? If a TiKV has already been restarted, the init pod will wait for it to come up and then suspend lightning. And during this time, all schedulers/gc/suspend will be paused, right?

@YuJuncen YuJuncen changed the title snap_br: allow temporary TiKV unreachable during starting snapshot backup ebs_br: allow temporary TiKV unreachable during starting snapshot backup Dec 6, 2023
Contributor Author

YuJuncen commented Dec 6, 2023

With this change, what is the maximum time that the evict-leader scheduler (and other schedulers) can remain paused?

If every request to suspend lightning fails immediately, we will keep retrying for about 10 minutes. Calls that get stuck may take longer than that. If the GC stop time is critical, perhaps we can base the retry on the time spent on failed requests instead of on the failure count.

If a TiKV has already been restarted, the init pod will wait for it to come up and then suspend lightning. And during this time, all schedulers/gc/suspend will be paused, right?

Yes.
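
To make the point about stuck calls concrete, here is a small arithmetic sketch. The numbers are assumptions picked only so that the immediate-failure case comes out at the "about 10 mins" mentioned above. The PR's real attempt count and backoff are not stated in this conversation, and `worstCasePause` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"time"
)

// worstCasePause estimates how long the cluster stays paused when a request is
// retried a fixed number of times: each attempt costs the time the call itself
// takes plus the wait before the next attempt.
func worstCasePause(attempts int, backoff, perCall time.Duration) time.Duration {
	return time.Duration(attempts) * (backoff + perCall)
}

// Assumed, illustrative parameters: 60 attempts with a 10s wait between them.
var (
	failFast = worstCasePause(60, 10*time.Second, 0)              // calls fail immediately
	stuck    = worstCasePause(60, 10*time.Second, 30*time.Second) // calls hang 30s each
)

func main() {
	fmt.Println("calls fail immediately:", failFast) // 10m0s
	fmt.Println("calls hang for 30s each:", stuck)   // 40m0s
}
```

With a count-based limit the total pause grows with the latency of each failed call, which is why a time-based limit gives a tighter bound.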

@BornChanger
Contributor

/retesst


codecov bot commented Dec 6, 2023

Codecov Report

Merging #49154 (31e2fce) into master (695d162) will decrease coverage by 18.0745%.
Report is 32 commits behind head on master.
The diff coverage is 78.2258%.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #49154         +/-   ##
=================================================
- Coverage   71.8223%   53.7478%   -18.0745%     
=================================================
  Files          1444       1549        +105     
  Lines        346984     583242     +236258     
=================================================
+ Hits         249212     313480      +64268     
- Misses        77425     245816     +168391     
- Partials      20347      23946       +3599     
Flag Coverage Δ
integration 20.9109% <67.7419%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 54.0269% <ø> (-2.2860%) ⬇️
parser ∅ <ø> (∅)
br 55.5410% <78.2258%> (+4.2402%) ⬆️

Contributor

nkg- commented Dec 6, 2023

If every request to suspend lightning fails immediately, we will keep retrying for about 10 minutes. Calls that get stuck may take longer than that. If the GC stop time is critical, perhaps we can base the retry on the time spent on failed requests instead of on the failure count.

Yeah. In fact, can we make the maximum pause duration (gc/schedulers/import) configurable? We don't want to pause them for more than X (let's say 10 minutes). If we cannot pause all TiKVs within that time, then it's OK to fail the backup. A retry based on a time limit would be ideal, but for the implementation it's OK to use retries with exponential backoff and break after a certain time.

This is a bit outside the scope of this PR, but do we have retries around the EBS snapshot trigger (done within the backup pod)? If the create-snapshot API gets throttled, what is the maximum retry time? Asking because during that time the init pod (and hence the pause) will stay active. OK, taking this discussion offline (on Slack).
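
The time-limited retry suggested in this comment could look roughly like the sketch below. It is an illustration under assumptions, not code from this PR or from BR: `retryWithin` and `demoRetry` are hypothetical names, and the delays are shortened to milliseconds so the example finishes quickly. The total budget is enforced with a context deadline that is also passed to each call, so a call that respects its context cannot outlive the budget.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithin calls op until it succeeds or the time budget runs out. The wait
// between attempts doubles each time, capped at maxDelay. It returns the number
// of attempts made and the final error, if any. (Hypothetical helper.)
func retryWithin(ctx context.Context, budget, initial, maxDelay time.Duration, op func(context.Context) error) (int, error) {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	delay := initial
	for attempt := 1; ; attempt++ {
		err := op(ctx) // op receives the deadline so a stuck call can be bounded too
		if err == nil {
			return attempt, nil
		}
		select {
		case <-ctx.Done():
			return attempt, fmt.Errorf("gave up after %d attempts (last error: %v): %w", attempt, err, ctx.Err())
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}

// demoRetry runs retryWithin against a fake operation that fails `failures`
// times before succeeding (a negative value means it never succeeds).
func demoRetry(failures, budgetMs int) (int, bool) {
	calls := 0
	op := func(context.Context) error {
		calls++
		if failures < 0 || calls <= failures {
			return errors.New("store unreachable")
		}
		return nil
	}
	attempts, err := retryWithin(context.Background(),
		time.Duration(budgetMs)*time.Millisecond, time.Millisecond, 4*time.Millisecond, op)
	return attempts, err == nil
}

func main() {
	attempts, ok := demoRetry(3, 500)
	fmt.Printf("recovered: %v after %d attempts\n", ok, attempts)
	attempts, ok = demoRetry(-1, 20)
	fmt.Printf("recovered: %v after %d attempts\n", ok, attempts)
}
```

In a real deployment the budget would be the configurable maximum pause duration discussed above, and exceeding it would fail the backup.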

@BornChanger
Contributor

/retest


tiprow bot commented Dec 6, 2023

@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest


@BornChanger
Contributor

/retest-required


tiprow bot commented Dec 6, 2023

@BornChanger: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required


Contributor Author

YuJuncen commented Dec 6, 2023

/test check-dev


tiprow bot commented Dec 6, 2023

@YuJuncen: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test check-dev


@YuJuncen
Contributor Author

/retest-required

@ti-chi-bot ti-chi-bot bot merged commit ac71239 into pingcap:master Jan 15, 2024
25 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #50442.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jan 15, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jan 15, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #50443.

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #50444.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jan 15, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request Jan 17, 2024
@ti-chi-bot ti-chi-bot removed the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Feb 5, 2024
ti-chi-bot bot pushed a commit that referenced this pull request Feb 21, 2024
@ti-chi-bot ti-chi-bot removed the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Feb 23, 2024
guoshouyan pushed a commit to guoshouyan/tidb that referenced this pull request Mar 5, 2024
…kup (pingcap#49154) (pingcap#50444) (pingcap#37)

close pingcap#49152, close pingcap#49153

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@BornChanger BornChanger added needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. labels Apr 12, 2024
@ti-chi-bot
Member

In response to a cherrypick label: new pull request could not be created: failed to create pull request against pingcap/tidb#release-7.1 from head ti-chi-bot:cherry-pick-49154-to-release-7.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-49154-to-release-7.1."}],"documentation_url":"https://docs.github.com/rest/pulls/pulls#create-a-pull-request"}

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Apr 12, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Apr 12, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #52568.

@ti-chi-bot ti-chi-bot removed the needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. label Apr 16, 2024
Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
5 participants