br: backup checkpoint #11459

Merged
merged 9 commits into from
Dec 1, 2022
Changes from 2 commits
1 change: 1 addition & 0 deletions TOC.md
@@ -479,6 +479,7 @@
- BR Features
- [Auto Tune](/br/br-auto-tune.md)
- [Batch Create Table](/br/br-batch-create-table.md)
- [Checkpoint Backup](/br/checkpoint-backup.md)
- References
- [BR Design Principles](/br/backup-and-restore-design.md)
- [BR Command-line](/br/use-br-command-line-tool.md)
40 changes: 40 additions & 0 deletions br/checkpoint-backup.md
@@ -0,0 +1,40 @@
---
title: Checkpoint Backup
summary: Learn about the checkpoint backup feature, including its application scenarios, usage, and implementation details.
---

# Checkpoint Backup <span class="version-mark">New in v6.5.0</span>

A snapshot backup might be interrupted by recoverable errors, such as disk exhaustion or a node going down. Before TiDB v6.5.0, the data backed up before the interruption was invalidated even after the error was addressed, and you had to start the backup from scratch. For large clusters, this incurs considerable extra cost.

Starting from TiDB v6.5.0, Backup & Restore (BR) introduces the checkpoint backup feature, which allows an interrupted backup to continue. This feature is enabled by default. With it, most of the data backed up before an unexpected exit is retained.

## Application scenarios

If your TiDB cluster is large and you cannot afford to run the whole backup again after a failure, you can use the checkpoint backup feature. With this feature enabled, the br command-line tool (hereinafter referred to as `br`) periodically records the shards that have been backed up. In this way, the next backup retry can resume from a point close to the progress at the abnormal exit.

## Usage limitations

During the backup, `br` periodically updates the `gc-safe-point` of the backup snapshot in PD to prevent the data from being garbage collected. After `br` exits, the `gc-safe-point` can no longer be updated in time. As a result, the data might be garbage collected before the next backup retry.

To avoid this situation, `br` keeps the `gc-safe-point` for about one hour by default when `gcttl` is not specified. If you need to extend this time, you can set the `gcttl` parameter.

The following example sets `gcttl` to 15 hours (54000 seconds) to extend the retention period of `gc-safe-point`:

```shell
br backup full \
--storage local:///br_data/ --pd "${PD_IP}:2379" \
--gcttl 54000
```
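
The `--gcttl` value in the example above corresponds to 15 hours expressed in seconds (54000 = 15 × 3600). As a small convenience, the following sketch derives the value from a number of hours. `hours_to_gcttl` is a hypothetical helper written for illustration only and is not part of `br`.

```shell
# Hypothetical helper (not part of br): convert a retention period
# in hours to the number of seconds to pass to --gcttl.
hours_to_gcttl() {
  echo $(( $1 * 3600 ))
}

# Build the flag for a 15-hour retention period.
echo "--gcttl $(hours_to_gcttl 15)"   # prints "--gcttl 54000"
```

You can then substitute the result into the backup command, for example `--gcttl "$(hours_to_gcttl 15)"`.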

> **Note:**
>
> The `gc-safe-point` created before the backup is deleted automatically after the snapshot backup is completed. You do not need to delete it manually.

## Implementation details

During a snapshot backup, `br` encodes the tables into the corresponding key space, generates backup RPC requests, and sends them to TiKV nodes. After receiving a backup request, a TiKV node backs up the data within the requested range. Every time a TiKV node finishes backing up the data of a Region, it returns the backup information of that range to `br`.

`br` records the information returned by TiKV nodes, so that it knows which key ranges have been backed up. The checkpoint backup feature periodically uploads the new backup information to external storage, which persists the record of the key ranges that have been backed up.

When `br` retries the backup, it reads the key ranges that have been backed up from external storage and compares them with the key ranges of the backup task. The difference between the two tells `br` which data still needs to be backed up.
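
The comparison described above can be pictured as a range subtraction. The following is a minimal sketch of that idea, not the actual `br` implementation. It assumes that keys are modeled as integers and that ranges are half-open `[start, end)`, whereas real keys are byte strings. The `remaining_ranges` function name and the one-range-per-line input format are hypothetical.

```shell
# Hypothetical sketch: subtract already backed-up ranges from a task range.
# Arguments: $1 = task start, $2 = task end (half-open range, integer keys).
# Stdin: completed ranges, one "start end" pair per line.
# Stdout: the ranges that still need to be backed up.
remaining_ranges() {
  sort -n | awk -v s="$1" -v e="$2" '
    {
      if ($2 <= s || $1 >= e) next   # already covered, or outside the task range
      if ($1 > s) print s, $1        # gap before this completed range
      s = $2                         # advance past the completed range
    }
    END { if (s < e) print s, e }    # gap after the last completed range
  '
}

# Task range [0, 100) with three completed ranges read from the checkpoint.
printf '%s\n' '10 20' '20 40' '70 80' | remaining_ranges 0 100
# prints:
# 0 10
# 40 70
# 80 100
```

Sorting the completed ranges first lets a single pass find every gap, which is why the sketch only needs to track one moving start position.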