tiflash: add the maintain.md doc #2189

Merged
merged 13 commits into from
Apr 8, 2020
1 change: 1 addition & 0 deletions TOC.md
@@ -307,6 +307,7 @@
- [Overview](/reference/tiflash/overview.md)
- [Deploy a TiFlash Cluster](/reference/tiflash/deploy.md)
- [Use TiFlash](/reference/tiflash/use-tiflash.md)
- [Maintain a TiFlash Cluster](/reference/tiflash/maintain.md)
+ TiDB Binlog
- [Overview](/reference/tidb-binlog/overview.md)
- [Deploy](/reference/tidb-binlog/deploy.md)
150 changes: 150 additions & 0 deletions reference/tiflash/maintain.md
@@ -0,0 +1,150 @@
---
title: Maintain a TiFlash Cluster
summary: Learn common operations when you maintain a TiFlash cluster.
category: reference
---

# Maintain a TiFlash Cluster

This document describes common operations when you maintain a TiFlash cluster, including checking the version, node logout, troubleshooting, critical logs, and a system table.
Contributor

Suggested change
This document describes common operations when you maintain a TiFlash cluster, including checking the version, node logout, troubleshooting, critical logs, and a system table.
This document describes how to perform common operations when you maintain a TiFlash cluster, including checking the TiFlash version, taking TiFlash nodes down, and troubleshooting TiFlash. This document also introduces critical logs and system tables of TiFlash.


## Check the TiFlash version

There are two ways to check the TiFlash version:

- If the binary file name of TiFlash is `tiflash`, you can check the version by executing the `./tiflash version` command.

To run this command, first add the directory that contains the `libtiflash_proxy.so` dynamic library to the `LD_LIBRARY_PATH` environment variable, because TiFlash depends on this library at run time.

For example, when `tiflash` and `libtiflash_proxy.so` are in the same directory, you can first switch to this directory, and then use the following command to check the TiFlash version:

{{< copyable "shell-regular" >}}

```shell
LD_LIBRARY_PATH=./ ./tiflash version
```

- Check the TiFlash version by referring to the TiFlash log. For the log path, see [[logger] in the tiflash.toml configuration file](/reference/tiflash/configuration.md#configuration-file-tiflashtoml). For example:

```
<information>: TiFlash version: TiFlash 0.2.0 master-375035282451103999f3863c691e2fc2
```
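
As a minimal sketch of the second method, the version string can be extracted from a log file with `sed`. The function name `tiflash_version_from_log` and the log path are illustrative assumptions; the sketch only relies on the `TiFlash version: ` prefix shown in the sample line above.

```shell
# Print whatever follows "TiFlash version: " on matching lines of the log file.
# Assumption: the log line has the shape of the sample shown above.
tiflash_version_from_log() {
    sed -n 's/.*TiFlash version: //p' "$1"
}

# Usage (hypothetical path; use the log path from your tiflash.toml):
# tiflash_version_from_log /path/to/tiflash.log
```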

## Logout a TiFlash node
Contributor

Suggested change
## Logout a TiFlash node
## Take a TiFlash node down

take nodes down is an expression used more commonly in our technical documents and it has more search results in Google. Please update this expression in other places (if any). @toutdesuite


Taking a TiFlash node down differs from [Scaling in the TiFlash node](/reference/tiflash/scale.md#scale-in-tiflash-node) in that it doesn't remove the node from TiDB Ansible; it only shuts down the process safely.
Contributor Author

The reference anchor is not finalized because scale.md is not merged

Contributor

Please take a look at L90. The anchor is decided: #scale-in-a-tiflash-node.


To take a TiFlash node down, perform the following steps:
yikeke marked this conversation as resolved.

> **Note:**
>
> If the number of nodes that remain in the TiFlash cluster after the node is taken down is greater than or equal to the maximum number of replicas of all data tables, go directly to step 3.

1. For a TiDB server, if the number of replicas of tables is greater than or equal to that of the remaining TiFlash nodes in the cluster, execute the following command:
Contributor

Suggested change
1. For a TiDB server, if the number of replicas of tables is greater than or equal to that of the remaining TiFlash nodes in the cluster, execute the following command:
1. If the number of replicas of tables is greater than or equal to that of the remaining TiFlash nodes in the cluster, execute the following command on those tables in the TiDB client:


{{< copyable "sql" >}}

```sql
alter table <db-name>.<table-name> set tiflash replica 0;
```

2. To ensure TiFlash replicas of related tables are removed, see [View the Table Replication Progress](/reference/tiflash/use-tiflash.md#view-the-table-replication-progress). If you cannot view the replication progress of the related tables, it means that the replicas are removed.
Contributor

Suggested change
2. To ensure TiFlash replicas of related tables are removed, see [View the Table Replication Progress](/reference/tiflash/use-tiflash.md#view-the-table-replication-progress). If you cannot view the replication progress of the related tables, it means that the replicas are removed.
2. To ensure that the TiFlash replicas of these tables are removed, see [View the Table Replication Progress](/reference/tiflash/use-tiflash.md#view-the-table-replication-progress). If you cannot view the replication progress of the tables, it means that the replicas are removed.


3. Input the `store` command into [pd-ctl](/reference/tools/pd-control.md) (the binary file in `resources/bin` in the tidb-ansible directory) to view the `store id` of the TiFlash node.
Contributor

Suggested change
3. Input the `store` command into [pd-ctl](/reference/tools/pd-control.md) (the binary file in `resources/bin` in the tidb-ansible directory) to view the `store id` of the TiFlash node.
3. Input the `store` command into [pd-ctl](/reference/tools/pd-control.md) (the binary file is in `resources/bin` of the tidb-ansible directory) to view the `store id` of the TiFlash node.


4. Input `store delete <store_id>` into `pd-ctl`. Here `<store_id>` refers to the `store id` in step 3.

5. When the corresponding `store` of the node disappeared, or when `state_name` is changed to `Tombstone`, shutdown the TiFlash process.
Contributor

Suggested change
5. When the corresponding `store` of the node disappeared, or when `state_name` is changed to `Tombstone`, shutdown the TiFlash process.
5. When the corresponding `store` of the node disappears, or when `state_name` is changed to `Tombstone`, stop the TiFlash process.

In very few cases we use the past tense in our technical documents. Please read the Google Developer Style Guide for reference.
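
Steps 3 to 5 can be sketched as shell helpers. This is only a sketch under assumptions: `PD_CTL` and `PD_ADDR` are hypothetical placeholders, the commands are piped to `pd-ctl` in the same way as the `store` example later in this document, and the `"state_name"` text match assumes the JSON layout that `pd-ctl` prints. The case where the store disappears from the output entirely is not handled.

```shell
# Hypothetical placeholders: adjust to your deployment.
PD_CTL="${PD_CTL:-/path/to/pd-ctl}"
PD_ADDR="${PD_ADDR:-http://127.0.0.1:2379}"

# Step 3: list all stores, then read the TiFlash node's store id from the output.
list_stores() {
    echo "store" | "$PD_CTL" -u "$PD_ADDR"
}

# Step 4: ask PD to delete the store with the given store id.
delete_store() {
    echo "store delete $1" | "$PD_CTL" -u "$PD_ADDR"
}

# Show a single store by its store id.
show_store() {
    echo "store $1" | "$PD_CTL" -u "$PD_ADDR"
}

# Step 5: read store output on stdin and succeed if it reports the Tombstone state.
is_tombstone() {
    grep -q '"state_name": *"Tombstone"'
}

# Usage, with a store id taken from list_stores:
# delete_store 45
# show_store 45 | is_tombstone && echo "safe to stop the TiFlash process"
```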


> **Note:**
>
> If you don't remove the TiFlash replicas of all tables before all TiFlash nodes in a cluster stop running, you need to manually delete the replication rule in PD. Otherwise, the TiFlash node cannot be taken down successfully.
>
> To manually delete the replication rule in PD, send a `DELETE` request to `http://<pd_ip>:<pd_port>/pd/api/v1/config/rule/tiflash/<rule_id>`. `rule_id` refers to the `id` of the `rule` to be deleted.
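
The `DELETE` request in the note above could be sent with `curl`, for example. The helper names and the argument values in the usage line are hypothetical; only the URL layout comes from the note.

```shell
# Build the rule URL from the PD IP, the PD port, and the rule id.
pd_rule_url() {
    printf 'http://%s:%s/pd/api/v1/config/rule/tiflash/%s\n' "$1" "$2" "$3"
}

# Send the DELETE request (requires curl and a reachable PD).
delete_tiflash_rule() {
    curl -X DELETE "$(pd_rule_url "$1" "$2" "$3")"
}

# Usage (hypothetical values):
# delete_tiflash_rule 10.0.0.1 2379 my-rule-id
```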

## TiFlash troubleshooting

This section describes some common questions of TiFlash, the reasons, and the solutions.
Contributor

Suggested change
This section describes some common questions of TiFlash, the reasons, and the solutions.
This section describes some commonly encountered issues when using TiFlash, the reasons, and the solutions.


### TiFlash replica is always in an unusable state
Contributor

Suggested change
### TiFlash replica is always in an unusable state
### TiFlash replica is always unavailable


This is because TiFlash is in the exception status caused by the configuration error or the environment problems. You can take the following steps to identify the problem component:
Contributor

Suggested change
This is because TiFlash is in the exception status caused by the configuration error or the environment problems. You can take the following steps to identify the problem component:
This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component:


1. Check whether PD enables the `Placement Rules` feature (to enable the feature, see step 2 of [Add a TiFlash component in an existing TiDB Cluster](/reference/tiflash/deploy.md#add-a-TiFlash-component-in-an-existing-TiDB-cluster)):
Contributor

@yikeke yikeke Apr 8, 2020

Suggested change
1. Check whether PD enables the `Placement Rules` feature (to enable the feature, see step 2 of [Add a TiFlash component in an existing TiDB Cluster](/reference/tiflash/deploy.md#add-a-TiFlash-component-in-an-existing-TiDB-cluster)):
1. Check whether PD enables the `Placement Rules` feature (to enable the feature, see step 2 of [Add a TiFlash component in an existing TiDB Cluster](/reference/tiflash/deploy.md#add-a-TiFlash-component-in-an-existing-TiDB-cluster)):

This anchor is also wrong. Please read our link style guide carefully! @toutdesuite


{{< copyable "shell-regular" >}}

```shell
echo 'config show replication' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
```

The expected result is `"enable-placement-rules": "true"`.

2. Check whether the TiFlash process in the operation system is working correctly using `UpTime` of the TiFlash-Summary monitor panel.
Contributor

Suggested change
2. Check whether the TiFlash process in the operation system is working correctly using `UpTime` of the TiFlash-Summary monitor panel.
2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel.


3. Check whether the TiFlash proxy status is normal through `pd-ctl`.

{{< copyable "shell-regular" >}}

```shell
echo "store" | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
```

If `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`, it refers to the TiFlash proxy.
Contributor

Suggested change
If `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`, it refers to the TiFlash proxy.
The TiFlash proxy's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash proxy.

Please review my suggestion. The Chinese source says: "The store whose `store.labels` contains `{"key": "engine", "value": "tiflash"}` is the TiFlash proxy." @ilovesoup


4. Check whether `pd buddy` can print the logs correctly (the value of `log` in the [flash.flash_cluster] configuration item of the log path, is by default the `tmp` directory configured by the TiFlash configuration file).
Contributor

Suggested change
4. Check whether `pd buddy` can print the logs correctly (the value of `log` in the [flash.flash_cluster] configuration item of the log path, is by default the `tmp` directory configured by the TiFlash configuration file).
4. Check whether `pd buddy` can correctly print the logs (the log path is the value of `log` in the [flash.flash_cluster] configuration item; the default log path is under the `tmp` directory configured in the TiFlash configuration file).

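The source text is "Check whether pd buddy prints logs normally (the value set by the corresponding configuration item [flash.flash_cluster] log of the log path, by default under the tmp directory configured in the TiFlash configuration file)". Should the complete statement be "Check whether pd buddy prints logs normally (the log path is the value of log in the corresponding configuration item [flash.flash_cluster]; by default, the log path is under the tmp directory configured in the TiFlash configuration file)"? @ilovesoup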


5. Check whether the value of `max-replicas` in PD is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash:

{{< copyable "shell-regular" >}}

```shell
echo 'config show replication' | /path/to/pd-ctl -u http://<pd-ip>:<pd-port>
```

Confirm the value of `max-replicas` in the output.

6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to TiFlash.
Contributor

Suggested change
6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to TiFlash.
6. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the `low-space-ratio` parameter), PD cannot schedule data to this TiFlash node.
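
The two `config show replication` checks in steps 1 and 5 can be scripted as a rough sketch. The helper names are hypothetical, and the text matching assumes that `pd-ctl` prints the two keys as `"enable-placement-rules": "true"` and `"max-replicas": <number>`.

```shell
# Both helpers read the output of `config show replication` on stdin.

# Step 1: succeed if the Placement Rules feature is reported as enabled.
placement_rules_enabled() {
    grep -q '"enable-placement-rules": *"true"'
}

# Step 5: print the numeric value of max-replicas.
max_replicas() {
    sed -n 's/.*"max-replicas": *\([0-9][0-9]*\).*/\1/p'
}

# Usage (hypothetical pd-ctl path and PD address):
# echo 'config show replication' | /path/to/pd-ctl -u http://127.0.0.1:2379 | max_replicas
```

Compare the printed value with the number of TiKV nodes in the cluster by hand, or with a shell test such as `[ "$value" -le "$tikv_nodes" ]`.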


### TiFlash query time is unstable, and error log prints many `Lock Exception` messages

This is because large amounts of data are written to the cluster, which leads to the situation that the TiFlash query encounters a lock and requires query retry.
Contributor

Suggested change
This is because large amounts of data are written to the cluster, which leads to the situation that the TiFlash query encounters a lock and requires query retry.
This is because large amounts of data are written to the cluster, which causes the TiFlash query to encounter a lock and require a retry.


You can set the query timestamp to one second earlier in TiDB (for example, `set @@tidb_snapshot=412881237115666555;`), to reduce the possibility that TiFlash query encounters a lock; thereby mitigating the risk of unstable query time.
Contributor

Suggested change
You can set the query timestamp to one second earlier in TiDB (for example, `set @@tidb_snapshot=412881237115666555;`), to reduce the possibility that TiFlash query encounters a lock; thereby mitigating the risk of unstable query time.
You can set the query timestamp to one second earlier in TiDB (for example, `set @@tidb_snapshot=412881237115666555;`). This makes fewer TiFlash queries encounter a lock and mitigates the risk of unstable query time.
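
To illustrate what "one second earlier" means for a numeric value like the one above, the following sketch subtracts milliseconds from a TSO-style timestamp. It assumes the usual TiDB TSO layout (physical time in milliseconds shifted left by 18 bits, plus a logical counter in the low bits); the function name `tso_minus_ms` is hypothetical.

```shell
# Print the TSO that lies the given number of milliseconds before the input TSO.
# Assumption: TSO = (physical milliseconds << 18) + logical counter.
tso_minus_ms() {
    echo $(( $1 - ($2 << 18) ))
}

# Usage: take a current TSO, move it back by 1000 ms, and use the result
# as the value of tidb_snapshot.
# tso_minus_ms 412881237115666555 1000
```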


### Partial queries return `Region Unavailable`
Contributor

Suggested change
### Partial queries return `Region Unavailable`
### Some queries return the `Region Unavailable` error


If the load pressure in TiFlash is so heavy that TiFlash data replication falls behind. Some queries might return error message `Region Unavailable`.
Contributor

Suggested change
If the load pressure in TiFlash is so heavy that TiFlash data replication falls behind. Some queries might return error message `Region Unavailable`.
If the load pressure on TiFlash is so heavy that TiFlash data replication falls behind, some queries might return the `Region Unavailable` error.


In this case, you can share the pressure by adding TiFlash nodes.
Contributor

Suggested change
In this case, you can share the pressure by adding TiFlash nodes.
In this case, you can balance the load pressure by adding more TiFlash nodes.


### Data file corruption

Take the following steps to handle data file corruption:

1. Refer to [Logout a TiFlash node](/reference/tiflash/maintain.md#logout-a-tiflash-node) to take the corresponding TiFlash node down.
2. Delete the related data of the TiFlash node.
3. Redeploy the TiFlash node in the cluster.

## TiFlash critical logs

| Log Information | Log Description |
|---------------|-------------------|
| [ 23 ] <Information> KVStore: Start to persist [region 47, applied: term 6 index 10] | Data starts to be replicated (the number in the square brackets at the start of the log refers to the thread ID) |
| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | `Handling DAG request` refers to that TiFlash starts to handle a Coprocessor request |
Contributor

Suggested change
| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | `Handling DAG request` refers to that TiFlash starts to handle a Coprocessor request |
| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | Handling DAG request, that is, TiFlash starts to handle a Coprocessor request |

| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | `Handle DAG request done` refers to that TiFlash finishes a Coprocessor request |
Contributor

Suggested change
| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | `Handle DAG request done` refers to that TiFlash finishes a Coprocessor request |
| [ 30 ] <Debug> CoprocessorHandler: grpc::Status DB::CoprocessorHandler::execute() | Handling DAG request done, that is, TiFlash finishes handling a Coprocessor request |


You can find the beginning or the end of a Coprocessor request, and then locate the related logs of the Coprocessor request through the thread ID printed at the start of the log.
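
One possible way to collect the logs of a single Coprocessor request is to filter the log file by the thread ID marker. This is a sketch: the function name `logs_for_thread` and the log path are hypothetical, and it assumes the marker appears as `[ <id> ]`, as in the table above.

```shell
# Print every line of the log file that carries the given thread ID marker,
# for example "[ 30 ]". grep -F treats the marker as a fixed string.
logs_for_thread() {
    grep -F "[ $2 ]" "$1"
}

# Usage (hypothetical path):
# logs_for_thread /path/to/tiflash.log 30
```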

## TiFlash system table

The column names and their descriptions of the `information_schema.tiflash_replica` system table are as follows:

| Column Name | Description |
|---------------|-----------|
| TABLE_SCHEMA | database name |
| TABLE_NAME | table name |
| TABLE_ID | table ID |
| REPLICA_COUNT | number of TiFlash replicas |
| AVAILABLE | available or not (0/1)|
| PROGRESS | replication progress [0.0~1.0] |
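
As an illustration, the table can be queried from the shell through a MySQL-compatible client. The helper name, host, port, and user below are hypothetical, and the helper does no escaping, so pass trusted database and table names only.

```shell
# Build a query that reads the replica status of one table.
tiflash_replica_query() {
    printf "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = '%s' AND TABLE_NAME = '%s';\n" "$1" "$2"
}

# Usage (hypothetical connection settings):
# mysql -h 127.0.0.1 -P 4000 -u root -e "$(tiflash_replica_query test t)"
```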