tools: add TiDB-Binlog monitoring metrics (#823)

* tools: add TiDB-Binlog monitoring metrics

Via: pingcap/docs-cn#1052

* tools: update wording
* tools: add a download tip about Pump and Drainer
* tools: update wording
pingcap · Dec 24, 2018 · a78f3b1
1 parent 2e648d1

Showing 2 changed files with 70 additions and 27 deletions.
diff --git a/tools/tidb-binlog-cluster.md b/tools/tidb-binlog-cluster.md
@@ -265,19 +265,23 @@ It is recommended to deploy TiDB-Binlog using TiDB-Ansible. If you just want to
     $ ansible-playbook start_drainer.yml
     ```
 
-### Deploy TiDB-Binlog using Binary 
+### Deploy TiDB-Binlog using Binary
 
 #### Download the official Binary
 
+Run the following command to download the binary:
+
 ```bash
-# TiDB（v2.0.8-binlog, v2.1.0-rc.5 or the later version）
 wget https://download.pingcap.org/tidb-{version}-linux-amd64.tar.gz
 wget https://download.pingcap.org/tidb-{version}-linux-amd64.sha256
 
 # Check the file integrity. If the result is OK, the file is correct.
 sha256sum -c tidb-{version}-linux-amd64.sha256
+```
 
-# Pump && Drainer (cluster-latest, v2.1.0-rc.5 or the later version)
+For TiDB v2.1.0 GA or later versions, Pump and Drainer are already included in the TiDB download package. For other TiDB versions, you need to download Pump and Drainer separately using the following command:
+
+```bash
 wget https://download.pingcap.org/tidb-binlog-{version}-linux-amd64.tar.gz
 wget https://download.pingcap.org/tidb-binlog-{version}-linux-amd64.sha256
 

diff --git a/tools/tidb-binlog-monitor.md b/tools/tidb-binlog-monitor.md
@@ -1,78 +1,117 @@
 ---
-title: TiDB-Binlog Monitoring Metrics
-summary: Learn about three levels of monitoring metrics of TiDB-Binlog.
+title: TiDB-Binlog Monitoring Metrics and Alert Rules
+summary: Learn about three levels of monitoring metrics and alert rules of TiDB-Binlog.
 category: tools
 ---
 
-# TiDB-Binlog Monitoring Metrics
+# TiDB-Binlog Monitoring Metrics and Alert Rules
 
-Currently, the monitoring metrics of TiDB-Binlog has three levels:
+This document describes TiDB-Binlog monitoring metrics in Grafana and explains the alert rules.
 
-- Emergency
-- Critical
-- Warning
+## Monitoring metrics
 
-## Emergency
+TiDB-Binlog consists of two components: Pump and Drainer. This section shows the monitoring metrics of Pump and Drainer.
 
-### binlog_pump_storage_error_count
+### Pump monitoring metrics
+
+To understand the Pump monitoring metrics, check the following table:
+
+| Pump monitoring metrics | Description |
+|:---|:---|
+| Storage Size | Records the total disk space (capacity) and the available disk space (available) |
+| Metadata | Records the largest TSO (`gc_tso`) of the binlog that each Pump node can delete, and the largest commit TSO (`max_commit_tso`) of the saved binlog |
+| Write Binlog QPS by Instance | Shows the QPS of write-binlog requests received by each Pump node |
+| Write Binlog Latency | Records the latency of each Pump node writing binlog |
+| Storage Write Binlog Size | Shows the size of the binlog data written by Pump |
+| Storage Write Binlog Latency | Records the latency of the Pump storage module writing binlog |
+| Pump Storage Error By Type | Records the number of errors encountered by Pump, counted by error type |
+| Query TiKV | Records the number of times that Pump queries the transaction status through TiKV |
+
+### Drainer monitoring metrics
+
+To understand the Drainer monitoring metrics, check the following table:
+
+| Drainer monitoring metrics | Description |
+|:---|:---|
+| Checkpoint TSO | Shows the largest TSO time of the binlog that Drainer has already synchronized to the downstream. You can get the lag by subtracting the binlog timestamp from the current time. Note that the timestamp is allocated by PD of the master cluster, so it follows the PD clock rather than the local clock. |
+| Pump Handle TSO | Records the biggest TSO time among the binlog files that Drainer obtains from each Pump node |
+| Pull Binlog QPS by Pump NodeID | Shows the QPS when Drainer obtains binlog from each Pump node |
+| 95% Binlog Reach Duration By Pump | Records the delay from the time when binlog is written into Pump to the time when the binlog is obtained by Drainer |
+| Error By Type | Shows the number of errors encountered by Drainer, counted based on the type of error |
+| Drainer Event | Shows the number of various types of events, including "ddl", "insert", "delete", "update", "flush", and "savepoint" |
+| Execute Time | Records the time it takes to execute the SQL statement in the downstream, or the time it takes to write data into downstream |
+| 95% Binlog Size | Shows the size of the binlog data that Drainer obtains from each Pump node |
+| DDL Job Count | Records the number of DDL statements handled by Drainer |
+
+## Alert rules
+
+Currently, TiDB-Binlog alert rules are divided into the following three levels based on importance:
+
+- [Emergency](#emergency)
+- [Critical](#critical)
+- [Warning](#warning)
+
+### Emergency
+
+#### binlog_pump_storage_error_count
 
 - Description: Pump fails to write the binlog data to the local storage
 - Monitoring rule: `changes(binlog_pump_storage_error_count[1m])` > 0
 - Solution: Check whether an error exists in the `pump_storage_error` monitoring and check the Pump log to find the causes
 
-## Critical
+### Critical
 
-### binlog_drainer_checkpoint_high_delay
+#### binlog_drainer_checkpoint_high_delay
 
 - Description: The delay of Drainer synchronization exceeds one hour
 - Monitoring rule: `(time() - binlog_drainer_checkpoint_tso / 1000)` > 3600
 - Solutions:
 
     - Check whether it is too slow to obtain the data from Pump:
-        
-        You can check `handle tso` of Pump to get the time for the latest message of each Pump. Check whether a high latency exists for Pump and make sure the corresponding Pump is running normally 
-    
+
+        You can check `handle tso` of Pump to get the time for the latest message of each Pump. Check whether a high latency exists for Pump and make sure the corresponding Pump is running normally
+
     - Check whether it is too slow to synchronize data in the downstream based on Drainer `event` and Drainer `execute time`:
-        
+
         - If Drainer `execute time` is too large, check the network bandwidth and latency between the machine with Drainer deployed and the machine with the target database deployed, and the state of the target database
         - If Drainer `execute time` is not too large and Drainer `event` is too small, add `work count` and `batch` and retry
 
     - If the two solutions above cannot work, contact [support@pingcap.com](mailto:support@pingcap.com)
 
-## Warning
+### Warning
 
-### binlog_pump_write_binlog_rpc_duration_seconds_bucket
+#### binlog_pump_write_binlog_rpc_duration_seconds_bucket
 
 - Description: It takes too much time for Pump to handle the TiDB request of writing binlog
 - Monitoring rule: `histogram_quantile(0.9, rate(binlog_pump_rpc_duration_seconds_bucket{method="WriteBinlog"}[5m]))` > 1
-- Solution: 
-    
+- Solution:
+
     - Verify the disk performance pressure and check the disk performance monitoring via `node_exporter`
     - If both `disk latency` and `util` are low, contact [support@pingcap.com](mailto:support@pingcap.com)
 
-### binlog_pump_storage_write_binlog_duration_time_bucket
+#### binlog_pump_storage_write_binlog_duration_time_bucket
 
 - Description: The time it takes for Pump to write the local binlog to the local disk
 - Monitoring rule: `histogram_quantile(0.9, rate(binlog_pump_storage_write_binlog_duration_time_bucket{type="batch"}[5m]))` > 1
 - Solution: Check the state of the local disk of Pump and fix the problem
 
-### binlog_pump_storage_available_size_less_than_20G
+#### binlog_pump_storage_available_size_less_than_20G
 
 - Description: The available disk space of Pump is less than 20G
 - Monitoring rule: `binlog_pump_storage_storage_size_bytes{type="available"}` < 20 * 1024 * 1024 * 1024
 - Solution: Check whether Pump `gc_tso` is normal. If not, adjust the GC time configuration of Pump or get the corresponding Pump offline
 
-### binlog_drainer_checkpoint_tso_no_change_for_1m
+#### binlog_drainer_checkpoint_tso_no_change_for_1m
 
 - Description: Drainer `checkpoint` has not been updated for one minute
 - Monitoring rule: `changes(binlog_drainer_checkpoint_tso[1m])` < 1
 - Solution: Check whether all the Pumps that are not offline are running normally
 
-### binlog_drainer_execute_duration_time_more_than_10s
+#### binlog_drainer_execute_duration_time_more_than_10s
 
 - Description: The transaction time it takes Drainer to synchronize data to TiDB. If it is too large, the Drainer synchronization of data is affected
 - Monitoring rule: `histogram_quantile(0.9, rate(binlog_drainer_execute_duration_time_bucket[1m]))` > 10
 - Solutions:
-    
+
     - Check the TiDB cluster state
     - Check the Drainer log or monitor. If a DDL operation causes this problem, you can ignore it
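
As a rough illustration of two of the alert rules in the diff above, the following Python sketch re-implements the `binlog_drainer_checkpoint_high_delay` and `binlog_pump_storage_available_size_less_than_20G` conditions as plain arithmetic. It is not part of the commit or of TiDB-Binlog. The function names are hypothetical, and it assumes, as the `/ 1000` in the rule suggests, that `binlog_drainer_checkpoint_tso` is exported in milliseconds since the Unix epoch.

```python
import time

# Thresholds copied from the alert rules in the diff above.
CHECKPOINT_DELAY_LIMIT_SECONDS = 3600
PUMP_MIN_AVAILABLE_BYTES = 20 * 1024 * 1024 * 1024


def checkpoint_lag_seconds(checkpoint_ms, now_seconds=None):
    """Mirror `time() - binlog_drainer_checkpoint_tso / 1000`.

    Assumption: checkpoint_ms is milliseconds since the Unix epoch.
    """
    if now_seconds is None:
        now_seconds = time.time()
    return now_seconds - checkpoint_ms / 1000


def checkpoint_delay_fires(checkpoint_ms, now_seconds=None):
    """True when the lag is strictly greater than one hour."""
    lag = checkpoint_lag_seconds(checkpoint_ms, now_seconds)
    return lag > CHECKPOINT_DELAY_LIMIT_SECONDS


def pump_disk_low_fires(available_bytes):
    """True when the available space is strictly below 20 GiB."""
    return available_bytes < PUMP_MIN_AVAILABLE_BYTES
```

For example, `checkpoint_delay_fires(1_000_000, now_seconds=4_700)` evaluates a checkpoint that is 3700 seconds behind the supplied clock. Because the checkpoint timestamp comes from PD, the computed lag also includes any clock skew between PD and the machine that evaluates the rule.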