tools: add description for monitor #1052

Merged
merged 9 commits into from
Dec 24, 2018
100 changes: 89 additions & 11 deletions tools/tidb-binlog-monitor.md
@@ -3,21 +3,99 @@ title: TiDB-Binlog Monitoring Metrics
category: tools
---

# TiDB-Binlog Monitoring Metrics
# TiDB-Binlog Monitoring Metrics and Alert Rules

This document describes the TiDB-Binlog monitoring metrics shown in Grafana and the associated alert rules.

## Monitoring metrics

### Pump

#### Storage Size

- Records the total disk space (capacity) and the available disk space (available).

#### Metadata

- Records, for each Pump, the largest TSO of the binlogs that can be deleted (gc_tso) and the largest commit TSO among the saved binlogs (max_commit_tso).

#### Write Binlog QPS by Instance

- QPS of the write-binlog requests received by each Pump.

#### Write Binlog Latency

- Records the latency of each Pump when writing binlogs.

#### Storage Write Binlog Size

- Size of the binlog data written by Pump.

#### Storage Write Binlog Latency

- Latency of Pump writing binlog data.

#### Pump Storage Error By Type

- Number of errors encountered by Pump, counted by error type.

#### Query TiKV

- Number of times Pump queries TiKV for transaction status.

### Drainer

#### Checkpoint TSO

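- The time corresponding to the TSO of the latest binlog that Drainer has replicated to the downstream. This metric can be used to estimate the replication delay.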
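
To illustrate what "the time corresponding to a TSO" means, here is a minimal sketch. It assumes the usual PD timestamp layout (physical time in milliseconds in the high bits, an 18-bit logical counter in the low bits), which this document does not state. `make_tso` is a hypothetical helper that exists only to build an example value.

```python
from datetime import datetime, timezone

# Assumption: the 18 low bits of a TSO hold the logical counter and the
# remaining high bits hold the physical time in milliseconds.
LOGICAL_BITS = 18


def tso_to_datetime(tso: int) -> datetime:
    """Return the UTC time encoded in the physical part of a TSO."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


def make_tso(physical_ms: int, logical: int = 0) -> int:
    """Compose a TSO from its parts (hypothetical helper for the example)."""
    return (physical_ms << LOGICAL_BITS) | logical


if __name__ == "__main__":
    example = make_tso(1545609600000, logical=7)
    print(tso_to_datetime(example).isoformat())
```

Judging from the `/ 1000` in the checkpoint alert rule, the exported checkpoint metric appears to be in milliseconds already, so this conversion would apply only to raw TSO values.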

#### Pump Handle TSO

- Records the time corresponding to the largest TSO of the binlogs that Drainer has obtained from each Pump.

#### Pull Binlog QPS by Pump NodeID

- QPS of Drainer pulling binlogs from each Pump.

#### 95% Binlog Reach Duration By Pump

- Records the delay between a binlog being written to Pump and being obtained by Drainer.

#### Error By Type

- Number of errors encountered by Drainer, counted by error type.

#### Drainer Event

- Number of events of each type. Event types include ddl, insert, delete, update, flush, and savepoint.

#### Execute Time

- Time taken to execute SQL on the downstream or to write data to the downstream.

#### 95% Binlog Size

- Size of the binlog data that Drainer obtains from each Pump.

#### DDL Job Count

- Number of DDL statements processed by Drainer.

## Alert rules

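Alerts are currently configured for the more important aspects of TiDB-Binlog. Depending on the importance of the metric, they are divided into three levels: Emergency, Critical, and Warning.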

## Emergency
### Emergency

### binlog_pump_storage_error_count
#### binlog_pump_storage_error_count

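- Meaning: Pump fails to write binlogs to local storage
- Alert rule: changes(binlog_pump_storage_error_count[1m]) > 0
- Handling: first check whether the pump_storage_error monitoring panel shows errors, then check the Pump log to find the cause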
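
The rule above fires when the error counter changes at least once within the one-minute window. A rough Python sketch of that idea follows; it mimics the spirit of PromQL's `changes()` on a plain list of samples and is not how Prometheus itself is implemented. The function names and sample values are illustrative.

```python
from typing import Sequence


def changes(samples: Sequence[float]) -> int:
    """Count how many times the value differs between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def storage_error_alert(samples: Sequence[float]) -> bool:
    """True when the counter changed at least once in the window."""
    return changes(samples) > 0


if __name__ == "__main__":
    # Hypothetical samples of binlog_pump_storage_error_count over one minute.
    print(storage_error_alert([3, 3, 3, 3]))  # counter is flat
    print(storage_error_alert([3, 3, 4, 4]))  # one new error was recorded
```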

## Critical
### Critical

### binlog_drainer_checkpoint_high_delay
#### binlog_drainer_checkpoint_high_delay

- Meaning: Drainer replication lags by more than 1 hour
- Alert rule: (time() - binlog_drainer_checkpoint_tso / 1000) > 3600
@@ -34,9 +112,9 @@ category: tools

- If none of the above applies, or nothing improves after these steps, report the issue to the developers at support@pingcap.com
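
The arithmetic behind the rule `(time() - binlog_drainer_checkpoint_tso / 1000) > 3600` can be sketched as follows. The sketch assumes, based on the `/ 1000` in the rule, that the metric value is a Unix time in milliseconds. The function names are illustrative and not part of TiDB-Binlog.

```python
import time
from typing import Optional

THRESHOLD_SECONDS = 3600  # one hour, as in the alert rule


def checkpoint_delay_seconds(checkpoint_ms: float, now: Optional[float] = None) -> float:
    """Seconds between the checkpoint time and now (Unix seconds)."""
    if now is None:
        now = time.time()
    return now - checkpoint_ms / 1000


def is_high_delay(checkpoint_ms: float, now: Optional[float] = None) -> bool:
    """True when the checkpoint is more than one hour behind."""
    return checkpoint_delay_seconds(checkpoint_ms, now) > THRESHOLD_SECONDS


if __name__ == "__main__":
    now = 1545613200.0  # fixed "current time" so the example is reproducible
    print(checkpoint_delay_seconds(1545609600000, now))  # 3600.0, exactly one hour
    print(is_high_delay(1545609600000, now))  # False, the rule uses a strict ">"
```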

## Warning
### Warning

### binlog_pump_write_binlog_rpc_duration_seconds_bucket
#### binlog_pump_write_binlog_rpc_duration_seconds_bucket

- Meaning: Pump takes too long to handle write-binlog requests from TiDB
- Alert rule: histogram_quantile(0.9, rate(binlog_pump_rpc_duration_seconds_bucket{method="WriteBinlog"}[5m])) > 1
@@ -45,25 +123,25 @@ category: tools
- Check the disk performance pressure through the disk performance monitoring of node exporter
- If both disk latency and util are low, report the issue to R&D at support@pingcap.com
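
The latency rules in this document rely on `histogram_quantile(0.9, ...)` over cumulative buckets. The following is a simplified sketch of the linear interpolation that a quantile estimate over cumulative buckets performs, under the assumption that the lowest bucket starts at 0. It is an approximation for illustration, not Prometheus's actual code, and the bucket values are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls into the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    # Hypothetical per-bucket rates for a write-binlog duration histogram.
    rates = [(0.5, 50.0), (1.0, 70.0), (2.0, 95.0), (math.inf, 100.0)]
    p90 = histogram_quantile(0.9, rates)
    print(p90, p90 > 1)  # the alert condition compares the estimate with 1 second
```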

### binlog_pump_storage_write_binlog_duration_time_bucket
#### binlog_pump_storage_write_binlog_duration_time_bucket

- Meaning: Pump takes too long to write binlogs to the local disk
- Alert rule: histogram_quantile(0.9, rate(binlog_pump_storage_write_binlog_duration_time_bucket{type="batch"}[5m])) > 1
- Handling: check the state of Pump's local disk and repair it

### binlog_pump_storage_available_size_less_than_20G
#### binlog_pump_storage_available_size_less_than_20G

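- Meaning: Pump has less than 20G of available disk space
- Alert rule: binlog_pump_storage_storage_size_bytes{type="available"} < 20 * 1024 * 1024 * 1024
- Handling: check in the monitoring that Pump gc_tso is normal; if necessary, adjust the Pump GC time configuration or take the corresponding Pump offline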
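
When checking a rule such as this one by hand, it can help to evaluate its expression directly against Prometheus. The sketch below only builds the URL for an instant query through the Prometheus HTTP API (`/api/v1/query`) and sends no request. The server address is a placeholder assumption and should be replaced with the address of your own Prometheus instance.

```python
from urllib.parse import urlencode

# Assumption: placeholder address, not taken from this document.
PROMETHEUS_URL = "http://127.0.0.1:9090"

# Expression copied from the alert rule for low available disk space.
LOW_SPACE_EXPR = (
    'binlog_pump_storage_storage_size_bytes{type="available"} '
    "< 20 * 1024 * 1024 * 1024"
)


def instant_query_url(base_url: str, expr: str) -> str:
    """Build the URL of an instant query for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    print(instant_query_url(PROMETHEUS_URL, LOW_SPACE_EXPR))
```

Any Pump instance returned by the query is currently below the 20G threshold; an empty result means no instance matches.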

### binlog_drainer_checkpoint_tso_no_change_for_1m
#### binlog_drainer_checkpoint_tso_no_change_for_1m

- Meaning: the Drainer checkpoint has not been updated for one minute
- Alert rule: changes(binlog_drainer_checkpoint_tso[1m]) < 1
- Handling: confirm that all Pumps that are not offline are running normally

### binlog_drainer_execute_duration_time_more_than_10s
#### binlog_drainer_execute_duration_time_more_than_10s

- Meaning: the time Drainer takes to replicate a transaction to TiDB; if it is too long, Drainer replication is affected
- Alert rule: histogram_quantile(0.9, rate(binlog_drainer_execute_duration_time_bucket[1m])) > 10