Add doc for cluster diagnose in dashboard #3273

Merged
merged 24 commits into from May 27, 2020
Changes from 18 commits
Commits
24 commits
a9d16f7
init
crazycs520 May 22, 2020
7cd05af
remove redundant file
crazycs520 May 22, 2020
2001a79
update toc
crazycs520 May 22, 2020
3d7a0c3
add usage and compare report
crazycs520 May 22, 2020
36779b6
Update access.md
breezewish May 25, 2020
f92cfa6
Update access.md
breezewish May 25, 2020
10a4a83
Update usage.md
breezewish May 25, 2020
426a93b
Update report.md
breezewish May 25, 2020
1f68f60
Merge branch 'docs-special-week' into diagnose-report1
crazycs520 May 26, 2020
dcc77c2
Merge branch 'docs-special-week' into diagnose-report1
crazycs520 May 26, 2020
c0eb8b5
address comment
crazycs520 May 26, 2020
61d91d5
Merge branch 'diagnose-report1' of https://github.com/crazycs520/docs…
crazycs520 May 26, 2020
0e47db4
Merge branch 'docs-special-week' into diagnose-report1
crazycs520 May 26, 2020
17ebf95
fix dead link
crazycs520 May 26, 2020
4bab76d
Merge branch 'docs-special-week' into diagnose-report1
breezewish May 26, 2020
503b097
Merge branch 'docs-special-week' into diagnose-report1
breezewish May 27, 2020
f01a9c9
rename file
crazycs520 May 27, 2020
4ee1993
Merge branch 'diagnose-report1' of https://github.com/crazycs520/docs…
crazycs520 May 27, 2020
d7882db
update file name and link
crazycs520 May 27, 2020
342eded
update file name and link
crazycs520 May 27, 2020
2aa9957
Merge branch 'docs-special-week' into diagnose-report1
TomShawn May 27, 2020
e561e67
Update dashboard-diagnostics-report.md
breezewish May 27, 2020
7bb1dc4
Merge branch 'docs-special-week' into diagnose-report1
breezewish May 27, 2020
16d11a1
Merge branch 'docs-special-week' into diagnose-report1
crazycs520 May 27, 2020
61 changes: 61 additions & 0 deletions dashboard/dashboard-diagnose-access.md
@@ -0,0 +1,61 @@
---
title: Cluster Diagnostics Page
category: how-to
---

# Cluster Diagnostics Page

Cluster diagnostics checks the cluster for potential problems within a specified time range and summarizes the diagnostic results, together with cluster-related load monitoring information, into a diagnostic report. The diagnostic report is a web page: once saved from the browser, it can be viewed offline and shared.

> **Note:**
>
> The cluster diagnostics feature requires that the Prometheus monitoring component is deployed in the cluster. See the TiUP or TiDB Ansible deployment documentation for how to deploy the monitoring component. If no monitoring component is deployed, the generated diagnostic report indicates that the generation failed.

## Access the page

You can access the cluster diagnostics page in either of the following two ways:

* After logging in, click **Cluster Diagnose** in the left navigation bar:

![Access the page](/media/dashboard/diagnose/access.png)

* Visit [http://127.0.0.1:2379/dashboard/#/diagnose](http://127.0.0.1:2379/dashboard/#/diagnose) in your browser (replace 127.0.0.1:2379 with the address and port of any actual PD instance).

## Generate a diagnostic report

To diagnose the cluster within a time range and check its load and other information, follow these steps to generate a diagnostic report for that range:

1. Set the start time of the range, for example, 2020-05-21 14:40:00.
2. Set the length of the range, for example, 10 min.
3. Click **Start**.

![Generate a diagnostic report for a single time range](/media/dashboard/diagnose/gen-report.png)

> **Recommendation:**
>
> It is recommended to keep the time range of a report between 1 min and 60 min. Generating a report for a range longer than 1 hour is currently not recommended.

The steps above generate a diagnostic report for the time range from 2020-05-21 14:40:00 to 2020-05-21 14:50:00. After you click **Start**, the following interface is displayed. The **generation progress** bar shows how far report generation has got; when it finishes, click **View Report**.

![Progress of report generation](/media/dashboard/diagnose/gen-process.png)

## Generate a comparison diagnostic report

If the system becomes abnormal at a certain point in time, for example, QPS jitters or latency rises, you can generate a report that compares the abnormal time range with a normal time range. For example:

* Abnormal time range: 2020-05-21 14:40:00 ~ 2020-05-21 14:45:00, during which the system is abnormal.
* Normal time range: 2020-05-21 14:30:00 ~ 2020-05-21 14:35:00, during which the system is normal.

To generate a comparison report for the two time ranges above, follow these steps:

1. Set the start time of the range, that is, the start time of the abnormal period, for example, 2020-05-21 14:40:00.
2. Set the length of the range, usually the duration of the abnormality, for example, 5 min.
3. Enable the comparison with a baseline time range.
4. Set the start time of the baseline, that is, the start time of the normal period to compare against, for example, 2020-05-21 14:30:00.
5. Click **Start**.

![Generate a comparison report](/media/dashboard/diagnose/gen-compare-report.png)

Likewise, after the report is generated, click **View Report**.

In addition, generated diagnostic reports are listed on the diagnostic report home page. You can view previously generated reports there without generating them again.
362 changes: 362 additions & 0 deletions dashboard/dashboard-diagnose-report.md

Large diffs are not rendered by default.

112 changes: 112 additions & 0 deletions dashboard/dashboard-diagnose-usage.md
@@ -0,0 +1,112 @@
---
title: Locate Problems Using the Diagnostic Report
category: how-to
---

# Locate Problems Using the Diagnostic Report

## Examples of comparison diagnostics

The comparison report contains a comparison diagnostics feature, which tries to help the DBA locate problems by comparing the differences in monitoring items between two time ranges. Let's look at a few examples first.

### Diagnose QPS jitter or rising latency caused by large queries or writes

#### Example 1

![QPS graph](/media/dashboard/diagnose/usage1.png)

The graph above comes from a go-ycsb benchmark run. At 2020-03-10 13:24:30, QPS suddenly starts to drop, and it returns to normal 3 minutes later. Why did this happen?

Generate a comparison report for the following two time ranges:

* T1: 2020-03-10 13:21:00 ~ 2020-03-10 13:24:00, the normal time range, also called the reference time range.
* T2: 2020-03-10 13:24:30 ~ 2020-03-10 13:27:30, the abnormal time range in which QPS starts to drop.

Both intervals are 3 minutes long, because the jitter lasted about 3 minutes. The diagnosis compares the average values of some monitoring items between the two ranges, so intervals that are too long make the differences in the averages less obvious and prevent accurate problem location.

After generating the report, check the **Compare Diagnose** report, whose content is as follows:

![Comparison diagnostics result](/media/dashboard/diagnose/usage2.png)

The diagnostic result above shows that there may have been large queries during the diagnosed time range. Each row means the following:

* tidb_qps: QPS decreased by 0.93 times.
* tidb_query_duration: the P999 query latency increased by 1.54 times.
* tidb_cop_duration: the P999 processing latency of Coprocessor requests increased by 2.48 times.
* tidb_kv_write_num: the P999 number of KV writes in TiDB transactions increased by 7.61 times.
* tikv_cop_scan_keys_total_nun: the number of keys/values scanned by the TiKV Coprocessor rose sharply on 3 of the TiKV instances.
* In pd_operator_step_finish_total_count, the number of transfer-leader operations increased by 2.45 times, which means that there was more scheduling in the abnormal time range than in the normal one.
* The report indicates that there might be slow queries and suggests a SQL statement to query the TiDB slow log. The result of executing it in TiDB is as follows:

```sql
SELECT * FROM (SELECT count(*), min(time), sum(query_time) AS sum_query_time, sum(Process_time) AS sum_process_time, sum(Wait_time) AS sum_wait_time, sum(Commit_time), sum(Request_count), sum(process_keys), sum(Write_keys), max(Cop_proc_max), min(query),min(prev_stmt), digest FROM information_schema.CLUSTER_SLOW_QUERY WHERE time >= '2020-03-10 13:24:30' AND time < '2020-03-10 13:27:30' AND Is_internal = false GROUP BY digest) AS t1 WHERE t1.digest NOT IN (SELECT digest FROM information_schema.CLUSTER_SLOW_QUERY WHERE time >= '2020-03-10 13:21:00' AND time < '2020-03-10 13:24:00' GROUP BY digest) ORDER BY t1.sum_query_time DESC limit 10\G
***************************[ 1. row ]***************************
count(*) | 196
min(time) | 2020-03-10 13:24:30.204326
sum_query_time | 46.878509117
sum_process_time | 265.924
sum_wait_time | 8.308
sum(Commit_time) | 0.926820886
sum(Request_count) | 6035
sum(process_keys) | 201453000
sum(Write_keys) | 274500
max(Cop_proc_max) | 0.263
min(query) | delete from test.tcs2 limit 5000;
min(prev_stmt) |
digest | 24bd6d8a9b238086c9b8c3d240ad4ef32f79ce94cf5a468c0b8fe1eb5f8d03df
```

You can see that starting at 13:24:30 there is a large batch-delete write: the statement was executed 196 times, each time deleting 5,000 rows, and took 46.8 seconds in total.
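
If you want to drill into the individual executions of that statement, you can filter the slow log by the digest returned above. The statement below is a minimal sketch, assuming the same abnormal time range and the digest value from the previous result; the `INSTANCE` column is assumed to be available in `information_schema.CLUSTER_SLOW_QUERY`, as in other cluster tables.

```sql
-- Inspect individual executions of the batch-delete statement found above,
-- slowest first. The digest comes from the previous query result.
SELECT instance, time, query_time, process_keys, Write_keys
FROM information_schema.CLUSTER_SLOW_QUERY
WHERE time >= '2020-03-10 13:24:30'
  AND time < '2020-03-10 13:27:30'
  AND digest = '24bd6d8a9b238086c9b8c3d240ad4ef32f79ce94cf5a468c0b8fe1eb5f8d03df'
ORDER BY query_time DESC
LIMIT 10;
```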

#### Example 2

If a large query never finishes, it is not recorded in the slow log. Can the problem still be diagnosed in that case? Let's look at another example.

![QPS graph](/media/dashboard/diagnose/usage3.png)

The graph above also comes from a go-ycsb benchmark run. At 2020-03-08 01:46:30, QPS suddenly starts to drop and does not recover.

Generate a comparison report for the following two time ranges:

* T1: 2020-03-08 01:36:00 ~ 2020-03-08 01:41:00, the normal time range, also called the reference time range.
* T2: 2020-03-08 01:46:30 ~ 2020-03-08 01:51:30, the abnormal time range in which QPS starts to drop.

After generating the report, check the **Compare Diagnose** report:

![Comparison diagnostics result](/media/dashboard/diagnose/usage4.png)

The diagnostic result is similar to that of Example 1, so it is not repeated here. Look at the last row: it indicates that there may be expensive queries and suggests a SQL statement to query the expensive queries recorded in the TiDB log. The result of executing it in TiDB is as follows:

```sql
> SELECT * FROM information_schema.cluster_log WHERE type='tidb' AND time >= '2020-03-08 01:46:30' AND time < '2020-03-08 01:51:30' AND level = 'warn' AND message LIKE '%expensive_query%'\G
TIME | 2020/03/08 01:47:35.846
TYPE | tidb
INSTANCE | 172.16.5.40:4009
LEVEL | WARN
MESSAGE | [expensivequery.go:167] [expensive_query] [cost_time=60.085949605s] [process_time=2.52s] [wait_time=2.52s] [request_count=9] [total_keys=996009] [process_keys=996000] [num_cop_tasks=9] [process_avg_time=0.28s] [process_p90_time=0.344s] [process_max_time=0.344s] [process_max_addr=172.16.5.40:20150] [wait_avg_time=0.000777777s] [wait_p90_time=0.003s] [wait_max_time=0.003s] [wait_max_addr=172.16.5.40:20150] [stats=t_wide:pseudo] [conn_id=19717] [user=root] [database=test] [table_ids="[80,80]"] [txn_start_ts=415132076148785201] [mem_max="23583169 Bytes (22.490662574768066 MB)"] [sql="select count(*) from t_wide as t1 join t_wide as t2 where t1.c0>t2.c1 and t1.c2>0"]
```

The query result above shows that on the TiDB instance 172.16.5.40:4009, at 2020/03/08 01:47:35.846, there is an expensive query that has been running for 60 seconds and has not finished yet. The query is a Cartesian-product join, which was most likely issued by mistake.
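
If you decide that such a runaway statement should be stopped, one option is to terminate its session. A minimal sketch, assuming you connect to the TiDB instance 172.16.5.40:4009 that owns the session and reuse the `conn_id` reported in the expensive_query log entry above:

```sql
-- Terminate the session running the Cartesian-product join.
-- 19717 is the conn_id from the expensive_query log entry; run this on the
-- TiDB instance (172.16.5.40:4009) that the session is connected to.
KILL TIDB 19717;
```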

## Locate problems using the comparison report

The automatic diagnosis may give false positives, and reading the comparison report yourself may help the DBA locate problems faster. Consider the following example.

![QPS graph](/media/dashboard/diagnose/usage5.png)

The graph above also comes from a go-ycsb benchmark run. At 2020-05-22 22:14:00, QPS suddenly starts to drop and recovers after about 3 minutes.

Generate a comparison report for the following two time ranges:

* T1: 2020-05-22 22:11:00 ~ 2020-05-22 22:14:00, the normal time range.
* T2: 2020-05-22 22:14:00 ~ 2020-05-22 22:17:00, the abnormal time range in which QPS starts to drop.

After generating the comparison report, check the **Max diff item** report. It compares the monitoring items of the two time ranges and sorts them by the size of the difference. The result of this table is as follows:

![Comparison result](/media/dashboard/diagnose/usage6.png)

The result above shows that there are many times more Coprocessor requests in T2 than in T1, so you can guess that some large queries appeared in T2, or that the workload in T2 contains more queries.

In fact, the `go-ycsb` benchmark was running during the whole T1 ~ T2 period, and 20 `tpch` queries were additionally run during T2, so the many Coprocessor requests were caused by the large `tpch` queries.

If the execution time of such large queries exceeds the slow log threshold, they are also recorded in the slow log, so you can check the `Slow Queries In Time Range t2` report to see whether there are slow queries. Note, however, that some slow queries that appear only in T2 may also exist in T1; they may simply run more slowly in T2 because of the other workloads in that time range.
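
If you prefer to inspect the slow log directly with SQL rather than in the report, the statement below is a minimal sketch. It assumes the T2 range above and groups by statement digest, reusing the column names from the `information_schema.CLUSTER_SLOW_QUERY` query in Example 1.

```sql
-- List the statements that accumulated the most query time in T2,
-- grouped by digest and ordered by total query time.
SELECT count(*) AS exec_count,
       sum(query_time) AS sum_query_time,
       min(query) AS sample_query,
       digest
FROM information_schema.CLUSTER_SLOW_QUERY
WHERE time >= '2020-05-22 22:14:00'
  AND time < '2020-05-22 22:17:00'
  AND Is_internal = false
GROUP BY digest
ORDER BY sum_query_time DESC
LIMIT 10;
```
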
Binary file added media/dashboard/diagnose/access.png
Binary file added media/dashboard/diagnose/cluster-hardware.png
Binary file added media/dashboard/diagnose/cluster-info.png
Binary file added media/dashboard/diagnose/compare-time.png
Binary file added media/dashboard/diagnose/config-change.png
Binary file added media/dashboard/diagnose/error.png
Binary file added media/dashboard/diagnose/example-table.png
Binary file added media/dashboard/diagnose/gen-compare-report.png
Binary file added media/dashboard/diagnose/gen-process.png
Binary file added media/dashboard/diagnose/gen-report.png
Binary file added media/dashboard/diagnose/goroutines-count.png
Binary file added media/dashboard/diagnose/max-diff-item.png
Binary file added media/dashboard/diagnose/node-load-info.png
Binary file added media/dashboard/diagnose/process-cpu-usage.png
Binary file added media/dashboard/diagnose/process-memory-usage.png
Binary file added media/dashboard/diagnose/report-time-range.png
Binary file added media/dashboard/diagnose/thread-cpu-usage.png
Binary file added media/dashboard/diagnose/tidb-ddl.png
Binary file added media/dashboard/diagnose/tidb-txn.png
Binary file added media/dashboard/diagnose/time-relation.png
Binary file added media/dashboard/diagnose/total-time-consume.png
Binary file added media/dashboard/diagnose/usage1.png
Binary file added media/dashboard/diagnose/usage2.png
Binary file added media/dashboard/diagnose/usage3.png
Binary file added media/dashboard/diagnose/usage4.png
Binary file added media/dashboard/diagnose/usage5.png
Binary file added media/dashboard/diagnose/usage6.png