tiflash: add the alert-rules.md (#2191) (#2243)
sre-bot authored Apr 10, 2020
1 parent ec93867 commit 5007744
Showing 2 changed files with 70 additions and 0 deletions.
1 change: 1 addition & 0 deletions TOC.md
@@ -304,6 +304,7 @@
- [Scale TiFlash](/reference/tiflash/scale.md)
- [Upgrade TiFlash Nodes](/reference/tiflash/upgrade.md)
- [Configure TiFlash](/reference/tiflash/configuration.md)
- [TiFlash Alert Rules](/reference/tiflash/alert-rules.md)
- [Tune TiFlash Performance](/reference/tiflash/tune-performance.md)
- [FAQ](/reference/tiflash/faq.md)
+ TiDB Binlog
69 changes: 69 additions & 0 deletions reference/tiflash/alert-rules.md
@@ -0,0 +1,69 @@
---
title: TiFlash Alert Rules
summary: Learn the alert rules of the TiFlash cluster.
category: reference
---

# TiFlash Alert Rules

This document introduces the alert rules of the TiFlash cluster.

## `TiFlash_schema_error`

- Alert rule:

`increase(tiflash_schema_apply_count{type="failed"}[15m]) > 0`

- Description:

An alert is triggered when at least one schema apply error has occurred within the last 15 minutes.

- Solution:

The error might be caused by a logic error in TiFlash. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
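
As an illustration of what the `increase(...[15m]) > 0` condition measures, the following is a minimal Python sketch of a counter increase over a window of samples. It is a simplification and not the Prometheus implementation: Prometheus also extrapolates the result to the window boundaries, which this sketch ignores. The function name and sample values are hypothetical.

```python
def counter_increase(samples):
    """Return the total growth of a counter over time-ordered samples.

    Simplified: no extrapolation to the window boundaries.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # A drop means the counter was reset (for example, a process
            # restart), so the new value is counted from zero.
            total += cur
    return total


# Hypothetical samples of tiflash_schema_apply_count{type="failed"}.
print(counter_increase([7, 7, 7]) > 0)        # no new failures in the window
print(counter_increase([3, 3, 5, 1, 2]) > 0)  # failures occurred, with a reset
```

Any positive increase within the window satisfies the rule, so a single failed schema apply is enough to trigger the alert.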

## `TiFlash_schema_apply_duration`

- Alert rule:

`histogram_quantile(0.99, sum(rate(tiflash_schema_apply_duration_seconds_bucket[1m])) BY (le, instance)) > 20`

- Description:

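An alert is triggered when the 99th percentile of the schema apply duration on an instance exceeds 20 seconds, that is, when more than 1% of recent schema apply operations took longer than 20 seconds.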

- Solution:

It might be caused by an internal problem in the TiFlash TMT engine. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
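
The `histogram_quantile(0.99, ...)` expression in this rule estimates the 99th percentile from cumulative histogram buckets. The following is a minimal Python sketch of that estimation (linear interpolation inside the bucket that contains the target rank). It is a simplified illustration rather than the Prometheus source code, and the bucket boundaries and counts are hypothetical, not the actual TiFlash bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including
    a final bucket whose upper bound is +Inf.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest
        # finite upper bound instead.
        return buckets[-2][0]
    lower, prev_count = (0.0, 0.0) if i == 0 else buckets[i - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


# Hypothetical per-instance bucket rates: 2% of applies took over 20 s.
slow = [(1, 90), (5, 95), (20, 98), (60, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, slow)
print(p99, p99 > 20)
```

In this example, 2 out of 100 observations fall above the 20-second bucket, so the estimated 99th percentile lies between 20 and 60 seconds and the alert condition holds.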

## `TiFlash_raft_read_index_duration`

- Alert rule:

`histogram_quantile(0.99, sum(rate(tiflash_raft_read_index_duration_seconds_bucket[1m])) BY (le, instance)) > 3`

- Description:

An alert is triggered when the 99th percentile of the read index duration on an instance exceeds 3 seconds, that is, when more than 1% of recent read index requests took longer than 3 seconds.

> **Note:**
>
> `read index` is a kvproto request sent to the TiKV leader. Region retries in TiKV, a busy store, or network problems might cause `read index` requests to take a long time.

- Solution:

Frequent retries might be caused by frequent Region splitting or migration in the TiKV cluster. Check the TiKV cluster status to identify the cause of the retries.

## `TiFlash_raft_wait_index_duration`

- Alert rule:

`histogram_quantile(0.99, sum(rate(tiflash_raft_wait_index_duration_seconds_bucket[1m])) BY (le, instance)) > 2`

- Description:

An alert is triggered when the 99th percentile of the waiting time for the Region Raft index in TiFlash exceeds 2 seconds on an instance, that is, when more than 1% of recent waits took longer than 2 seconds.

- Solution:

It might be caused by a communication error between TiKV and the proxy. Contact [TiFlash R&D](mailto:support@pingcap.com) for support.
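
When investigating one of these alerts, it can help to evaluate the rule expression manually. The following is a minimal Python sketch that sends an expression to the Prometheus HTTP API (`/api/v1/query`) and lists the instances that currently satisfy it. The Prometheus address `http://127.0.0.1:9090` is an assumption and must be replaced with the address of your own monitoring server. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def firing_instances(response):
    """Map each instance in an instant-query response to its value."""
    if response.get("status") != "success":
        raise RuntimeError("query failed: %r" % response.get("error"))
    return {
        sample["metric"].get("instance", "<unknown>"): float(sample["value"][1])
        for sample in response["data"]["result"]
    }


def evaluate(base_url, expr):
    """Run the expression; only series that satisfy it are returned."""
    with urllib.request.urlopen(build_query_url(base_url, expr), timeout=10) as resp:
        return firing_instances(json.load(resp))


if __name__ == "__main__":
    rule = (
        "histogram_quantile(0.99, sum(rate("
        "tiflash_raft_wait_index_duration_seconds_bucket[1m])) "
        "BY (le, instance)) > 2"
    )
    # Assumed address; replace with your Prometheus server.
    print(evaluate("http://127.0.0.1:9090", rule))
```

An empty result means that no instance currently meets the alert condition.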
