executor: add diagnosis rule to detect cluster critical errors #14743

Merged
merged 5 commits into from
Feb 13, 2020

Conversation

lonng
Contributor

@lonng lonng commented Feb 12, 2020

What problem does this PR solve?

This PR adds a new diagnosis rule that detects whether critical errors have occurred in the cluster. The current implementation checks the following metrics tables:

  • tidb_failed_query_opm
  • tikv_critical_error
  • tidb_panic_count
  • tidb_binlog_error_count
  • pd_cmd_fail_ops
  • tidb_kv_region_error_ops
  • tidb_lock_resolver_ops
  • tikv_scheduler_is_busy
  • tikv_coprocessor_is_busy
  • tikv_channel_full_total
  • tikv_coprocessor_request_error
  • tidb_schema_lease_error_opm
  • tidb_transaction_retry_error_ops
  • tikv_grpc_errors

What is changed and how it works?

The rule queries each of the metrics tables listed above and checks whether any errors were recorded in the past.
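
As a rough illustration of this kind of check (not the code in this PR), the sketch below sums samples per metrics table and instance and reports every pair whose total is above zero. The `metricSample` and `finding` types, the `detectCriticalErrors` function, the instance names, and the "total greater than zero" threshold are all assumptions made for the example; the real rule reads the metrics tables through the executor.

```go
package main

import (
	"fmt"
	"sort"
)

// metricSample is a hypothetical, simplified row from a metrics table:
// the table it came from, the instance that reported it, and the value.
type metricSample struct {
	Table    string
	Instance string
	Value    float64
}

// finding describes one (table, instance) pair whose summed value is above zero.
type finding struct {
	Table    string
	Instance string
	Total    float64
}

// detectCriticalErrors sums the samples per (table, instance) pair and
// returns a finding for every pair whose total is greater than zero.
// Samples from tables not named in watched are ignored.
func detectCriticalErrors(watched []string, samples []metricSample) []finding {
	isWatched := make(map[string]bool, len(watched))
	for _, t := range watched {
		isWatched[t] = true
	}

	type key struct{ table, instance string }
	totals := make(map[key]float64)
	for _, s := range samples {
		if isWatched[s.Table] {
			totals[key{s.Table, s.Instance}] += s.Value
		}
	}

	var out []finding
	for k, total := range totals {
		if total > 0 {
			out = append(out, finding{k.table, k.instance, total})
		}
	}
	// Sort for deterministic output, since map iteration order is random.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Table != out[j].Table {
			return out[i].Table < out[j].Table
		}
		return out[i].Instance < out[j].Instance
	})
	return out
}

func main() {
	watched := []string{"tidb_panic_count", "tikv_critical_error", "pd_cmd_fail_ops"}
	samples := []metricSample{
		{"tidb_panic_count", "tidb-0", 0},
		{"tikv_critical_error", "tikv-1", 2},
		{"tikv_critical_error", "tikv-1", 1},
		{"tikv_grpc_errors", "tikv-2", 5}, // not in watched, so ignored
	}
	for _, f := range detectCriticalErrors(watched, samples) {
		fmt.Printf("%s on %s: total %.0f\n", f.Table, f.Instance, f.Total)
	}
}
```

Grouping by instance as well as by table is a design choice in this sketch, so that a finding points at the component that reported the errors rather than at the cluster as a whole.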

Check List

Tests

  • Unit test

Release note

  • Add the diagnosis rule critical-error, which detects critical errors in the cluster

@lonng lonng added this to the v4.0.0-beta.1 milestone Feb 12, 2020
@lonng lonng removed the status/WIP label Feb 12, 2020
Signed-off-by: Lonng <heng@lonng.org>
Signed-off-by: Lonng <heng@lonng.org>
Contributor

@crazycs520 crazycs520 left a comment

LGTM

@lonng lonng requested a review from Deardrops February 13, 2020 06:47
Contributor

@Deardrops Deardrops left a comment

LGTM

},
}

for _, cas := range cases {
Contributor

Suggested change
- for _, cas := range cases {
+ for _, case := range cases {

Contributor Author

case is a keyword.
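
To illustrate the point, here is a small table-driven loop in the same style. The `testCase` type and the `double` and `runCases` functions are hypothetical stand-ins, not code from this PR. Because `case` is a reserved word in Go, it cannot be used as a variable name, so the loop variable is spelled `cas`.

```go
package main

import "fmt"

// testCase is a hypothetical table-driven test entry.
type testCase struct {
	input int
	want  int
}

// double is a stand-in for the function under test.
func double(n int) int { return n * 2 }

// runCases returns the number of cases whose result differs from want.
func runCases(cases []testCase) int {
	failures := 0
	// Writing `for _, case := range cases` here would be a syntax error,
	// because case is reserved for switch and select statements.
	for _, cas := range cases {
		if got := double(cas.input); got != cas.want {
			failures++
		}
	}
	return failures
}

func main() {
	cases := []testCase{{1, 2}, {3, 6}, {0, 0}}
	fmt.Println("failures:", runCases(cases))
}
```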

@lonng
Contributor Author

lonng commented Feb 13, 2020

/merge

@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Feb 13, 2020
@sre-bot
Contributor

sre-bot commented Feb 13, 2020

/run-all-tests

@sre-bot sre-bot merged commit 2f926df into pingcap:master Feb 13, 2020
@lonng lonng deleted the diag-errors branch February 14, 2020 03:19
Labels
sig/execution SIG execution status/can-merge Indicates a PR has been approved by a committer.
4 participants