Alerts: migrate RequestErrors and RulerRemoteEvaluationFailing to native histograms (#9004)

* Alerts: enrich RequestErrors and RulerRemoteEvaluationFailing with a native histogram version

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Making lint happy

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Update operations/mimir-mixin/alerts/alerts.libsonnet

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>

* Making lint happy

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

---------

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
duricanikolic and krajorama authored Aug 15, 2024
1 parent 1543093 commit 3512a1d
Showing 6 changed files with 200 additions and 67 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -113,6 +113,7 @@
  * [ENHANCEMENT] Dashboards: remove "All" option for namespace dropdown in dashboards. #8829
  * [ENHANCEMENT] Dashboards: add Kafka end-to-end latency outliers panel in the "Mimir / Writes" dashboard. #8948
  * [ENHANCEMENT] Dashboards: add "Out-of-order samples appended" panel to "Mimir / Tenants" dashboard. #8939
+ * [ENHANCEMENT] Alerts: `RequestErrors` and `RulerRemoteEvaluationFailing` have been enriched with a native histogram version. #9004
  * [BUGFIX] Dashboards: fix "current replicas" in autoscaling panels when HPA is not active. #8566
  * [BUGFIX] Alerts: do not fire `MimirRingMembersMismatch` during the migration to experimental ingest storage. #8727

1 change: 1 addition & 0 deletions operations/helm/charts/mimir-distributed/CHANGELOG.md
@@ -45,6 +45,7 @@ Entries should include a reference to the Pull Request that introduced the chang
  * [ENHANCEMENT] Add support for setting namespace for dashboard config maps. #8813
  * [ENHANCEMENT] Add support for string `extraObjects` for better support with templating. #8825
  * [ENHANCEMENT] Helm : allow setting a read and write urls to continous-test. #7674
+ * [ENHANCEMENT] Alerts: `RequestErrors` and `RulerRemoteEvaluationFailing` have been enriched with a native histogram version. #9004
  * [BUGFIX] Add missing container security context to run `continuous-test` under the restricted security policy. #8653
  * [BUGFIX] Add `global.extraVolumeMounts` to the exporter container on memcached statefulsets #8787
  * [BUGFIX] Fix helm releases failing when `querier.kedaAutoscaling.predictiveScalingEnabled=true`. #8731
@@ -32,12 +32,31 @@ spec:
  # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
  # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
  (
- sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
+ sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m]))
  /
  sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
  ) * 100 > 1
  for: 15m
  labels:
+ histogram: classic
  severity: critical
+ - alert: MimirRequestErrors
+ annotations:
+ message: |
+ The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
+ expr: |
+ # The following 5xx errors considered as non-error:
+ # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
+ # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
+ (
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m])))
+ /
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{route!~"ready|debug_pprof"}[1m])))
+ ) * 100 > 1
+ for: 15m
+ labels:
+ histogram: native
+ severity: critical
  - alert: MimirRequestLatency
  annotations:
@@ -485,13 +504,29 @@ spec:
  Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
  runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
  expr: |
- 100 * (
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", status_code=~"5..", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ (
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
  /
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
- ) > 1
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ ) * 100 > 1
  for: 5m
  labels:
+ histogram: classic
  severity: warning
+ - alert: MimirRulerRemoteEvaluationFailing
+ annotations:
+ message: |
+ Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
+ expr: |
+ (
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ /
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ ) * 100 > 1
+ for: 5m
+ labels:
+ histogram: native
+ severity: warning
  - name: gossip_alerts
  rules:
45 changes: 40 additions & 5 deletions operations/mimir-mixin-compiled-baremetal/alerts.yaml
@@ -20,12 +20,31 @@ groups:
  # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
  # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
  (
- sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
+ sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m]))
  /
  sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
  ) * 100 > 1
  for: 15m
  labels:
+ histogram: classic
  severity: critical
+ - alert: MimirRequestErrors
+ annotations:
+ message: |
+ The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
+ expr: |
+ # The following 5xx errors considered as non-error:
+ # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
+ # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
+ (
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m])))
+ /
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{route!~"ready|debug_pprof"}[1m])))
+ ) * 100 > 1
+ for: 15m
+ labels:
+ histogram: native
+ severity: critical
  - alert: MimirRequestLatency
  annotations:
@@ -463,13 +482,29 @@ groups:
  Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
  runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
  expr: |
- 100 * (
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", status_code=~"5..", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ (
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
  /
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
- ) > 1
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ ) * 100 > 1
  for: 5m
  labels:
+ histogram: classic
  severity: warning
+ - alert: MimirRulerRemoteEvaluationFailing
+ annotations:
+ message: |
+ Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
+ expr: |
+ (
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ /
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ ) * 100 > 1
+ for: 5m
+ labels:
+ histogram: native
+ severity: warning
  - name: gossip_alerts
  rules:
45 changes: 40 additions & 5 deletions operations/mimir-mixin-compiled/alerts.yaml
@@ -20,12 +20,31 @@ groups:
  # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
  # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
  (
- sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..",status_code!~"529|598",route!~"ready|debug_pprof"}[1m]))
+ sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m]))
  /
  sum by (cluster, namespace, job, route) (rate(cortex_request_duration_seconds_count{route!~"ready|debug_pprof"}[1m]))
  ) * 100 > 1
  for: 15m
  labels:
+ histogram: classic
  severity: critical
+ - alert: MimirRequestErrors
+ annotations:
+ message: |
+ The route {{ $labels.route }} in {{ $labels.cluster }}/{{ $labels.namespace }} is experiencing {{ printf "%.2f" $value }}% errors.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrequesterrors
+ expr: |
+ # The following 5xx errors considered as non-error:
+ # - 529: used by distributor rate limiting (using 529 instead of 429 to let the client retry)
+ # - 598: used by GEM gateway when the client is very slow to send the request and the gateway times out reading the request body
+ (
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", status_code!~"529|598", route!~"ready|debug_pprof"}[1m])))
+ /
+ sum by (cluster, namespace, job, route) (histogram_count(rate(cortex_request_duration_seconds{route!~"ready|debug_pprof"}[1m])))
+ ) * 100 > 1
+ for: 15m
+ labels:
+ histogram: native
+ severity: critical
  - alert: MimirRequestLatency
  annotations:
@@ -473,13 +492,29 @@ groups:
  Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
  runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
  expr: |
- 100 * (
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", status_code=~"5..", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ (
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
  /
- sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
- ) > 1
+ sum by (cluster, namespace) (rate(cortex_request_duration_seconds_count{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m]))
+ ) * 100 > 1
  for: 5m
  labels:
+ histogram: classic
  severity: warning
+ - alert: MimirRulerRemoteEvaluationFailing
+ annotations:
+ message: |
+ Mimir rulers in {{ $labels.cluster }}/{{ $labels.namespace }} are failing to perform {{ printf "%.2f" $value }}% of remote evaluations through the ruler-query-frontend.
+ runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirrulerremoteevaluationfailing
+ expr: |
+ (
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{status_code=~"5..", route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ /
+ sum by (cluster, namespace) (histogram_count(rate(cortex_request_duration_seconds{route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"}[5m])))
+ ) * 100 > 1
+ for: 5m
+ labels:
+ histogram: native
+ severity: warning
  - name: gossip_alerts
  rules:
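
Note on the pattern: in every compiled file above, each alert now exists twice. The `histogram: classic` variant takes `rate()` of the `_count` series, while the `histogram: native` variant takes `rate()` of the histogram itself and wraps it in `histogram_count()`. The following is only a rough Python sketch of that pattern, not the mixin's actual code (the commit message mentions `operations/mimir-mixin/alerts/alerts.libsonnet`, whose diff is not shown here). The function `error_ratio_expr` and all of its parameters are hypothetical names introduced for illustration.

```python
def error_ratio_expr(metric, error_selector, total_selector, by, window, native):
    """Build a PromQL error-percentage alert expression.

    Hypothetical helper for illustration only; it mirrors the shape of the
    expressions in the diff, not any function that exists in Mimir.
    """

    def count_rate(selector):
        if native:
            # Native histogram: rate over the histogram series, then extract its count.
            return f"histogram_count(rate({metric}{{{selector}}}[{window}]))"
        # Classic histogram: the observation count is a separate `_count` series.
        return f"rate({metric}_count{{{selector}}}[{window}])"

    errors = f"sum by ({by}) ({count_rate(error_selector)})"
    total = f"sum by ({by}) ({count_rate(total_selector)})"
    # Same threshold as the alerts in the diff: fire above 1% errors.
    return f"(\n  {errors}\n  /\n  {total}\n) * 100 > 1"


if __name__ == "__main__":
    route = 'route="/httpgrpc.HTTP/Handle", job=~".*/(ruler-query-frontend.*)"'
    for native in (False, True):
        print("histogram:", "native" if native else "classic")
        print(
            error_ratio_expr(
                metric="cortex_request_duration_seconds",
                error_selector='status_code=~"5..", ' + route,
                total_selector=route,
                by="cluster, namespace",
                window="5m",
                native=native,
            )
        )
```

Run as a script, this prints the classic and native forms of the `MimirRulerRemoteEvaluationFailing` expression; the two differ only in how the request count is obtained.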