Adjust and rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus; Remove ThanosSidecarPrometheusDown alert; Remove unused thanos_sidecar_last_heartbeat_success_time_seconds metrics #4508

Merged: 4 commits, Sep 24, 2021
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -13,6 +13,10 @@ We use *breaking :warning:* to mark changes that are not backward compatible (re
 ### Added
 - [#4680](https://github.com/thanos-io/thanos/pull/4680) Query: add `exemplar.partial-response` flag to control partial response.

+### Fixed
+
+- [#4508](https://github.com/thanos-io/thanos/pull/4508) Adjust and rename `ThanosSidecarUnhealthy` to `ThanosSidecarNoConnectionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics.
+
 ## v0.23.0 - In Progress

 ### Added
6 changes: 0 additions & 6 deletions cmd/thanos/sidecar.go
@@ -138,10 +138,6 @@ func runSidecar(
 Name: "thanos_sidecar_prometheus_up",
 Help: "Boolean indicator whether the sidecar can reach its Prometheus peer.",
 })
-lastHeartbeat := promauto.With(reg).NewGauge(prometheus.GaugeOpts{
-Name: "thanos_sidecar_last_heartbeat_success_time_seconds",
-Help: "Timestamp of the last successful heartbeat in seconds.",
-})

 ctx, cancel := context.WithCancel(context.Background())
 g.Add(func() error {
@@ -191,7 +187,6 @@ func runSidecar(
 )
 promUp.Set(1)
 statusProber.Ready()
-lastHeartbeat.SetToCurrentTime()
 return nil
 })
 if err != nil {
@@ -213,7 +208,6 @@ func runSidecar(
 promUp.Set(0)
 } else {
 promUp.Set(1)
-lastHeartbeat.SetToCurrentTime()
 }

 return nil
24 changes: 8 additions & 16 deletions examples/alerts/alerts.md
@@ -296,16 +296,6 @@ rules:
 ```yaml mdox-exec="cat examples/tmp/thanos-sidecar.yaml"
 name: thanos-sidecar
 rules:
-- alert: ThanosSidecarPrometheusDown
-annotations:
-description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
-runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
-summary: Thanos Sidecar cannot connect to Prometheus
-expr: |
-thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
-for: 5m
-labels:
-severity: critical
 - alert: ThanosSidecarBucketOperationsFailed
 annotations:
 description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
@@ -316,14 +306,16 @@ rules:
 for: 5m
 labels:
 severity: critical
-- alert: ThanosSidecarUnhealthy
+- alert: ThanosSidecarNoConnectionToStartedPrometheus
 annotations:
-description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}}
-seconds.
-runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
-summary: Thanos Sidecar is unhealthy.
+description: Thanos Sidecar {{$labels.instance}} is unhealthy.
+runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
+summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
+healthy and has reloaded WAL.
 expr: |
-time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
+thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
+AND on (namespace, pod)
+prometheus_tsdb_data_replay_duration_seconds != 0
 for: 5m
 labels:
 severity: critical
24 changes: 8 additions & 16 deletions examples/alerts/alerts.yaml
@@ -301,16 +301,6 @@ groups:
 severity: warning
 - name: thanos-sidecar
 rules:
-- alert: ThanosSidecarPrometheusDown
-annotations:
-description: Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.
-runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown
-summary: Thanos Sidecar cannot connect to Prometheus
-expr: |
-thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
-for: 5m
-labels:
-severity: critical
 - alert: ThanosSidecarBucketOperationsFailed
 annotations:
 description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
@@ -321,14 +311,16 @@ groups:
 for: 5m
 labels:
 severity: critical
-- alert: ThanosSidecarUnhealthy
+- alert: ThanosSidecarNoConnectionToStartedPrometheus
 annotations:
-description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than
-{{$value}} seconds.
-runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy
-summary: Thanos Sidecar is unhealthy.
+description: Thanos Sidecar {{$labels.instance}} is unhealthy.
+runbook_url: https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus
+summary: Thanos Sidecar cannot access Prometheus, even though Prometheus seems
+healthy and has reloaded WAL.
 expr: |
-time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}) >= 240
+thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
+AND on (namespace, pod)
+prometheus_tsdb_data_replay_duration_seconds != 0
 for: 5m
 labels:
 severity: critical
Expand Down
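Reviewer note: the new expression relies on PromQL's `and on (...)` set operator, which keeps a left-hand sample only when some right-hand sample carries the same values for the listed labels. Below is a minimal Go sketch of that matching, not Prometheus code; `Sample`, `andOn`, `filter` and `firingInstances` are hypothetical names, and the input values are the ones the test series in `examples/alerts/tests.yaml` hold at minute 10.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Sample is one instant-vector element: a label set plus a value.
// (Hypothetical type for illustration only.)
type Sample struct {
	Labels map[string]string
	Value  float64
}

// matchKey builds a signature from only the labels named in on.
func matchKey(labels map[string]string, on []string) string {
	parts := make([]string, 0, len(on))
	for _, name := range on {
		parts = append(parts, name+"="+labels[name])
	}
	return strings.Join(parts, ",")
}

// andOn keeps each left-hand sample whose on-labels match at least one
// right-hand sample. The result carries the left-hand labels and value.
func andOn(left, right []Sample, on []string) []Sample {
	seen := make(map[string]bool)
	for _, r := range right {
		seen[matchKey(r.Labels, on)] = true
	}
	var out []Sample
	for _, l := range left {
		if seen[matchKey(l.Labels, on)] {
			out = append(out, l)
		}
	}
	return out
}

// filter keeps samples whose value satisfies keep, like `== 0` or `!= 0`.
func filter(in []Sample, keep func(float64) bool) []Sample {
	var out []Sample
	for _, s := range in {
		if keep(s.Value) {
			out = append(out, s)
		}
	}
	return out
}

// firingInstances evaluates the alert condition once and returns the
// sorted instance labels of the sidecars that satisfy it.
func firingInstances(up, replay []Sample) []string {
	left := filter(up, func(v float64) bool { return v == 0 })
	right := filter(replay, func(v float64) bool { return v != 0 })
	var names []string
	for _, s := range andOn(left, right, []string{"namespace", "pod"}) {
		names = append(names, s.Labels["instance"])
	}
	sort.Strings(names)
	return names
}

func main() {
	up := []Sample{
		{map[string]string{"namespace": "production", "pod": "prometheus-0", "instance": "thanos-sidecar-0"}, 0},
		{map[string]string{"namespace": "production", "pod": "prometheus-1", "instance": "thanos-sidecar-1"}, 0},
	}
	replay := []Sample{
		{map[string]string{"namespace": "production", "pod": "prometheus-0", "instance": "prometheus-k8s-0"}, 0},
		{map[string]string{"namespace": "production", "pod": "prometheus-1", "instance": "prometheus-k8s-1"}, 10},
	}
	fmt.Println(firingInstances(up, replay)) // prints [thanos-sidecar-1]
}
```

Because the result keeps the left-hand label set, the fired alert carries the sidecar's `job` and `instance` plus `namespace` and `pod`, which is why the updated test expectations list all four.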
133 changes: 40 additions & 93 deletions examples/alerts/tests.yaml
@@ -7,127 +7,74 @@ evaluation_interval: 1m
 tests:
 - interval: 1m
 input_series:
-- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0"}'
-values: '5 10 43 17 11 0 0 0'
-- series: 'thanos_sidecar_last_heartbeat_success_time_seconds{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1"}'
-values: '4 9 42 15 10 0 0 0'
-promql_expr_test:
-- expr: time()
-eval_time: 1m
-exp_samples:
-- labels: '{}'
-value: 60
-- expr: time()
-eval_time: 2m
-exp_samples:
-- labels: '{}'
-value: 120
-- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
-eval_time: 2m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 43
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 42
-- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
-eval_time: 10m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 0
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 0
-- expr: max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
-eval_time: 11m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 0
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 0
-- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
-eval_time: 10m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 600
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 600
-- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance)
-eval_time: 11m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 660
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 660
-- expr: time() - max(thanos_sidecar_last_heartbeat_success_time_seconds{job="thanos-sidecar"}) by (job, instance) >= 600
-eval_time: 12m
-exp_samples:
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-0"}'
-value: 720
-- labels: '{job="thanos-sidecar", instance="thanos-sidecar-1"}'
-value: 720
+- series: 'thanos_sidecar_prometheus_up{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-0", pod="prometheus-0"}'
+values: '1x5 0x15'
+- series: 'thanos_sidecar_prometheus_up{namespace="production", job="thanos-sidecar", instance="thanos-sidecar-1", pod="prometheus-1"}'
+values: '1x4 0x15'
+- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-0", pod="prometheus-0"}'
+values: '4x5 0x5 5x15'
+- series: 'prometheus_tsdb_data_replay_duration_seconds{namespace="production", job="prometheus-k8s", instance="prometheus-k8s-1", pod="prometheus-1"}'
+values: '10x14 0x6'
 alert_rule_test:
 - eval_time: 1m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 - eval_time: 2m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 - eval_time: 3m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 - eval_time: 10m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 exp_alerts:
 - exp_labels:
 severity: critical
 job: thanos-sidecar
-instance: thanos-sidecar-0
-exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 600 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
-- exp_labels:
-severity: critical
-job: thanos-sidecar
 instance: thanos-sidecar-1
+namespace: production
+pod: prometheus-1
 exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 600 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
+description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
+runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
+summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
 - eval_time: 11m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 exp_alerts:
 - exp_labels:
 severity: critical
 job: thanos-sidecar
-instance: thanos-sidecar-0
-exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 660 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
-- exp_labels:
-severity: critical
-job: thanos-sidecar
 instance: thanos-sidecar-1
+namespace: production
+pod: prometheus-1
 exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 660 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
+description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
+runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
+summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
 - eval_time: 12m
-alertname: ThanosSidecarUnhealthy
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
 exp_alerts:
 - exp_labels:
 severity: critical
 job: thanos-sidecar
-instance: thanos-sidecar-0
+instance: thanos-sidecar-1
+namespace: production
+pod: prometheus-1
 exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy for more than 720 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
+description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy.'
+runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
+summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
+- eval_time: 20m
+alertname: ThanosSidecarNoConnectionToStartedPrometheus
+exp_alerts:
 - exp_labels:
 severity: critical
 job: thanos-sidecar
-instance: thanos-sidecar-1
+instance: thanos-sidecar-0
+namespace: production
+pod: prometheus-0
 exp_annotations:
-description: 'Thanos Sidecar thanos-sidecar-1 is unhealthy for more than 720 seconds.'
-runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy'
-summary: 'Thanos Sidecar is unhealthy.'
+description: 'Thanos Sidecar thanos-sidecar-0 is unhealthy.'
+runbook_url: 'https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus'
+summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.'
+
 - interval: 1m
 input_series:
 - series: 'prometheus_rule_evaluations_total{namespace="production", job="thanos-ruler", instance="thanos-ruler-0"}'
Expand Down
1 change: 1 addition & 0 deletions mixin/README.md
@@ -106,6 +106,7 @@ This project is intended to be used as a library. You can extend and customize d
 },
 sidecar+:: {
 selector: 'job=~".*thanos-sidecar.*"',
+thanosPrometheusCommonDimensions: 'namespace, pod',
 title: '%(prefix)sSidecar' % $.dashboard.prefix,
 },
 // TODO(kakkoyun): Fix naming convention: bucketReplicate
25 changes: 7 additions & 18 deletions mixin/alerts/sidecar.libsonnet
@@ -2,6 +2,7 @@
 local thanos = self,
 sidecar+:: {
 selector: error 'must provide selector for Thanos Sidecar alerts',
+thanosPrometheusCommonDimensions: error 'must provide commonDimensions between Thanos and Prometheus metrics for Sidecar alerts',
 dimensions: std.join(', ', std.objectFields(thanos.targetGroups) + ['job', 'instance']),
 },
 prometheusAlerts+:: {
@@ -10,20 +11,6 @@
 {
 name: 'thanos-sidecar',
 rules: [
-{
-alert: 'ThanosSidecarPrometheusDown',
-annotations: {
-description: 'Thanos Sidecar {{$labels.instance}}%s cannot connect to Prometheus.' % location,
-summary: 'Thanos Sidecar cannot connect to Prometheus',
-},
-expr: |||
-thanos_sidecar_prometheus_up{%(selector)s} == 0
-||| % thanos.sidecar,
-'for': '5m',
-labels: {
-severity: 'critical',
-},
-},
 {
 alert: 'ThanosSidecarBucketOperationsFailed',
 annotations: {
@@ -39,13 +26,15 @@
 },
 },
 {
-alert: 'ThanosSidecarUnhealthy',
+alert: 'ThanosSidecarNoConnectionToStartedPrometheus',
 annotations: {
-description: 'Thanos Sidecar {{$labels.instance}}%s is unhealthy for more than {{$value}} seconds.' % location,
-summary: 'Thanos Sidecar is unhealthy.',
+description: 'Thanos Sidecar {{$labels.instance}}%s is unhealthy.' % location,
+summary: 'Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.',
 },
 expr: |||
-time() - max by (%(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
+thanos_sidecar_prometheus_up{%(selector)s} == 0
+AND on (%(thanosPrometheusCommonDimensions)s)
+prometheus_tsdb_data_replay_duration_seconds != 0
 ||| % thanos.sidecar,
 'for': '5m',
 labels: {
Expand Down
1 change: 1 addition & 0 deletions mixin/config.libsonnet
@@ -46,6 +46,7 @@
 },
 sidecar+:: {
 selector: 'job=~".*thanos-sidecar.*"',
+thanosPrometheusCommonDimensions: 'namespace, pod',
 title: '%(prefix)sSidecar' % $.dashboard.prefix,
 },
 // TODO(kakkoyun): Fix naming convention: bucketReplicate
3 changes: 1 addition & 2 deletions mixin/runbook.md
@@ -85,9 +85,8 @@

 |Name|Summary|Description|Severity|Runbook|
 |---|---|---|---|---|
-|ThanosSidecarPrometheusDown|Thanos Sidecar cannot connect to Prometheus|Thanos Sidecar {{$labels.instance}} cannot connect to Prometheus.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarprometheusdown)|
 |ThanosSidecarBucketOperationsFailed|Thanos Sidecar bucket operations are failing|Thanos Sidecar {{$labels.instance}} bucket operations are failing|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarbucketoperationsfailed)|
-|ThanosSidecarUnhealthy|Thanos Sidecar is unhealthy.|Thanos Sidecar {{$labels.instance}} is unhealthy for more than {{$value}} seconds.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarunhealthy)|
+|ThanosSidecarNoConnectionToStartedPrometheus|Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL.|Thanos Sidecar {{$labels.instance}} is unhealthy.|critical|[https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus](https://github.com/thanos-io/thanos/tree/main/mixin/runbook.md#alert-name-thanossidecarnoconnectiontostartedprometheus)|

 ## thanos-store

2 changes: 1 addition & 1 deletion pkg/rules/rules_test.go
@@ -82,7 +82,7 @@ func testRulesAgainstExamples(t *testing.T, dir string, server rulespb.RulesServ
 {
 Name: "thanos-sidecar",
 File: filepath.Join(dir, "alerts.yaml"),
-Rules: []*rulespb.Rule{someAlert, someAlert, someAlert},
+Rules: []*rulespb.Rule{someAlert, someAlert},
 Interval: 60,
 PartialResponseStrategy: storepb.PartialResponseStrategy_ABORT,
 },