
ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay #3915

Closed
dgrisonnet opened this issue Mar 11, 2021 · 14 comments · Fixed by #4508

Comments

@dgrisonnet
Contributor

dgrisonnet commented Mar 11, 2021

Hello everyone! I wanted to start a discussion regarding Thanos sidecar alerts' behaviour during Prometheus WAL replay.

Currently, during Prometheus WAL replay, both the ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts can fire, depending on the duration of the replay.

To put more context around that, ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts are based on two different metrics:

  • thanos_sidecar_prometheus_up
  • thanos_sidecar_last_heartbeat_success_time_seconds

Both of these metrics are updated based on the HTTP response to a heartbeat sent to the /api/v1/status/config endpoint of the Prometheus instance to which the Thanos sidecar is attached. Only an HTTP 2XX response means that Prometheus is up and that the heartbeat succeeded.
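For reference, the two alerts are built on top of those metrics roughly as sketched below; the selectors, grouping labels, and thresholds are illustrative placeholders rather than the exact rules shipped in the Thanos mixin:

    # ThanosSidecarPrometheusDown (sketch): the sidecar reports its Prometheus as down.
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0

    # ThanosSidecarUnhealthy (sketch): the last successful heartbeat is too old.
    time() - max by (job, pod) (
      thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}
    ) >= 240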

This heartbeat mechanism works flawlessly, but it might not handle the case of a WAL replay correctly. During a WAL replay, Prometheus isn't ready and responds to queries with an HTTP 503 Service Unavailable until the replay is finished. This is the case for all of its API endpoints except /-/healthy, as Prometheus is still healthy despite not being ready.

What this means in practice is that during a WAL replay, both alerts fire and report that a Thanos sidecar is unhealthy and that its Prometheus instance is down. However, what really happens is that both the sidecar and Prometheus are healthy but not ready. One is waiting for the other to become ready and the other is finishing its WAL replay.

That being said, I think that the thanos_sidecar_prometheus_up metric should be updated to reflect Prometheus's health instead of its readiness. However, I am quite uncertain about the sidecar's healthiness: should it reflect the healthiness/readiness of its Prometheus instance, or should it report unhealthy when Prometheus isn't ready?

@simonpasquier
Contributor

My 2 cents (but I'm not a Thanos maintainer): do we need thanos_sidecar_prometheus_up and ThanosSidecarPrometheusDown at all? From reading the code and from what you said, both alerts would fire anyway.

@onprem
Member

onprem commented Mar 11, 2021

Looks like we need to change thanos_sidecar_prometheus_up to reflect Prometheus health (/-/healthy instead of /api/v1/status/config).

My 2 cents (but I'm not a Thanos maintainer): do we need thanos_sidecar_prometheus_up and ThanosSidecarPrometheusDown at all? From reading the code and from what you said, both alerts would fire anyway.

As of now, I agree that there is no point in having both if they are going to fire at the same time. Though, as proposed, if we change thanos_sidecar_prometheus_up to reflect Prometheus healthiness instead of readiness, having both of them makes sense. WDYT?

The issue of ThanosSidecarUnhealthy firing during WAL replay still remains though.

@idoqo
Contributor

idoqo commented Mar 14, 2021

Hi @prmsrswt, I'm trying to get familiar with the Thanos codebase and would love to take this up with some guidance. But I'm not sure where to change the endpoints.

@bwplotka
Member

We spent 30 minutes going through the code, so hopefully @idoqo now understands the codebase well enough if we would like to change this. But... I am not convinced we should.

First of all, I would exclude alerts from this discussion. We can arrange alerts in any way we want, and we could create inhibit rules to make sure dependencies are set. I would love to focus on what thanos_sidecar_prometheus_up is for. In my opinion, it's critical to know whether the sidecar CAN or CANNOT reach Prometheus and get the external labels. If it cannot, the sidecar is unusable (unhealthy). So why would we mark it up if Prometheus is not yet up (i.e. not ready to serve requests)? Happy to discuss, but to us up is similar to Prometheus's own up metric: it's up when it's scrapable, and if the scrape fails then up = 0. So we have similar behaviour now, no? 🤔

Maybe it's alerting we might want to adjust then?

@dgrisonnet
Contributor Author

Thank you for the clarification @bwplotka, it makes more sense now that you have explained it. I was wondering exactly whether my understanding of these metrics was correct before making any assumptions about the alerts, which is why I started this discussion.

As for alerting, I think the alerts should be made more resilient to WAL replays, since during these periods where the sidecar is unhealthy there is nothing the administrator can really do except wait for Prometheus to become ready. One way we could do that is by taking into account the rate of prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} when the sidecar is unhealthy, as sketched below.
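As a rough illustration of that idea (a hypothetical expression, not something that exists in the mixin), the heartbeat condition could be suppressed while Prometheus is still answering the config endpoint with 503s. The join label below is a placeholder: correlating sidecar and Prometheus series needs a label they actually share, which is exactly what the commonDimensions discussion further down addresses.

    # Hypothetical sketch: do not treat the sidecar as unhealthy while Prometheus
    # is still replying 503 to the config endpoint (i.e. during WAL replay).
    (
      time() - max by (namespace, pod) (thanos_sidecar_last_heartbeat_success_time_seconds)
    ) >= 240
    unless on (namespace)
      rate(prometheus_http_requests_total{code="503", handler="/api/v1/status/config"}[5m]) > 0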

On another note, is it really meaningful to keep both alerts? Currently, both of them occur for the same reason, which is the sidecar being unable to reach the Prometheus instance.

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@dgrisonnet
Contributor Author

dgrisonnet commented Jun 2, 2021

This issue is still valid.

cc @arajkumar @slashpai can you please have a look once you have some time on your hands?

@stale stale bot removed the stale label Jun 2, 2021
@bill3tt
Contributor

bill3tt commented Jun 29, 2021

@arajkumar raised this issue again in Slack here

@arajkumar
Contributor

Apparently I missed this discussion :) The proposed solution has already been discussed here. Let me address @bwplotka's concern w.r.t. alerts instead of tweaking the metrics.

@arajkumar
Contributor

arajkumar commented Jul 6, 2021

@dgrisonnet ,

One way we could do that is by taking into account the rate of prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} when the sidecar is unhealthy.

IMHO, prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} is correlated with thanos_sidecar_last_heartbeat_success_time_seconds and would exhibit the same problem during WAL replay.

How about making use of prometheus_tsdb_head_max_time_seconds, which is closely related to the WAL? Another option would be to add a gauge metric to Prometheus to indicate that a WAL replay is in progress. Or maybe the existing prometheus_tsdb_data_replay_duration_seconds [1], which tracks the duration of the WAL replay, could also be used?

[1] https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L211

CC: @simonpasquier @bwplotka

@arajkumar
Contributor

@dgrisonnet , @bwplotka

Yesterday, @paulfantom and I had a pretty long Slack pair-programming session :) He helped me tweak the alert expression by including the prometheus_tsdb_data_replay_duration_seconds metric from Prometheus. Below is the modified expression:

expr: |||
    time() - max by (%(commonDimensions)s, %(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
    AND on (%(commonDimensions)s) (
      min by (%(commonDimensions)s) (prometheus_tsdb_data_replay_duration_seconds) != 0
    )
  ||| % thanos.sidecar,

We had to introduce another label set (commonDimensions) which is common between the Prometheus and Thanos metrics.
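To make this concrete, assuming for illustration that commonDimensions renders to namespace, pod, that dimensions renders to job, and using a placeholder selector, the rendered PromQL would look roughly like this:

    time() - max by (namespace, pod, job) (
      thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}
    ) >= 240
    AND on (namespace, pod) (
      min by (namespace, pod) (prometheus_tsdb_data_replay_duration_seconds) != 0
    )

The intent is that the heartbeat condition only counts once the replay-duration gauge is non-zero, i.e. once Prometheus has finished replaying its WAL.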

However, @paulfantom raised a concern similar to @simonpasquier's about having two different alerts, ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown, for the same situation. @paulfantom's suggestion was to remove ThanosSidecarPrometheusDown as it seems to be a cause-based rather than a symptom-based alert.

It would be helpful if you folks could shed some light on this. Thank you!

@dgrisonnet
Contributor Author

I really like this approach based on the prometheus_tsdb_data_replay_duration_seconds 👍

I would also be in favor of removing ThanosSidecarPrometheusDown, but I am concerned that we will not be able to catch instances where Prometheus was never able to complete its WAL replay. For example when Prometheus is OOMKilled before completing its WAL replay.

@paulfantom
Contributor

paulfantom commented Jul 23, 2021

I am concerned that we will not be able to catch instances where Prometheus was never able to complete its WAL replay. For example when Prometheus is OOMKilled before completing its WAL replay.

Those should result in ThanosSidecarUnhealthy firing, shouldn't they? And if so, the symptom is the same and the correct investigation steps should be written in a runbook.

@stale

stale bot commented Sep 22, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Sep 22, 2021
bwplotka added a commit that referenced this issue Sep 24, 2021
…ionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics (#4508)

* Refactor sidecar alerts

Prior to this fix, ThanosSidecarUnhealthy would fire even when
Prometheus is busy with WAL replay. This would trigger a false positive alert.

This PR considers the prometheus_tsdb_data_replay_duration_seconds metric from
Prometheus for the ThanosSidecarUnhealthy alert. In order to correlate
Thanos and Prometheus metrics we need to specify common label(s) which
can be configured through the thanosPrometheusCommonDimensions jsonnet
variable.

This PR also removes ThanosSidecarPrometheusDown as it would fire at the same time as ThanosSidecarUnhealthy.

Fixes #3915.

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Simplify ThanosSidecarNoConnectionToStartedPrometheus using thanos_sidecar_prometheus_up

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Remove unused implementation of thanos_sidecar_last_heartbeat_success_time_seconds metric

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
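
Putting those commit messages together, the merged alert presumably combines thanos_sidecar_prometheus_up with the replay-duration gauge along these lines (a sketch with placeholder labels and selectors, not necessarily the exact rule from #4508):

    # ThanosSidecarNoConnectionToStartedPrometheus (sketch): fire only when the
    # sidecar cannot reach a Prometheus that has already finished replaying its WAL.
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
    AND on (namespace, pod)
      prometheus_tsdb_data_replay_duration_seconds != 0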