
ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts fire during Prometheus WAL replay #3915

Closed
dgrisonnet opened this issue Mar 11, 2021 · 14 comments · Fixed by #4508

Comments

@dgrisonnet
Contributor

dgrisonnet commented Mar 11, 2021

Hello everyone! I wanted to start a discussion regarding Thanos sidecar alerts' behaviour during Prometheus WAL replay.

Currently, during Prometheus WAL replay, both the ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts can fire, depending on the duration of the replay.

To put more context around that, ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown alerts are based on two different metrics:

  • thanos_sidecar_prometheus_up
  • thanos_sidecar_last_heartbeat_success_time_seconds

Both of these metrics are updated based on the HTTP response to a heartbeat sent to the /api/v1/status/config endpoint of the Prometheus instance to which the Thanos sidecar is attached. Only an HTTP 2XX response means that Prometheus is up and that the heartbeat succeeded.
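For reference, the two alerts are built on top of those metrics roughly as sketched below; the selectors, grouping labels, and thresholds are illustrative placeholders rather than the exact rules shipped in the Thanos mixin:

    # ThanosSidecarPrometheusDown (sketch): the sidecar reports its Prometheus as down.
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0

    # ThanosSidecarUnhealthy (sketch): the last successful heartbeat is too old.
    time() - max by (job, pod) (
      thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}
    ) >= 240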

This heartbeat mechanism works flawlessly, but it might not handle the case of a WAL replay correctly. During a WAL replay, Prometheus isn't ready and responds to queries with an HTTP 503 Service Unavailable until the replay is finished. This is the case for all of its API endpoints except /-/healthy, as Prometheus is still healthy despite not being ready.

What this means in practice is that during a WAL replay, both alerts fire and report that a Thanos sidecar is unhealthy and that its Prometheus instance is down. However, what really happens is that both the sidecar and Prometheus are healthy but not ready. One is waiting for the other to become ready and the other is finishing its WAL replay.

That being said, I think that the thanos_sidecar_prometheus_up metric should be updated to reflect Prometheus's health instead of its readiness. However, I am quite uncertain about the sidecar's healthiness: should it reflect the healthiness/readiness of its Prometheus instance, or should it report unhealthy when Prometheus isn't ready?

@simonpasquier
Contributor

My 2 cents (but I'm not a Thanos maintainer): do we need thanos_sidecar_prometheus_up and ThanosSidecarPrometheusDown at all? From reading the code and from what you said, both alerts would fire anyway.

@onprem
Member

onprem commented Mar 11, 2021

Looks like we need to change thanos_sidecar_prometheus_up to reflect Prometheus health (/-/healthy instead of /api/v1/status/config).

My 2 cents (but I'm not a Thanos maintainer): do we need thanos_sidecar_prometheus_up and ThanosSidecarPrometheusDown at all? From reading the code and from what you said, both alerts would fire anyway.

As of now, I agree that there is no point in having both if they are going to fire at the same time. Though, as proposed, if we change thanos_sidecar_prometheus_up to reflect Prometheus healthiness instead of readiness, having both of them makes sense. WDYT?

The issue of ThanosSidecarUnhealthy firing during WAL replay still remains though.

@idoqo
Contributor

idoqo commented Mar 14, 2021

Hi @prmsrswt, I'm trying to get familiar with the Thanos codebase and would love to take this up with some guidance. But I'm not sure where to change the endpoints.

@bwplotka
Member

We spent 30 minutes going through the code, so hopefully @idoqo now understands the codebase well enough if we would like to change this. But... I am not convinced we should.

First of all, I would exclude alerts from this discussion. We can arrange alerts in any way we want, and we could create inhibit rules to make sure dependencies are set. I would love to focus on what thanos_sidecar_prometheus_up is for. In my opinion, it's critical to know whether the sidecar CAN or CANNOT reach Prometheus and get the external labels. If it cannot, the sidecar is unusable (unhealthy). So why would we mark it up if Prometheus is not yet up (i.e. not ready to serve requests)? Happy to discuss, but to us up is similar to Prometheus's own up metric: it's up when it's scrapable, and if the scrape fails then up = 0. So we have similar behaviour now, no? 🤔

Maybe it's alerting we might want to adjust then?

@dgrisonnet
Contributor Author

Thank you for the clarification @bwplotka, it makes more sense now that you have explained it. I was wondering exactly whether my understanding of these metrics was correct before making any assumptions about the alerts, which is why I started this discussion.

As for alerting, I think the alerts should be made more resilient to WAL replays, since during these periods where the sidecar is unhealthy there is nothing the administrator can really do except wait for Prometheus to become ready. One way we could do that is by taking into account the rate of prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} when the sidecar is unhealthy, as sketched below.
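As a rough illustration of that idea (a hypothetical expression, not something that exists in the mixin), the heartbeat condition could be suppressed while Prometheus is still answering the config endpoint with 503s. The join label below is a placeholder: correlating sidecar and Prometheus series needs a label they actually share, which is exactly what the commonDimensions discussion further down addresses.

    # Hypothetical sketch: do not treat the sidecar as unhealthy while Prometheus
    # is still replying 503 to the config endpoint (i.e. during WAL replay).
    (
      time() - max by (namespace, pod) (thanos_sidecar_last_heartbeat_success_time_seconds)
    ) >= 240
    unless on (namespace)
      rate(prometheus_http_requests_total{code="503", handler="/api/v1/status/config"}[5m]) > 0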

On another note, is it really meaningful to keep both alerts? Currently, both of them occur for the same reason, which is the sidecar being unable to reach the Prometheus instance.

@stale

stale bot commented Jun 2, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Jun 2, 2021
@dgrisonnet
Contributor Author

dgrisonnet commented Jun 2, 2021

This issue is still valid.

cc @arajkumar @slashpai can you please have a look once you have some time on your hands?

@stale stale bot removed the stale label Jun 2, 2021
@bill3tt
Contributor

bill3tt commented Jun 29, 2021

@arajkumar raised this issue again in Slack here

@arajkumar
Contributor

Apparently I missed this discussion :) The proposed solution has already been discussed here. Let me address @bwplotka's concern w.r.t. alerts instead of tweaking the metrics.

@arajkumar
Contributor

arajkumar commented Jul 6, 2021

@dgrisonnet ,

One way we could do that is by taking into account the rate of prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} when the sidecar is unhealthy.

IMHO, prometheus_http_requests_total{code="503",handler="/api/v1/status/config"} is correlated with thanos_sidecar_last_heartbeat_success_time_seconds and would exhibit the same problem during WAL replay.

How about making use of prometheus_tsdb_head_max_time_seconds, which is closely related to the WAL? Another option would be to add a gauge metric to Prometheus to indicate that a WAL replay is in progress. Or maybe the existing prometheus_tsdb_data_replay_duration_seconds [1], which tracks the duration of the WAL replay, could also be used?

[1] https://github.com/prometheus/prometheus/blob/main/tsdb/head.go#L211

CC: @simonpasquier @bwplotka

@arajkumar
Contributor

@dgrisonnet , @bwplotka

Yesterday, @paulfantom and I had a pretty long Slack pair-programming session :) He helped me tweak the alert expression by including the prometheus_tsdb_data_replay_duration_seconds metric from Prometheus. Below is the modified expression:

expr: |||
    time() - max by (%(commonDimensions)s, %(dimensions)s) (thanos_sidecar_last_heartbeat_success_time_seconds{%(selector)s}) >= 240
    AND on (%(commonDimensions)s) (
      min by (%(commonDimensions)s) (prometheus_tsdb_data_replay_duration_seconds) != 0
    )
  ||| % thanos.sidecar,

We had to introduce another label set (commonDimensions) which is common between the Prometheus and Thanos metrics.
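To make this concrete, assuming for illustration that commonDimensions renders to namespace, pod, that dimensions renders to job, and using a placeholder selector, the rendered PromQL would look roughly like this:

    time() - max by (namespace, pod, job) (
      thanos_sidecar_last_heartbeat_success_time_seconds{job=~".*thanos-sidecar.*"}
    ) >= 240
    AND on (namespace, pod) (
      min by (namespace, pod) (prometheus_tsdb_data_replay_duration_seconds) != 0
    )

The intent is that the heartbeat condition only counts once the replay-duration gauge is non-zero, i.e. once Prometheus has finished replaying its WAL.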

However, @paulfantom raised a concern similar to @simonpasquier's about having two different alerts, ThanosSidecarUnhealthy and ThanosSidecarPrometheusDown, for the same situation. @paulfantom's suggestion was to remove ThanosSidecarPrometheusDown as it seems to be a cause-based rather than a symptom-based alert.

It would be helpful if you folks could shed some light on this. Thank you!

@dgrisonnet
Contributor Author

I really like this approach based on the prometheus_tsdb_data_replay_duration_seconds 👍

I would also be in favor of removing ThanosSidecarPrometheusDown, but I am concerned that we will not be able to catch instances where Prometheus was never able to complete its WAL replay. For example when Prometheus is OOMKilled before completing its WAL replay.

@paulfantom
Contributor

paulfantom commented Jul 23, 2021

I am concerned that we will not be able to catch instances where Prometheus was never able to complete its WAL replay. For example when Prometheus is OOMKilled before completing its WAL replay.

Those should result in ThanosSidecarUnhealthy firing, shouldn't they? And if so, the symptom is the same and the correct investigation steps should be written in a runbook.

@stale

stale bot commented Sep 22, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

@stale stale bot added the stale label Sep 22, 2021
bwplotka added a commit that referenced this issue Sep 24, 2021
…ionToStartedPrometheus`; Remove `ThanosSidecarPrometheusDown` alert; Remove unused `thanos_sidecar_last_heartbeat_success_time_seconds` metrics (#4508)

* Refactor sidecar alerts

Prior to this fix, ThanosSidecarUnhealthy would fire even when
Prometheus is busy with WAL replay. This would trigger a false positive alert.

This PR considers the prometheus_tsdb_data_replay_duration_seconds metric from
Prometheus for the ThanosSidecarUnhealthy alert. In order to correlate
Thanos and Prometheus metrics we need to specify common label(s) which
can be configured through the thanosPrometheusCommonDimensions jsonnet
variable.

This PR also removes ThanosSidecarPrometheusDown as it would fire at the same time as ThanosSidecarUnhealthy.

Fixes #3915.

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Rename ThanosSidecarUnhealthy to ThanosSidecarNoConnectionToStartedPrometheus

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Simplify ThanosSidecarNoConnectionToStartedPrometheus using thanos_sidecar_prometheus_up

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

* Remove unused implementation of thanos_sidecar_last_heartbeat_success_time_seconds metric

Signed-off-by: Arunprasad Rajkumar <arajkuma@redhat.com>

Co-authored-by: Bartlomiej Plotka <bwplotka@gmail.com>
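
Putting those commit messages together, the merged alert presumably combines thanos_sidecar_prometheus_up with the replay-duration gauge along these lines (a sketch with placeholder labels and selectors, not necessarily the exact rule from #4508):

    # ThanosSidecarNoConnectionToStartedPrometheus (sketch): fire only when the
    # sidecar cannot reach a Prometheus that has already finished replaying its WAL.
    thanos_sidecar_prometheus_up{job=~".*thanos-sidecar.*"} == 0
    AND on (namespace, pod)
      prometheus_tsdb_data_replay_duration_seconds != 0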