Thanos Sidecar heartbeat failed due to "context deadline exceeded" #4121

trutty · 2021-04-29T13:23:48Z

Is your proposal related to a problem?

The Thanos Sidecar container fails its heartbeat against Prometheus with the messages:

 level=warn ts=2021-04-29T12:58:27.849316431Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:58:57.84932781Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:59:27.84938737Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:59:57.849164466Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:00:27.849479648Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:00:57.849291086Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:01:27.849379989Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:01:57.849290597Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:02:27.849238919Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"

The heartbeat checks Prometheus' /api/v1/status/config endpoint every 30s, and each request times out after 5s (see https://github.com/thanos-io/thanos/blob/main/cmd/thanos/sidecar.go#L188).
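
To illustrate how a per-request deadline produces the "context deadline exceeded" error in the log lines above, here is a minimal Go sketch. It is not the Thanos implementation: fetchConfig and timesOut are hypothetical names, the server is a local stand-in for Prometheus, and the durations are scaled down from 5s/24s to milliseconds so the example runs quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetchConfig issues one GET bounded by a per-request deadline.
// The name and signature are illustrative, not the Thanos API.
func fetchConfig(ctx context.Context, url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("perform GET request against %s: %w", url, err)
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

// timesOut starts a stand-in server that answers after delay and reports
// whether fetchConfig hits its deadline before the response arrives.
func timesOut(delay, timeout time.Duration) bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			fmt.Fprint(w, "ok")
		case <-r.Context().Done(): // the client gave up
		}
	}))
	defer srv.Close()

	_, err := fetchConfig(context.Background(), srv.URL, timeout)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	// Scaled-down version of the report: the server is slower than the deadline.
	fmt.Println("slow server times out:", timesOut(300*time.Millisecond, 50*time.Millisecond))
	fmt.Println("fast server times out:", timesOut(0, 2*time.Second))
}
```

Whenever the endpoint needs longer than the deadline, every request fails this way, regardless of how often it is retried.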

This 5s timeout is too short for our setup, because Prometheus always takes longer to respond. Measured from inside the thanos-sidecar container:

/tmp $ time wget http://127.0.0.1:9090/api/v1/status/config
Connecting to 127.0.0.1:9090 (127.0.0.1:9090)
saving to 'config'
config               100% |********************| 6845k  0:00:00 ETA
'config' saved
real	0m 24.15s
user	0m 0.00s
sys	0m 0.01s

Describe the solution you'd like

I think it would be good to separate the UpdateLabels and heartbeat functionality. The heartbeat could use Prometheus' /-/ready endpoint, and UpdateLabels could run without a timeout.
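
A rough Go sketch of this idea, under stated assumptions: the heartbeat only checks /-/ready, so a slow config endpoint no longer affects it. The helper names (get, fakePrometheus, probe) are hypothetical, the server is a local stand-in for Prometheus, and the durations are scaled down. It is not how the sidecar is implemented.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// get performs a GET with a deadline and reports whether it returned 200 in time.
func get(url string, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// fakePrometheus is a stand-in with a fast /-/ready and a slow config endpoint.
func fakePrometheus(configDelay time.Duration) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/-/ready", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/v1/status/config", func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(configDelay):
			fmt.Fprint(w, `{"status":"success"}`)
		case <-r.Context().Done(): // the client gave up
		}
	})
	return httptest.NewServer(mux)
}

// probe applies the same deadline to both endpoints and reports which succeed.
func probe(configDelay, timeout time.Duration) (readyOK, configOK bool) {
	srv := fakePrometheus(configDelay)
	defer srv.Close()

	readyOK = get(srv.URL+"/-/ready", timeout)
	configOK = get(srv.URL+"/api/v1/status/config", timeout)
	return readyOK, configOK
}

func main() {
	ready, config := probe(500*time.Millisecond, 100*time.Millisecond)
	fmt.Println("heartbeat via /-/ready ok:", ready)
	fmt.Println("config fetch within deadline:", config)
}
```

With this split, a slow config response would only delay label updates, while the liveness signal stays cheap and quick.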

Describe alternatives you've considered

Alternatives could be:

- have Prometheus respond with its config in under 5s
- make the timeout for the heartbeat/UpdateLabels configurable
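
The second alternative could look roughly like the Go sketch below. The flag names heartbeat.interval and heartbeat.timeout are made up for illustration and are not existing Thanos flags. The sketch uses the standard library flag package rather than whatever flag library the sidecar actually uses, and the defaults mirror the 30s/5s values described above.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"time"
)

// heartbeatConfig holds the two tunables discussed in this issue.
type heartbeatConfig struct {
	Interval time.Duration
	Timeout  time.Duration
}

// parseHeartbeatFlags parses hypothetical flags; the names are assumptions.
func parseHeartbeatFlags(args []string) (heartbeatConfig, error) {
	cfg := heartbeatConfig{}
	fs := flag.NewFlagSet("sidecar-sketch", flag.ContinueOnError)
	fs.DurationVar(&cfg.Interval, "heartbeat.interval", 30*time.Second, "how often to query Prometheus")
	fs.DurationVar(&cfg.Timeout, "heartbeat.timeout", 5*time.Second, "deadline for a single request")
	if err := fs.Parse(args); err != nil {
		return cfg, err
	}
	if cfg.Interval <= 0 || cfg.Timeout <= 0 {
		return cfg, errors.New("heartbeat.interval and heartbeat.timeout must be positive")
	}
	return cfg, nil
}

func main() {
	// Raise the timeout above the 24s response time measured in this report.
	cfg, err := parseHeartbeatFlags([]string{"-heartbeat.timeout=30s"})
	if err != nil {
		fmt.Println("invalid flags:", err)
		return
	}
	fmt.Printf("interval=%s timeout=%s\n", cfg.Interval, cfg.Timeout)
}
```

A larger timeout makes the heartbeat tolerate a slow Prometheus, at the cost of noticing a genuinely unresponsive one later.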

wiardvanrij · 2021-04-29T20:51:59Z

I understand your issue, and the proposed solutions make sense in that case. However, I think the core issue here is the response time of your Prometheus. Do you have any idea why it takes 24(!!) seconds? :)
I checked my own setup: my status/config response is only 50 times smaller, yet the request takes 0.02s.

So your config is not so large that its size alone would explain the 24s request time. Any chance you could share your config?

Maybe we can look into resolving the long request time. If it turns out to be a legitimate case, we could definitely make some changes.

trutty · 2021-05-11T13:34:57Z

We noticed that our Prometheus is currently throttling at ~70% with 35 CPUs and 1582 jobs. I guess that is the reason for the slow response time, so we are first going to analyse this situation. Hopefully the response time will improve afterwards.

trutty · 2021-05-12T06:36:31Z

After refactoring our ServiceMonitor objects (thanks to prometheus/prometheus#8014) and thereby reducing the number of scrape jobs (config size now at 2285k), the CPU usage dropped from ~30 cores to ~7 cores. The config endpoint now responds in < 0.5s, so the "heartbeat failed" problem is fixed.

Thanks @wiardvanrij for pointing me in the right direction of the root cause :)
I will close this issue as no change is needed in the Thanos source code.

laileman · 2024-08-20T07:35:49Z

Why not use /-/ready?

Why use UpdateLabels to check whether Prometheus is up? Does UpdateLabels need to run that often?
Sometimes a huge Prometheus config can't be avoided. I don't think setting large values for get_config_interval and get_config_timeout is a good solution, because I need responsive alerting.
I think the heartbeat should be a very lightweight health check; a plain GET is enough. There is no need to touch the config at all.

wiardvanrij added the "component: sidecar" and "needs-investigation" labels Apr 29, 2021

trutty closed this as completed May 12, 2021

zvlb mentioned this issue Aug 9, 2022

sidecar: Add args prometheus.get_config_interval and prometheus.timeout_get_config #5573

Merged
