Thanos Sidecar heartbeat failed due to "context deadline exceeded" #4121

Closed
trutty opened this issue Apr 29, 2021 · 4 comments

Comments

@trutty
Contributor

trutty commented Apr 29, 2021

Is your proposal related to a problem?

The Thanos Sidecar container fails its heartbeat against Prometheus with the messages:

 level=warn ts=2021-04-29T12:58:27.849316431Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:58:57.84932781Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:59:27.84938737Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T12:59:57.849164466Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:00:27.849479648Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:00:57.849291086Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:01:27.849379989Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:01:57.849290597Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"
 level=warn ts=2021-04-29T13:02:27.849238919Z caller=sidecar.go:180 msg="heartbeat failed" err="perform GET request against http://127.0.0.1:9090/api/v1/status/config: Get \"http://127.0.0.1:9090/api/v1/status/config\": context deadline exceeded"

The heartbeat checks Prometheus' /api/v1/status/config endpoint every 30s, and each request times out after only 5s (see https://github.com/thanos-io/thanos/blob/main/cmd/thanos/sidecar.go#L188).
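
To make the interaction between the 30s interval and the 5s deadline concrete, here is a minimal Python sketch. It is not the Thanos implementation (which is written in Go); `heartbeat_once`, `run_heartbeats`, and the `probe` callable are hypothetical names used only for illustration, and a worker thread with a result timeout stands in for the per-request deadline.

```python
import concurrent.futures
import time


def heartbeat_once(probe, deadline_s):
    """Run probe() once; return None on success or an error string.

    Illustrative only: a worker thread plus future.result(timeout=...)
    stands in for the per-request deadline described above.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=deadline_s)
        return None
    except concurrent.futures.TimeoutError:
        # The probe is still running, but the heartbeat has already failed.
        return "deadline exceeded"
    except Exception as exc:  # the probe itself raised
        return f"probe failed: {exc}"
    finally:
        # Do not block on a slow probe; its thread finishes in the background.
        pool.shutdown(wait=False)


def run_heartbeats(probe, interval_s, deadline_s, rounds):
    """Call heartbeat_once every interval_s seconds, `rounds` times."""
    results = []
    for _ in range(rounds):
        started = time.monotonic()
        results.append(heartbeat_once(probe, deadline_s))
        # Sleep for whatever remains of the interval after the attempt.
        remaining = interval_s - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return results
```

With `interval_s=30` and `deadline_s=5`, any probe that needs longer than 5s reports a failure on every round, even though the endpoint does eventually answer.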

This 5s timeout is too short for our setup, because the response from Prometheus always takes longer. From inside thanos-sidecar:

/tmp $ time wget http://127.0.0.1:9090/api/v1/status/config
Connecting to 127.0.0.1:9090 (127.0.0.1:9090)
saving to 'config'
config               100% |********************| 6845k  0:00:00 ETA
'config' saved
real	0m 24.15s
user	0m 0.00s
sys	0m 0.01s

Describe the solution you'd like

I think it would be good to separate the UpdateLabels and heartbeat functionality. The heartbeat could use Prometheus' /-/ready endpoint, and UpdateLabels could run without a timeout.
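
A rough sketch of that separation, in Python rather than the Go used by Thanos. The two paths are the Prometheus endpoints named in this issue; the function names `is_ready` and `fetch_config` and the injectable `opener` parameter are assumptions made for this example, not existing Thanos or Prometheus APIs.

```python
import urllib.request

READY_PATH = "/-/ready"                # lightweight readiness endpoint
CONFIG_PATH = "/api/v1/status/config"  # potentially large config payload


def is_ready(base_url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Heartbeat: a small GET with a short timeout; True only on HTTP 200."""
    try:
        with opener(base_url.rstrip("/") + READY_PATH, timeout=timeout_s) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, timeouts, and non-2xx responses
        # (urllib raises HTTPError, an OSError subclass, for those).
        return False


def fetch_config(base_url, timeout_s=None, opener=urllib.request.urlopen):
    """Label update: fetch the full config; timeout_s=None means no timeout."""
    with opener(base_url.rstrip("/") + CONFIG_PATH, timeout=timeout_s) as resp:
        return resp.read()
```

In this arrangement a slow config response would delay the label refresh but would not mark the heartbeat as failed, because the two requests no longer share a deadline.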

Describe alternatives you've considered

Alternatives could be:

  • have Prometheus respond with its config in under 5s
  • have a configurable timeout duration for the heartbeat/UpdateLabels
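
The second alternative could look roughly like the following Python sketch. The flag names `--heartbeat-interval` and `--heartbeat-timeout`, the `HeartbeatConfig` type, and the rule that the timeout must not exceed the interval are all assumptions of this example; only the 30s and 5s defaults come from the behaviour described in this issue.

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatConfig:
    interval_s: float = 30.0  # how often to probe (value from this issue)
    timeout_s: float = 5.0    # per-request deadline (value from this issue)

    def __post_init__(self):
        if self.interval_s <= 0 or self.timeout_s <= 0:
            raise ValueError("interval and timeout must be positive")
        if self.timeout_s > self.interval_s:
            # Assumption of this sketch: a probe should finish
            # before the next one is due to start.
            raise ValueError("timeout must not exceed interval")


def parse_heartbeat_config(argv):
    """Build a HeartbeatConfig from command-line style arguments."""
    # Flag names are hypothetical, chosen only for this example.
    parser = argparse.ArgumentParser()
    parser.add_argument("--heartbeat-interval", type=float, default=30.0)
    parser.add_argument("--heartbeat-timeout", type=float, default=5.0)
    args = parser.parse_args(argv)
    return HeartbeatConfig(args.heartbeat_interval, args.heartbeat_timeout)
```

For a response time of about 24s, `parse_heartbeat_config(["--heartbeat-timeout", "25"])` would give enough headroom while keeping the default 30s interval.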

@wiardvanrij
Member

wiardvanrij commented Apr 29, 2021

I understand your issue, and the proposed solutions make sense in such a case. However, I think the core issue here is the response time of your Prometheus. Do you have any idea why it takes 24(!!) seconds? :)
I checked my own setup: my status/config response is only 50 times smaller, yet the request takes 0.02s.

So your config file is not large enough to explain the 24s request time. Do you have any idea why this is happening? Any chance you could share your config?

Maybe we can look into resolving the long request time. If it's a seriously 'legit' case, then we could definitely make some changes.

@trutty
Contributor Author

trutty commented May 11, 2021

We noticed that our Prometheus is currently throttling at ~70% with 35 CPUs and 1582 jobs. I guess that is the reason for the slow response time, so we are going to analyse this situation first. Hopefully the response time will improve afterwards.

@trutty
Contributor Author

trutty commented May 12, 2021

After refactoring our ServiceMonitor objects (thanks to prometheus/prometheus#8014) and thereby reducing the number of scrape jobs (config size now at 2285k), the CPU usage dropped from ~30 cores to ~7 cores. Prometheus' config endpoint now responds in < 0.5s, so the "heartbeat failed" problem is fixed.

Thanks @wiardvanrij for pointing me in the right direction of the root cause :)
I will close this issue as no change is needed in the Thanos source code.

@laileman

laileman commented Aug 20, 2024

Why not use /-/ready?

  1. Why use UpdateLabels to check whether Prometheus is up? Does UpdateLabels need to run very often?
  2. Sometimes a huge Prometheus config can't be avoided. I don't think it is good to set a large value for get_config_interval and get_config_timeout, because I need sensitive alerting.
  3. I think the heartbeat should be a very lightweight health check; a plain "GET" is perfectly fine. There is no need to do anything with the config.
