Thanos Sidecar heartbeat failed due to "context deadline exceeded" #4121
Comments
I understand your issue, and the described solutions make sense in that case. However, I think the core issue here is the response time of your Prometheus. Do you have any idea why it takes 24(!!) seconds? :) Your config file is presumably not so huge that it would explain a 24s request time. Any chance you could share your config? Maybe we can work out how to resolve the long request time. If it's a seriously 'legit' case, then we could definitely make some changes.
We noticed that our Prometheus is currently throttling at around 70% with 35 CPUs and 1582 jobs. I guess that is the reason for the slow response time, so we are first going to analyse this situation. Hopefully the response time will improve afterwards.
After refactoring our Thanks @wiardvanrij for pointing me in the right direction of the root cause :)
Why not use /-/ready?
Is your proposal related to a problem?
The Thanos Sidecar container fails its heartbeat against Prometheus with the messages:
The heartbeat is configured to check against Prometheus' endpoint at /api/v1/status/config every 30s. A timeout already occurs after 5s (see https://github.com/thanos-io/thanos/blob/main/cmd/thanos/sidecar.go#L188). This 5s timeout is not enough in our setup, as the response from Prometheus always takes longer. From inside thanos-sidecar:
Describe the solution you'd like
I think it would be good to separate the UpdateLabels and heartbeat functionality. The heartbeat could use Prometheus' /-/ready endpoint, and UpdateLabels could run without a timeout of its own.
Describe alternatives you've considered
Alternatives could be:
Additional context