Grafana agent huge host resource utilization #295
Attachment: agent.zip
Same problem after updating to agent v0.38.1; the Prometheus endpoint was available at the time. Do you have any updates?
UPD1:
Used the fake metrics generator (https://github.com/grafana/fake-metrics-generator) to simulate a metrics insert rate (7 instances).
Used a VictoriaMetrics cluster to collect metrics from VM1 (remote write address 172.18.200.55:7480). To reproduce (a config sketch follows):
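A minimal static-mode agent config sketch for this setup (not the exact file from agent.zip; the vminsert address is the one above, and the generator's scrape target is hypothetical):

```yaml
# sketch: scrape the fake-metrics-generator instances and remote_write to vminsert
metrics:
  wal_directory: /etc/agent/data
  global:
    scrape_interval: 15s
  configs:
    - name: default
      scrape_configs:
        - job_name: fake-metrics
          static_configs:
            - targets: ['localhost:9001']   # hypothetical generator address
      remote_write:
        - url: http://172.18.200.55:7480/insert/0/prometheus/api/v1/write
```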
What you will see: after a restart, the Grafana agent utilizes 100% CPU trying to send the WAL to the VM cluster; after some time (in my case 8 minutes) the agent state changes to unhealthy and it gets stuck with logs:
Okay, so we start vmstorage, expecting the WAL to be replicated to remote storage and everything to be okay, but no: the agent still has status unhealthy and utilizes 100% CPU. The VictoriaMetrics dashboard shows periodic peaks of inserts, and then no inserts at all.
UPD2: Found an issue on the Prometheus GitHub about the read ticker: prometheus/prometheus#13111. In v2.47.2 Prometheus rolled the read ticker out again (in 2.45+ this behavior had been changed to actually rely on an external actor calling Notify when a new segment needs to be read).
Grafana agent from v0.37.2 uses this version of Prometheus, and probably that is what causes these problems. Could you check it?
Hello Grafana team! Do you have any updates on this problem?
Added an issue to Prometheus: prometheus/prometheus#13317
How many logs are you tracking with the SD? My thought is that log tracking is starving metrics. If you remove the logs section, does the problem persist?
@mattdurham I don't know how many logs, but the promtail component in the agent has some troubles too. While the Loki endpoint can't receive logs for some time, the promtail component may use huge amounts of CPU. As a workaround we drop records older than 30 minutes, and once the Loki endpoint becomes available again everything is okay: logs are pushed, and there is no huge utilization. But for metrics this doesn't work; the Prometheus component doesn't have a setting to drop the WAL log after some period. And the strangest thing is this: even while the metrics endpoint is available, grafana-agent still can't push metrics until it is restarted.
So grafana-agent seems like a killer for VMs: while remote write is unavailable, the Prometheus component keeps trying to push metrics and takes more and more resources, for example while your long-term storage is under maintenance. And, to our surprise, even when the remote write endpoint is actually working, the Prometheus component reports errors; after a restart everything is okay and there are no errors.
Are you looking at memory usage reported by the operating system or by the metric? Do you know what your active series count is?
The series are scraped by the cadvisor and node-exporter components, and in general the problem is CPU utilization by grafana-agent. I'm not sure, but it looks like a misunderstanding. In my posts you can see all the configs and diagnostics.
Without getting a pprof profile it's going to be hard to narrow down. I would try separating out your metrics and logs gathering to see which is causing the CPU spike.
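For reference, a sketch of capturing a profile from the agent's built-in pprof handler (assuming static mode's default HTTP port 12345; as the reporter notes, the API may stop responding during the incident):

```shell
# 30-second CPU profile from the agent's debug endpoint
curl -o cpu.pprof "http://localhost:12345/debug/pprof/profile?seconds=30"
# heap snapshot for memory questions
curl -o heap.pprof "http://localhost:12345/debug/pprof/heap"
```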
Please show us the metrics for the number of shards running over time: prior to the issue, during the remote endpoint shutdown, and during the CPU spike. It's likely that remote write is just using all the available resources you've configured it to use / that it has available. Remote write will try to scale up when it notices a backlog of samples to send.
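The shard count is exposed via the standard prometheus_remote_storage_shards metric; if runaway scaling is the problem, shard growth can be capped in the standard remote_write queue_config (values here are illustrative, not recommendations):

```yaml
remote_write:
  - url: https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write
    queue_config:
      min_shards: 1
      max_shards: 10          # caps how far remote write may scale up on backlog
      max_samples_per_send: 500
```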
Hi @mattdurham! I tried to grab pprofs close to the incident, but the next second the debug API was unavailable.
Looking at the pprof, the CPU is primarily driven by the docker/cadvisor GetStats call. How many containers are being run? If you remove the cadvisor/docker integration, do you still see the issue?
7 containers of the metrics generator, plus grafana-agent; that is all that's deployed.
Checked, and no, it does not reproduce with cadvisor set to enabled: false (sketch below).
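The change, assuming the static-mode integrations block:

```yaml
integrations:
  cadvisor:
    enabled: false   # integration off; the rest of the agent config is unchanged
```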
@mattdurham
After that, cadvisor was disabled in the agent config and added as a separate container on the node with the same settings as the agent, and I didn't see problems with CPU. It looks like a resolution, but it is not comfortable: you need to deploy two containers instead of the all-in-one grafana-agent. A sketch of the standalone setup follows.
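A sketch of the two-container workaround, assuming Docker Compose and the upstream cadvisor image (the tag and mounts are illustrative, taken from cadvisor's usual Docker run instructions):

```yaml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2   # illustrative tag
    network_mode: host
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```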
@mattdurham hello, any updates?
UP
Hi there 👋 On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025. To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)
What's wrong?
We are facing high CPU utilization on the host caused by the Grafana agent (iowait or system time).
From the logs, we see the Prometheus and Promtail components have some problems.
We run grafana-agent in Docker with network_mode=host (a compose sketch follows).
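Roughly like this, assuming Docker Compose (paths are illustrative; /etc/agent/agent.yaml is the image's default config location):

```yaml
services:
  agent:
    image: grafana/agent:v0.37.2
    network_mode: host
    restart: unless-stopped
    volumes:
      - ./agent.yaml:/etc/agent/agent.yaml   # static-mode config
      - agent-data:/etc/agent/data           # WAL and positions
volumes:
  agent-data:
```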
After restarting the container the problem disappears; therefore, restarting the container with debug-level logging on the problem host does not really help with diagnosis.
We can't get a pprof at the time of the problem; the grafana agent API does not respond.
Example promtail:
ts=2023-10-15T03:12:36.514004234Z caller=client.go:419 level=warn component=logs logs_config=local component=client host=loki-asia-front01.itrf.tech msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://loki-asia-front01/loki/api/v1/push\": context deadline exceeded"
ts=2023-10-15T03:13:03.783121812Z caller=dedupe.go:112 agent=prometheus instance=8ae6f456b9a2666ff9496060a50cb339 component=remote level=warn remote_name=8ae6f4-b0971a url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write\": context deadline exceeded"
This looks like a network issue, but while grafana-agent shows the error
msg="error sending batch, will retry" status=-1
in its logs, the endpoint loki-asia-front01 is available and healthy.
Example prometheus:
ts=2023-09-29T05:41:28.555060542Z caller=dedupe.go:112 agent=prometheus instance=8ae6f456b9a2666ff9496060a50cb339 component=remote level=warn remote_name=8ae6f4-b0971a url=https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write\": dial tcp 10.200.161.249:443: i/o timeout"
The Prometheus endpoint, same as the Loki one, is healthy and available.
Latest logs (only prometheus):
ts=2023-12-04T08:25:04.531162449Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write\": context deadline exceeded"
ts=2023-12-04T08:25:06.517765622Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write\": dial tcp 10.200.173.249:443: i/o timeout"
Curl from the host:
Curl from the grafana-agent container:
No problem with the endpoint =\
If you need any data about the compose files, Docker, hosts, or anything else, let me know.
Steps to reproduce
Can't reproduce it on demand. Tried blocking the promtail and prometheus endpoints with iptables (a sketch follows): I got the same errors in the logs, but utilization was not huge.
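The blocking is done roughly like this (a sketch; the endpoint IP is the one from the prometheus log above):

```shell
# simulate a remote-write outage: drop outbound traffic to the endpoint
iptables -A OUTPUT -d 10.200.161.249 -p tcp --dport 443 -j DROP
# restore connectivity afterwards
iptables -D OUTPUT -d 10.200.161.249 -p tcp --dport 443 -j DROP
```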
System information
Linux mongo02-kz 6.0.7-301.fc37.x86_64
Software version
grafana agent v0.37.2
Configuration
Logs
No response