Grafana agent huge host resource utilization #295

Open
b0kky opened this issue Dec 5, 2023 · 21 comments
Labels: bug (Something isn't working)


b0kky commented Dec 5, 2023

What's wrong?

We are facing high CPU utilization on the host caused by the Grafana Agent (iowait or system time).
In the logs we see that the Prometheus and Promtail components both have problems.
We run grafana-agent in Docker with network_mode=host.
After restarting the container the problem disappears, so restarting the container with debug logging enabled on the problem host does not really help with diagnosis.
We can't capture a pprof profile at the time of the problem, because the Grafana Agent API does not respond.
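When the API does respond, this is roughly how we try to grab profiles (assuming the agent's HTTP server on 127.0.0.1:12345 exposes the standard Go /debug/pprof endpoints; adjust host/port to your server config):

# hypothetical capture commands for when the agent is responsive
curl -s -o cpu.pprof 'http://127.0.0.1:12345/debug/pprof/profile?seconds=30'
curl -s -o heap.pprof 'http://127.0.0.1:12345/debug/pprof/heap'
go tool pprof -top cpu.pprof   # summarize the CPU profile locally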

Example promtail:
ts=2023-10-15T03:12:36.514004234Z caller=client.go:419 level=warn component=logs logs_config=local component=client host=loki-asia-front01.itrf.tech msg="error sending batch, will retry" status=-1 tenant= error="Post \"https://loki-asia-front01/loki/api/v1/push\": context deadline exceeded"
ts=2023-10-15T03:13:03.783121812Z caller=dedupe.go:112 agent=prometheus instance=8ae6f456b9a2666ff9496060a50cb339 component=remote level=warn remote_name=8ae6f4-b0971a url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write\": context deadline exceeded"

This looks like a network issue, but while grafana-agent shows the error msg="error sending batch, will retry" status=-1 in its logs,
the endpoint loki-asia-front01 is available and healthy.
Example prometheus:
ts=2023-09-29T05:41:28.555060542Z caller=dedupe.go:112 agent=prometheus instance=8ae6f456b9a2666ff9496060a50cb339 component=remote level=warn remote_name=8ae6f4-b0971a url=https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write\": dial tcp 10.200.161.249:443: i/o timeout"

The Prometheus endpoint, like the Loki one, is healthy and available.

Latest logs (only prometheus):
ts=2023-12-04T08:25:04.531162449Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write\": context deadline exceeded"
ts=2023-12-04T08:25:06.517765622Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write\": dial tcp 10.200.173.249:443: i/o timeout"

Curl from host:

*  start date: Nov 27 09:18:38 2023 GMT
*  expire date: Feb 25 09:18:37 2024 GMT
*  subjectAltName: host "metrics.sto-aws-stage" matched cert's "*.sto-aws-stage"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multiplexing
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* h2h3 [:method: POST]
* h2h3 [:path: /insert/0/prometheus/api/v1/import/csv?format=2:metric:ask,3:metric:bid,1:label:ticker,4:label:market]
* h2h3 [:scheme: https]
* h2h3 [:authority: metrics.sto-aws-stage]
* h2h3 [user-agent: curl/7.85.0]
* h2h3 [accept: */*]
* h2h3 [content-length: 19]
* h2h3 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x5633ba8b5870)
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
> POST /insert/0/prometheus/api/v1/import/csv?format=2:metric:ask,3:metric:bid,1:label:ticker,4:label:market HTTP/2
> Host: metrics.sto-aws-stage
> user-agent: curl/7.85.0
> accept: */*
> content-length: 19
> content-type: application/x-www-form-urlencoded
> 
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* We are completely uploaded and fine
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* Connection state changed (MAX_CONCURRENT_STREAMS == 128)!
* TLSv1.2 (OUT), TLS header, Supplemental data (23):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
* TLSv1.2 (IN), TLS header, Supplemental data (23):
< HTTP/2 204 
< date: Mon, 04 Dec 2023 11:06:13 GMT
< vary: Accept-Encoding
< x-server-hostname: vm-cluster-victoria-metrics-cluster-vminsert-8884b95b5-dwf7t
< strict-transport-security: max-age=15724800; includeSubDomains
< 
* Connection #0 to host metrics.sto-aws-stage left intact

Curl from the grafana-agent container:

root@mongo02:/# curl -v  -d "GOOG,1.23,4.56,NYSE" 'https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/import/csv?format=2:metric:ask,3:metric:bid,1:label:ticker,4:label:market'
*   Trying 10.200.167.249:443...
* Connected to metrics.sto-aws-stage (10.200.167.249) port 443 (#0)
* ALPN: offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=dev.2cash.ph
*  start date: Nov 27 09:18:38 2023 GMT
*  expire date: Feb 25 09:18:37 2024 GMT
*  subjectAltName: host "metrics.sto-aws-stage" matched cert's "*.sto-aws-stage"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: POST]
* h2h3 [:path: /insert/0/prometheus/api/v1/import/csv?format=2:metric:ask,3:metric:bid,1:label:ticker,4:label:market]
* h2h3 [:scheme: https]
* h2h3 [:authority: metrics.sto-aws-stage]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* h2h3 [content-length: 19]
* h2h3 [content-type: application/x-www-form-urlencoded]
* Using Stream ID: 1 (easy handle 0x55a53cb8d670)
> POST /insert/0/prometheus/api/v1/import/csv?format=2:metric:ask,3:metric:bid,1:label:ticker,4:label:market HTTP/2
> Host: metrics.sto-aws-stage
> user-agent: curl/7.88.1
> accept: */*
> content-length: 19
> content-type: application/x-www-form-urlencoded
> 
* We are completely uploaded and fine
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/2 204 
< date: Mon, 04 Dec 2023 11:08:55 GMT
< vary: Accept-Encoding
< x-server-hostname: vm-cluster-victoria-metrics-cluster-vminsert-8884b95b5-9vs5b
< strict-transport-security: max-age=15724800; includeSubDomains
< 
* Connection #0 to host metrics.sto-aws-stage left intact

No problem with endpoint =\

If you need any data about the compose file, Docker, the hosts, or anything else, let me know.

Steps to reproduce

Can't reproduce it on demand. I tried blocking the Promtail and Prometheus endpoints with iptables (see the sketch below): I got the same errors in the logs, but the utilization was not huge.
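Roughly what that blocking test looks like (the address is just the remote-write IP from the logs above; adjust to your endpoint):

# drop outbound traffic to the remote-write endpoint for a while, then remove the rule
iptables -I OUTPUT -d 10.200.173.249 -p tcp --dport 443 -j DROP
# ...observe agent CPU usage and logs...
iptables -D OUTPUT -d 10.200.173.249 -p tcp --dport 443 -j DROP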

System information

Linux mongo02-kz 6.0.7-301.fc37.x86_64

Software version

grafana agent v0.37.2

Configuration

Docker version:
Client:
 Version:           20.10.20
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        9fdeb9c
 Built:             Tue Oct 18 18:14:26 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.20
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       03df974
  Built:            Tue Oct 18 18:20:17 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d1
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Logs

No response

b0kky added the bug label Dec 5, 2023

b0kky commented Dec 5, 2023

For proof:
[image]


b0kky commented Dec 5, 2023

agent.zip
The Grafana Agent config is in the attachment.


b0kky commented Dec 11, 2023

Same problem after updating to agent v0.38.1:
huge CPU utilization by grafana-agent, and the host is unavailable during this time. In the logs:

ts=2023-12-07T20:57:00.80255145Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage/insert/0/prometheus/api/v1/write\": context deadline exceeded"
ts=2023-12-07T20:26:11.553839461Z caller=wal.go:551 level=info agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 msg="WAL checkpoint complete" first=676 last=677 duration=52.658019ms
ts=2023-12-07T20:55:00.391558109Z caller=dedupe.go:112 agent=prometheus instance=7e9527f8948db1b008afdfc2db12c2c4 component=remote level=warn remote_name=7e9527-2a7f5b url=https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write msg="Failed to send batch, retrying" err="Post \"https://metrics.sto-aws-stage.itrf.tech/insert/0/prometheus/api/v1/write\": context deadline exceeded"

The Prometheus endpoint is available during this time.
After a grafana-agent restart everything is OK.

Do you have any updates?


b0kky commented Dec 16, 2023

UPD1:
I was able to reproduce the problem.

VM1 with grafana-agent: Linux test-agent38 6.0.7-301.fc37.x86_64
Fedora 37 with 2 CPU / 2 GB RAM

Main filesystem device: /dev/nvme0n1p3 with btrfs

Grafana Agent version: 0.38.1

Tuned agent remote_write config (the other settings are the same as in the config posted earlier):
    remote_write:
    - queue_config:
        capacity: 5000
        max_samples_per_send: 1000
        max_shards: 100
        min_backoff: 30s
        max_backoff: 60s
      remote_timeout: ${GA_METRICS_REMOTE_WRITE_TIMEOUT}
      url: http://172.18.200.55:7480/insert/0/prometheus/api/v1/write


Used a metrics generator (https://github.com/grafana/fake-metrics-generator) to simulate the metrics insert rate (7 instances).

VM2: Linux 6.1.5-060105-generic
Ubuntu 22.04 with 14 CPU and 16 GB RAM

Used a VictoriaMetrics cluster to collect metrics from VM1 (remote-write address 172.18.200.55:7480).
All configuration is default, brought up with Docker as in https://github.com/VictoriaMetrics/VictoriaMetrics/blob/master/deployment/docker/docker-compose-cluster.yml

To reproduce (sketched below):

  1. On VM1, start grafana-agent with remote write pointed at the VM cluster endpoint, and start the fake metrics generator with 5 to 10 instances.
  2. After some time (about 15 minutes), stop the vm-storage instance on VM2.
  3. Wait for CPU utilization to climb on VM1, then start vm-storage on VM2 again and restart grafana-agent.
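A rough sketch of these steps (the fake-metrics-generator image and the compose service names are assumptions, so substitute whatever your setup uses):

# on VM2: bring up the VictoriaMetrics cluster from the linked compose file
docker compose -f docker-compose-cluster.yml up -d
# on VM1: start 5-10 generator instances next to grafana-agent (image name is a placeholder)
for i in $(seq 1 7); do docker run -d --name fakegen-$i <fake-metrics-generator-image>; done
# step 2, on VM2: stop the vm-storage service(s); names assumed from the compose file
docker compose -f docker-compose-cluster.yml stop vmstorage-1 vmstorage-2
# step 3: once CPU on VM1 is pegged, bring storage back and restart the agent
docker compose -f docker-compose-cluster.yml start vmstorage-1 vmstorage-2
docker restart grafana-agent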

What you will see: after the restart, grafana-agent utilizes 100% CPU trying to send the WAL to the VM cluster; after some time (8 minutes in my case) the agent state changes to unhealthy and it gets stuck with logs like:

ts=2023-12-16T07:20:52.422941501Z caller=scrape.go:1384 level=debug agent=prometheus instance=b710687750a0f7ce59b90fbfec79fcf9 component="scrape manager" scrape_pool=integrations/node_exporter target=http://127.0.0.1:12345/integrations/node_exporter/metrics msg="Scrape failed" err="Get \"http://127.0.0.1:12345/integrations/node_exporter/metrics\": context deadline exceeded"

ts=2023-12-16T07:20:20.520631353Z caller=logging.go:100 level=warn msg="GET /integrations/agent/metrics 16.674951294s, error: write tcp 127.0.0.1:12345->127.0.0.1:45314: write: broken pipe ws: false; Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1; Accept-Encoding: gzip; User-Agent: GrafanaAgent/v0.38.1 (static; linux; docker); X-Prometheus-Scrape-Timeout-Seconds: 10; "

Okay, so let's start vmstorage again: the WAL should be replicated to the remote storage and everything should be fine, but no. The agent stays in the unhealthy state and keeps utilizing 100% CPU.

The VictoriaMetrics dashboard shows periodic peaks of inserts, and then no inserts at all.
[image]

UPD2: Found an issue on the Prometheus GitHub about the time ticker: prometheus/prometheus#13111. In v2.47.2 Prometheus rolled out the readTicker change (in 2.45+ this behavior had been changed to actually rely on an external actor calling Notify when a new segment needs to be read):

readTicker := time.NewTicker(readTimeout)
readTicker := time.NewTicker(10 * time.Millisecond)
defer readTicker.Stop()

Grafana Agent from v0.37.2 uses this version of Prometheus, and that is probably what causes these problems. Could you check it?
So, the readTicker returned in the Prometheus 2.48 releases.
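Just to illustrate the difference in wake-up frequency between the two intervals quoted above, a self-contained Go sketch (illustration only, not agent code; the actual readTimeout constant in Prometheus is on the order of seconds, 15s if I remember correctly):

package main

import (
	"fmt"
	"time"
)

// countTicks reports how many times a ticker with the given interval fires
// within the observation window.
func countTicks(interval, window time.Duration) int {
	t := time.NewTicker(interval)
	defer t.Stop()
	deadline := time.After(window)
	n := 0
	for {
		select {
		case <-t.C:
			n++
		case <-deadline:
			return n
		}
	}
}

func main() {
	window := 2 * time.Second
	// ~200 wake-ups of the read loop in 2s with a 10ms ticker...
	fmt.Println("10ms ticker fired:", countTicks(10*time.Millisecond, window))
	// ...versus none in the same window with a readTimeout-sized (assumed 15s) ticker.
	fmt.Println("15s ticker fired:", countTicks(15*time.Second, window))
}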


ngnix commented Dec 18, 2023

Hello Grafana team!

Do you have any updates on this problem?


b0kky commented Dec 21, 2023

Opened an issue in the Prometheus repo: prometheus/prometheus#13317

mattdurham self-assigned this Jan 8, 2024
mattdurham (Collaborator) commented:

How many logs are you tracking with the sd? My thought is that log tracking is starving metrics. If you remove the logs section does the problem persist?


b0kky commented Jan 12, 2024

@mattdurham I don't know how many logs, but the Promtail component in the agent has some troubles too. While the Loki endpoint can't receive logs for some time, the Promtail component may use huge CPU. As a workaround we drop the records from the last 30 minutes, and when the Loki endpoint becomes available again everything is okay: logs get pushed and there is no huge utilization. But for metrics this doesn't work; the Prometheus component has no setting to drop the WAL log after some period. And the strangest thing is that even when the metrics endpoint is available again, grafana-agent still can't push metrics until it is restarted.
To reproduce it: run some fake metrics generators in containers and disable the remote-write endpoint for some time.


b0kky commented Jan 12, 2024

So grafana-agent seems like a killer for the VM: while remote write is not available, the Prometheus remote-write component keeps trying to push metrics and takes more and more resources, for example when your long-term storage is under maintenance. And to our surprise, even when the remote-write endpoint is working again, the Prometheus component keeps reporting errors; after a restart everything is okay and there are no errors.

mattdurham (Collaborator) commented:

Are you looking at memory usage reported by the operating system or by the metric go_memstats_heap_inuse_bytes? Golang will release memory to the OS when the OS comes under memory pressure, but otherwise will keep it allocated so it doesn't have to request it again. This is under the theory that unused memory is wasted memory.
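A quick way to compare the two, assuming the agent's own /metrics endpoint on port 12345 is reachable (which, per your report, it won't be during the incident itself):

# heap actually in use by the Go runtime vs. resident memory as the OS sees it
curl -s http://127.0.0.1:12345/metrics | grep -E '^(go_memstats_heap_inuse_bytes|process_resident_memory_bytes)'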

Do you know what your active series is?


b0kky commented Jan 12, 2024

The series are scraped by the cadvisor and node_exporter components, and in general the problem is CPU utilization by grafana-agent. I'm not sure, but it looks like a misunderstanding; you can see all the configs and diagnostics in my earlier posts.
BTW, go_memstats_heap_inuse_bytes can't be retrieved, because grafana-agent is not responding on port 12345 and I can't get profiles, metrics, or anything else.

mattdurham (Collaborator) commented:

Without getting a pprof/profile it's going to be hard to narrow down. I would try separating out your metrics and logs gathering to see which is causing the CPU spike.


cstyan commented Jan 16, 2024

Please show us the metrics for the number of shards running over time: prior to the issue, during the remote endpoint shutdown, and during the CPU spike.

It's likely that remote write is just using all the resources you've configured it to use / that it has available. Remote write will try to scale up when it notices a backlog of samples to send.
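If it helps, assuming the agent exposes the upstream Prometheus remote-write metrics under their usual names, the shard and backlog series I mean would be something like:

# current / desired / min / max shard counts per remote
curl -s http://127.0.0.1:12345/metrics | grep prometheus_remote_storage_shards
# samples waiting to be sent, useful to graph alongside the shard counts
curl -s http://127.0.0.1:12345/metrics | grep prometheus_remote_storage_samples_pending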


b0kky commented Jan 17, 2024

Set the grafana-agent default queue config:
capacity: 10000
max_samples_per_send: 2000
max_shards: 50

The sample rate exposed by the metrics generator is about 1.53K.
[image]

Yeah, you are right: after the remote storage goes down, the grafana-agent container comes under memory pressure, and once it hits the RAM limit the CPU gets high.
Another thing I noticed: if the storage comes back up while the agent is burning CPU, the agent does not send any metrics. In the logs I see:

caller=logging.go:100 level=warn msg="GET /integrations/cadvisor/metrics 17m33.998004222s, error: write tcp 127.0.0.1:12345->127.0.0.1:41862: i/o timeout ws: false; Accept: application/openmetrics-text;version=1.0.0,application/openmetrics-text;version=0.0.1;q=0.75,text/plain;version=0.0.4;q=0.5,*/*;q=0.1; Accept-Encoding: gzip; User-Agent: GrafanaAgent/v0.39.0 (static; linux; docker); X-Prometheus-Scrape-Timeout-Seconds: 10; "

This is the general problem: the agent does not answer on port 12345 and does not send any metrics after the storage is back up.
Note 2: if I disable all the metrics generators and leave the agent alone for some time, it goes back from unhealthy to healthy. I think that is due to the metrics queue.
I will try to get profiles from just before the incident and right after it.


b0kky commented Jan 18, 2024

Hi @mattdurham! I managed to grab pprofs right before the incident; a second later the debug API was no longer available.
test9.zip

mattdurham (Collaborator) commented:

Looking at the pprof, the CPU is primarily driven by the docker/cadvisor GetStats call. How many containers are being run? If you remove the cadvisor/docker integration, do you still see the issue?
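For that isolation test, something along these lines in the static-mode config should be enough (a sketch; exact placement depends on your agent.yaml):

integrations:
  cadvisor:
    enabled: false   # temporarily turn off the cadvisor integration for the test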


b0kky commented Jan 29, 2024

@mattdurham

How many containers are being run?

7 metrics-generator containers, plus grafana-agent. That is everything that is deployed.

If you remove the cadvisor/docker integration, do you still see the issue?

I checked, and no, it does not reproduce. I set cadvisor to enabled: false.


b0kky commented Jan 31, 2024

@mattdurham
I tried different cadvisor integration configs. Here is what I did (consolidated sketch below):

  1. Added a list of disabled metrics; my list is:
    • disk
    • diskIO
    • percpu
    • process
    • sched
    • tcp
    • udp
    • referenced_memory
    • hugetlb
    • cpu_topology
    • resctrl
    • cpuset
    • memory_numa
  2. disable_root_cgroup_stats: true
  3. docker_only: true
  4. storage_duration: 10s (30s or 5s made no difference)
  5. Tried scrape_integration: false, wanting to hack around it: with this, cadvisor does not write metrics to the remote_write queue but still exposes them, so I configured /integrations/cadvisor/metrics as a regular Prometheus target for the agent. It works, but I still get CPU spikes =\

After that, I disabled cadvisor in the agent config and added it as a separate container on the node, with the same configuration as in the agent, and I did not see the CPU problems. That looks like a resolution, but it is not comfortable: you need to deploy 2 containers instead of using the all-in-one grafana-agent.
Maybe there are some extra settings for the cadvisor exporter as part of the grafana-agent config?
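Consolidated, the combination above looks roughly like this in my agent config (a sketch; the name of the disabled-metrics option is from memory and may not be exact):

integrations:
  cadvisor:
    enabled: true
    docker_only: true
    disable_root_cgroup_stats: true
    storage_duration: 10s
    disabled_metrics:   # option name from memory, may not be exact
      - disk
      - diskIO
      - percpu
      - process
      - sched
      - tcp
      - udp
      - referenced_memory
      - hugetlb
      - cpu_topology
      - resctrl
      - cpuset
      - memory_numa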


b0kky commented Feb 12, 2024

@mattdurham hello, any updates?


b0kky commented Mar 20, 2024

UP


rfratto commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only be receiving bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent on Apr 11, 2024