Increase resource defaults for monitoring stack #551

blancharda · 2024-07-08T19:26:07Z

Is your feature request related to a problem? Please describe.

The monitoring stack (Prometheus, Grafana, Loki, etc.) has enough resources to start, but often struggles when scaled beyond a single node or with higher-volume workloads. We should consider updating the default values and providing clear guidance on suggested overrides for various deployment scales/sizes.

Additional context

Prometheus in particular seems to struggle even with relatively small workloads.

mjnagel · 2024-07-16T14:37:12Z

This may be a good reason to evaluate scalable/HA Grafana + Prometheus. For reference, on DUBBD in the past we had tickets for HPAs on those two and noted necessary external dependencies:

Loki itself defaults to a scalable mode (but single replica) with no resource limits/requests.
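As a rough illustration of why missing limits/requests matter: Kubernetes assigns each pod a QoS class based on its containers' resources, and a pod with none set at all lands in `BestEffort`, the first class to be evicted under node pressure. The sketch below is a simplified model of that rule, not Kubernetes' own code; `qos_class` is a hypothetical helper and it compares quantity strings literally (so `"1"` and `"1000m"` would be treated as different).

```python
def qos_class(containers):
    """Classify a pod's QoS class from its containers' resources.

    Each container is a dict shaped like a Kubernetes `resources` stanza:
    {"requests": {...}, "limits": {...}}; either key may be missing.
    Simplified: quantities are compared as plain strings.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# A container with no requests or limits, as described for Loki above.
print(qos_class([{}]))
# Hypothetical values: a memory request below the memory limit.
print(qos_class([{"requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}}]))
```

Setting even a modest request moves the pod out of `BestEffort`, which is one reason a "resources as a first pass" override is a sensible starting point.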

I think we should definitely:

- Document the overrides for scaling these up (resources as a first pass, replicas/HPA as we further explore/support those).
- Identify any upstream guidance on sizing/scaling for each.
- As we gather more data from CI/staging environments, also document our own suggested sizing based on unique core needs.

…713)

## Description

Document added for resource/HA overrides across core packages. Also ~doubles Prometheus' limits, but does not adjust the requests. This should ensure that Prometheus still schedules without requiring significant resources, but also allows it to consume more memory without hitting OOM errors.

## Related Issue

Related to #551

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [x] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [x] [Contributor Guide](https://github.com/defenseunicorns/uds-template-capability/blob/main/CONTRIBUTING.md) followed
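The approach described in the commit message (raise limits, leave requests alone) can be sketched as a small transformation on a Kubernetes-style `resources` stanza. This is only an illustration: the helper names and the `512Mi` starting values are hypothetical and are not the values used in the actual change, and the parser handles only whole-number `Ki`/`Mi`/`Gi` quantities.

```python
import re

# Binary suffixes used by Kubernetes memory quantities (subset).
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory(quantity):
    """Convert a quantity such as '512Mi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def format_memory(num_bytes):
    """Render bytes using the largest suffix that divides evenly."""
    for suffix in ("Gi", "Mi", "Ki"):
        if num_bytes % _UNITS[suffix] == 0:
            return f"{num_bytes // _UNITS[suffix]}{suffix}"
    return str(num_bytes)


def double_memory_limit(resources):
    """Return a copy with the memory limit doubled and requests untouched."""
    limit = parse_memory(resources["limits"]["memory"])
    return {
        "requests": dict(resources["requests"]),
        "limits": {**resources["limits"], "memory": format_memory(limit * 2)},
    }


# Hypothetical starting values, for illustration only.
current = {"requests": {"memory": "512Mi"}, "limits": {"memory": "512Mi"}}
print(double_memory_limit(current))
```

Because the scheduler places pods based on requests, the pod stays as easy to schedule as before; only the ceiling at which the container is OOM-killed moves up.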

mjnagel · 2024-08-30T17:25:01Z

Closing this as initial documentation has been merged. Opening a few follow on tickets with narrower scope out of other identified pieces here.

blancharda added the enhancement New feature or request label Jul 8, 2024

blancharda mentioned this issue Jul 8, 2024

Add Default Grafana Dashboards #207

Closed

mjnagel added the monitoring Issues related to monitoring components / resources label Jul 8, 2024

mjnagel added this to the 0.27.0 milestone Aug 21, 2024

mjnagel self-assigned this Aug 28, 2024

mjnagel mentioned this issue Aug 28, 2024

chore: update resources for prometheus, document resource overrides #713

Merged

5 tasks

mjnagel mentioned this issue Aug 30, 2024

Provide suggested configurations for small/medium/large deployments #715

Open

mjnagel closed this as completed Aug 30, 2024

mjnagel mentioned this issue Aug 30, 2024

Support/test Grafana HA #716

Closed
