
Add support for HPA in mimir-distributed #3430

Closed · jgutschon opened this issue Nov 10, 2022 · 15 comments · Fixed by #7282

@jgutschon (Contributor)

Is your feature request related to a problem? Please describe.

I have been running Grafana Mimir in our infrastructure using the Helm chart, and I'm currently not able to automatically scale Mimir in response to changes in load on the servers.

Describe the solution you'd like

It would be very helpful if Horizontal Pod Autoscalers were supported natively in the mimir-distributed helm chart.

Describe alternatives you've considered

I've attempted to add this manually, using custom manifests alongside the chart to create HPAs for each component (Distributor, Querier, etc.). However, this is not an optimal solution, since the replicas field on each Deployment and StatefulSet cannot be removed with the current templating. When using a custom HPA with the Helm chart, applying any change to one of the Deployments/StatefulSets causes a scale-down to the number specified by the replicas field, followed by a scale-up once the HPA has updated the desired replicas. In production this is not ideal, since it can terminate active pods and disrupt our cluster's monitoring.

As of now, my best solution is to fork the chart and add templating to omit this field, but it would be nice if this could be officially supported.
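
For illustration, here is a minimal sketch of the kind of custom HPA manifest I've been applying alongside the chart; the Deployment name assumes a release called "mimir" and the CPU target is only an example, not something the chart provides:

# Illustrative only: a standalone HPA applied next to the chart.
# The target name assumes a release called "mimir"; adjust to your install.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-distributor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Even with a manifest like this, the chart still renders the replicas field on the Deployment, which is what causes the scale-down/scale-up cycle on every upgrade.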

@Logiraptor (Contributor)

Thanks for opening this issue. I would be happy to review a PR for this if you have time.

At the moment, only two components have had enough testing with autoscaling to support officially: querier and distributor. We're working on autoscaling other components, but need time to properly test the stability in our production environments. You can see the Jsonnet implementation for distributor and querier here:

What do you think about opening a PR to start?

@Logiraptor (Contributor)

By the way, that jsonnet template is rendered here to plain yaml in case Jsonnet is not something you're familiar with:
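
For orientation, the rendered objects are KEDA ScaledObjects (KEDA then creates the HPAs) rather than plain HorizontalPodAutoscalers. A simplified sketch of the shape, with an illustrative metric and threshold rather than the exact values from the repo:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
spec:
  scaleTargetRef:
    name: querier                # the querier Deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.example:9090
        # Illustrative: scale on in-flight queries reported by the query-scheduler
        query: sum(max_over_time(cortex_query_scheduler_inflight_requests[1m]))
        threshold: "6"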

@jgutschon (Contributor, Author)

Thanks, I'll consider opening a PR if I can find some time.

> At the moment, only two components have had enough testing with autoscaling to support officially: querier and distributor.

What kind of testing is usually done on these? Based on the scaling docs I was under the impression that any component could be scaled up and down safely (besides some exceptions for scaling down alertmanager, ingester, and store-gateway).

> By the way, that jsonnet template is rendered here to plain yaml in case Jsonnet is not something you're familiar with:

Not super familiar with Jsonnet so this is definitely helpful, thanks.

@Logiraptor (Contributor)

> What kind of testing is usually done on these?

We typically run these kinds of things in production at Grafana Cloud for at least a few weeks to uncover anything unexpected. That doesn't mean we can't start experimenting, but it would need to be marked experimental to let users know we haven't run it at scale yet.

> Based on the scaling docs I was under the impression that any component could be scaled up and down safely

You're right that most components can be scaled up and down. Specifically, I'm referring to autoscaling, which can mean scaling up or down much more often than a human operator ever would. For example, we've needed to fine-tune shutdown behavior for queriers. It's not a big deal to close a connection once in a while when a human scales down, but if a machine is scaling queriers every few minutes, that single connection-closed error can start to have a significant impact on service availability, even if it doesn't lead to data loss or any permanent issues.

@jgutschon (Contributor, Author)

Hi @Logiraptor, I've opened #4229 with some additions to create HPAs for the rest of the components since #4133 does not fully address this issue. Would love some feedback on this if you have some time to take a look, thanks.

@dimitarvdimitrov linked a pull request on Jul 19, 2023 that will close this issue.
@jmichalek132 (Contributor)

Hi, I'd like to ask: if I raised a PR that allowed not setting replicas in the Helm chart for components such as the distributor, querier, and query-frontend, would it be accepted? There doesn't seem to have been any recent work on the linked PR. I tried adding an HPA via the extra objects here: https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L3650. But since we use Argo CD, it reverts the changes made to replicas by the HPA; the same would also happen on every deploy when just using Helm. Allowing the replicas count to be left unset in the Helm chart would prevent this.

@dimitarvdimitrov (Contributor)

That makes sense to me @jmichalek132; I think we had the same problem with HPA migrations in Jsonnet. How would this be implemented? Perhaps a special value for replicas, null, that is detected in the distributor, querier, etc. templates so that the field is omitted? Or do you have something else in mind?

@jmichalek132 (Contributor)

> That makes sense to me @jmichalek132; I think we had the same problem with HPA migrations in Jsonnet. How would this be implemented? Perhaps a special value for replicas, null, that is detected in the distributor, querier, etc. templates so that the field is omitted? Or do you have something else in mind?

I think that sounds reasonable should I raise a PR for it?

@dimitarvdimitrov (Contributor)

yes, thank you!

@jmichalek132 (Contributor)

> yes, thank you!

So I did some testing.

I started with:

  {{- if ne .Values.distributor.replicas "null"  }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

but Helm complains when it's set to a number, since ne can't compare an integer with the string "null":

Error: template: mimir-distributed/templates/distributor/distributor-dep.yaml:11:9: executing "mimir-distributed/templates/distributor/distributor-dep.yaml" at <ne .Values.distributor.replicas "null">: error calling ne: incompatible types for comparison

So I tried:

  {{- if .Values.distributor.replicas }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

When replicas is a number, e.g. the default 1, it works.
When it's set to null, the replicas field is not rendered, which is what we want.
But when it's set to 0, the field is also omitted. That's not a common case, in my opinion, but it would break.
Would that be okay, or should I look for another way to do it?
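
One alternative that might handle the 0 case, as an untested sketch: use Sprig's kindIs to distinguish an explicit null from a number, so only a nil value omits the field:

  {{- /* Render replicas unless the value is explicitly null (nil) */}}
  {{- if not (kindIs "invalid" .Values.distributor.replicas) }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

With that, an explicit replicas: null in the values omits the field, while 0 and any other number is still rendered.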

@dimitarvdimitrov (Contributor)

Can you open a draft PR and we can move the discussion there? I think this is somewhat tangential to the issue discussed here. Alternatively, we can keep the discussion going in another issue.

@vaibhhavv

Hi @jmichalek132, @dimitarvdimitrov, any update on the progress of the autoscaling feature? The whole Mimir community is looking forward to it, as it will provide more stability to the components.

@dimitarvdimitrov (Contributor)

#4687 is the furthest we have gone with autoscaling in the Helm chart. That PR hasn't progressed much lately. This comment has some notes on what I see as the next steps for the PR: #4687 (comment). Help would be appreciated :)

@dimitarvdimitrov (Contributor)

Experimental support was added in #7282, so this issue was closed. There's follow-up work in #7368 and #7367 to promote it to stable.

@sojjan1337

+1
