
Add support for HPA in mimir-distributed #3430

Closed · jgutschon opened this issue Nov 10, 2022 · 15 comments · Fixed by #7282

@jgutschon (Contributor)

Is your feature request related to a problem? Please describe.

I have been running Grafana Mimir in our infrastructure using the Helm chart, and I'm currently not able to automatically scale Mimir in response to changes in load on the servers.

Describe the solution you'd like

It would be very helpful if Horizontal Pod Autoscalers were supported natively in the mimir-distributed helm chart.

Describe alternatives you've considered

I've attempted to add this manually, using custom manifests alongside the chart to create HPAs for each component (Distributor, Querier, etc.). However, this is not an optimal solution, since the replicas field on each Deployment and StatefulSet cannot be removed with the current templating. When using a custom HPA with the Helm chart, applying any change to one of the Deployments/StatefulSets causes a scale-down to the number specified by the replicas field, followed by a scale-up once the HPA has updated the desired replicas. In production this is not ideal, since it can terminate active pods and disrupt our cluster's monitoring.

As of now, my best solution is to fork the chart and add templating to omit this field, but it would be nice if this could be officially supported.
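
For illustration, here is a minimal sketch of the kind of custom HPA manifest I've been applying alongside the chart; the Deployment name assumes a release called "mimir" and the CPU target is only an example, not something the chart provides:

# Illustrative only: a standalone HPA applied next to the chart.
# The target name assumes a release called "mimir"; adjust to your install.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-distributor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Even with a manifest like this, the chart still renders the replicas field on the Deployment, which is what causes the scale-down/scale-up cycle on every upgrade.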

@Logiraptor (Contributor)

Thanks for opening this issue. I would be happy to review a PR for this if you have time.

At the moment, only two components have had enough testing with autoscaling to support officially: querier and distributor. We're working on autoscaling other components, but need time to properly test the stability in our production environments. You can see the Jsonnet implementation for distributor and querier here:

What do you think about opening a PR to start?

@Logiraptor (Contributor)

By the way, that jsonnet template is rendered here to plain yaml in case Jsonnet is not something you're familiar with:
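
For orientation, the rendered objects are KEDA ScaledObjects (KEDA then creates the HPAs) rather than plain HorizontalPodAutoscalers. A simplified sketch of the shape, with an illustrative metric and threshold rather than the exact values from the repo:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: querier
spec:
  scaleTargetRef:
    name: querier                # the querier Deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.example:9090
        # Illustrative: scale on in-flight queries reported by the query-scheduler
        query: sum(max_over_time(cortex_query_scheduler_inflight_requests[1m]))
        threshold: "6"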

@jgutschon (Contributor, Author)

Thanks, I'll consider opening a PR if I can find some time.

> At the moment, only two components have had enough testing with autoscaling to support officially: querier and distributor.

What kind of testing is usually done on these? Based on the scaling docs I was under the impression that any component could be scaled up and down safely (besides some exceptions for scaling down alertmanager, ingester, and store-gateway).

> By the way, that jsonnet template is rendered here to plain yaml in case Jsonnet is not something you're familiar with:

Not super familiar with Jsonnet so this is definitely helpful, thanks.

@Logiraptor (Contributor)

> What kind of testing is usually done on these?

We typically run these kinds of things in production at Grafana Cloud for at least a few weeks to uncover anything unexpected. That doesn't mean we can't start experimenting, but it would need to be marked experimental to let users know we haven't run it at scale yet.

> Based on the scaling docs I was under the impression that any component could be scaled up and down safely

You're right that most components can be scaled up and down. Specifically, I'm referring to autoscaling, which can mean scaling up or down much more often than a human operator ever would. For example, we've needed to fine-tune shutdown behavior for queriers. It's not a big deal to close a connection once in a while when a human scales down, but if a machine is scaling queriers every few minutes, that single connection-closed error can start to have a significant impact on service availability, even if it doesn't lead to data loss or any permanent issues.

@jgutschon (Contributor, Author)

Hi @Logiraptor, I've opened #4229 with some additions to create HPAs for the rest of the components since #4133 does not fully address this issue. Would love some feedback on this if you have some time to take a look, thanks.

@dimitarvdimitrov linked a pull request on Jul 19, 2023 that will close this issue.
@jmichalek132 (Contributor)

Hi, I'd like to ask: if I raised a PR that allowed not setting replicas in the Helm chart for components such as the distributor, querier, and query-frontend, would it be accepted? There doesn't seem to have been any recent work on the linked PR. I tried adding an HPA via the extra objects here: https://github.com/grafana/mimir/blob/main/operations/helm/charts/mimir-distributed/values.yaml#L3650. But since we use Argo CD, it reverts the changes made to replicas by the HPA; the same would also happen on every deploy when just using Helm. Allowing the replicas count to be left unset in the Helm chart would prevent this.

@dimitarvdimitrov (Contributor)

That makes sense to me @jmichalek132; I think we had the same problem with HPA migrations in Jsonnet. How would this be implemented? Perhaps a special value for replicas, null, that is detected in the distributor, querier, etc. templates so that the field is omitted? Or do you have something else in mind?

@jmichalek132 (Contributor)

> That makes sense to me @jmichalek132; I think we had the same problem with HPA migrations in Jsonnet. How would this be implemented? Perhaps a special value for replicas, null, that is detected in the distributor, querier, etc. templates so that the field is omitted? Or do you have something else in mind?

I think that sounds reasonable should I raise a PR for it?

@dimitarvdimitrov (Contributor)

yes, thank you!

@jmichalek132 (Contributor)

> yes, thank you!

So I did some testing.

I started with:

  {{- if ne .Values.distributor.replicas "null"  }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

but Helm complains when it's set to a number, since ne can't compare an integer with the string "null":

Error: template: mimir-distributed/templates/distributor/distributor-dep.yaml:11:9: executing "mimir-distributed/templates/distributor/distributor-dep.yaml" at <ne .Values.distributor.replicas "null">: error calling ne: incompatible types for comparison

So I tried:

  {{- if .Values.distributor.replicas }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

When replicas is a number, e.g. the default 1, it works.
When it's set to null, the replicas field is not rendered, which is what we want.
But when it's set to 0, the field is also omitted. That's not a common case, in my opinion, but it would break.
Would that be okay, or should I look for another way to do it?
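
One alternative that might handle the 0 case, as an untested sketch: use Sprig's kindIs to distinguish an explicit null from a number, so only a nil value omits the field:

  {{- /* Render replicas unless the value is explicitly null (nil) */}}
  {{- if not (kindIs "invalid" .Values.distributor.replicas) }}
  replicas: {{ .Values.distributor.replicas }}
  {{- end }}

With that, an explicit replicas: null in the values omits the field, while 0 and any other number is still rendered.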

@dimitarvdimitrov (Contributor)

Can you open a draft PR and we can move the discussion there? I think this is somewhat tangential to the issue discussed here. Alternatively, we can keep the discussion going in another issue.

@vaibhhavv

Hi @jmichalek132, @dimitarvdimitrov, any update on the progress of the autoscaling feature? The whole Mimir community is looking forward to it, as it will provide more stability to the components.

@dimitarvdimitrov (Contributor)

#4687 is the furthest we have gone with autoscaling in the Helm chart. That PR hasn't progressed much lately. This comment has some notes on what I see as the next steps for the PR: #4687 (comment). Help would be appreciated :)

@dimitarvdimitrov (Contributor)

Experimental support was added in #7282, so this issue was closed. There's follow-up work in #7368 and #7367 to promote it to stable.

@sojjan1337

+1
