
Helm: Support creation of HorizontalPodAutoscaler #4229

Closed
wants to merge 23 commits into from

Conversation

jgutschon
Contributor

@jgutschon jgutschon commented Feb 11, 2023

What this PR does

This adds a generic template for a HorizontalPodAutoscaler under templates/lib/hpa.tpl along with templating to generate HPAs for the following components:

  • alertmanager
  • compactor
  • distributor
  • ingester
  • querier
  • query-frontend
  • query-scheduler
  • store-gateway

The existing HPAs for the gateway and the deprecated nginx pods have also been migrated to use this template. Notable changes from this migration: the behavior field may now be used, and the metrics target types have been changed to type: Utilization, since these scale based on the percentage utilization of the metric. The requirement for the autoscaling/v2 API has also been relaxed to Kubernetes 1.23+ instead of 1.25+, since this API version is GA as of 1.23.
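As a sketch (the component name, thresholds, and behavior values below are illustrative assumptions, not values from this PR), an HPA rendered by such a template might look like:

```yaml
# Hypothetical rendering for the distributor; numbers are placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-distributor
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization          # percentage of the pod's CPU request
          averageUtilization: 80
  behavior:                          # usable now that autoscaling/v2 is required
    scaleDown:
      stabilizationWindowSeconds: 600
```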

When using this with zone-aware components, individual HPAs are created per zone, and the hpa.minReplicas value is used instead of replicas in the zoneAwareReplicationMap template to ensure an even spread of pods at the minimum replica count.

Which issue(s) this PR fixes or relates to

Fixes #3430

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@CLAassistant

CLAassistant commented Feb 11, 2023

CLA assistant check
All committers have signed the CLA.

@jgutschon jgutschon changed the title feat(helm): Hpa autoscaling Helm: Support creation of HorizontalPodAutoscaler Feb 11, 2023
@TaylorMutch

How would autoscaling the ingesters work with zone awareness enabled?

@jgutschon
Contributor Author

> How would autoscaling the ingesters work with zone awareness enabled?

An HPA will be created for each zone if zone awareness is enabled, and it will then scale only within that zone based on the averageMemoryUtilization and averageCpuUtilization values set on the HPA. I've used this templating successfully in our own deployment of Mimir, with some extra behavior settings to avoid pod thrashing. For the ingester specifically, I set the scaleDown policy to disabled and instead handle downsizing manually, due to the steps outlined here. It's still useful for us to have it scale up automatically, however, so I've included it here.
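For reference, the kind of behavior settings described here could look like the following sketch (the numbers are illustrative assumptions, not the values used in this PR); `selectPolicy: Disabled` turns off scale-down entirely:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 1              # add at most one pod...
        periodSeconds: 120    # ...every two minutes, to avoid thrashing
  scaleDown:
    selectPolicy: Disabled    # ingester scale-down handled manually instead
```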

@TaylorMutch

So just to confirm: if one of the StatefulSets scales up by one, does it automatically scale up the other two (assuming 3 zones)? Otherwise you might have an imbalance.

@jgutschon
Contributor Author

Since each zone would have its own HPA, they will only scale within the respective zone, so yes, it could become imbalanced. Realistically though, if one zone had to scale up due to high memory usage, the others would likely scale as well since the resource usage is almost the same across each zone. Either way, I didn't think this would be a problem since the docs just recommend balancing zones to ensure resource utilization is also balanced.

@TaylorMutch

Likely true; makes sense

@Logiraptor
Contributor

@jgutschon Thanks for all this work! I've been out of office this week, so I've made a note to follow up when I'm back next week. I haven't finished completely reviewing / testing everything yet, but so far here are my thoughts in no particular order:

  • Internally we autoscale queriers based on the query scheduler queue size, since adding queriers won't help if the work to do is not shardable. Of course, this complicates things because now you need to query Mimir's own metrics, which means they need to be stored somewhere, and ideally that's not Mimir itself so you don't have a cascading failure.
  • I'm still a little nervous about autoscaling ingesters, specifically because automatically scaling down will lead to lost data by default. I would want to explicitly prevent downscaling somehow rather than relying on the users to configure it correctly. I think many users will see autoscaling and just flip all the booleans to true without reading deeper into it.
  • I noticed some fields changed in the gateway autoscaling; this is a breaking change, and I'm wondering if we can keep the old naming to avoid that?
  • Thanks a ton for adding docs!
  • Have you tested the migration from the current state to autoscaling? I think, depending on how it all works out, you can sometimes end up scaled to 0 temporarily, which would be bad for ingesters of course.
  • Depending on how shuffle sharding is configured, scaling up won't always make use of those extra replicas. This applies to several parts of the system, and I'm a little nervous about that edge case biting users over time. I wonder how we can avoid confusion here 🤔

Also tagging @krajorama and @dimitarvdimitrov in case they have thoughts :)

@jgutschon
Contributor Author

Hi @Logiraptor, thanks for taking a look. I've reverted the naming for the gateway autoscaling fields and disabled the scaleDown policy for the ingester HPA. I've also added some of the scaling behavior settings that we have been using in our own custom deployment of Mimir to prevent pod thrashing.

For the querier autoscaling, I understand that the Jsonnet autoscaling makes use of KEDA to scale based on this queue size as well as some other metrics, such as in this template. I was tentatively planning to add support for KEDA ScaledObjects in a later PR if this one was accepted, but I figured this would be a good baseline for that and it would still be useful to have autoscaling based on the memory and/or CPU for now.
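For context, a KEDA ScaledObject along those lines might look like this sketch; the metric, threshold, and Prometheus address are assumptions for illustration, not part of this PR:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mimir-querier
spec:
  scaleTargetRef:
    name: mimir-querier       # the querier Deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: prometheus
      metadata:
        # metrics should live outside Mimir itself to avoid cascading failure
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(cortex_query_scheduler_queue_length)
        threshold: "5"
```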

I've just tested migrating a little bit now, and I'm seeing that it scales down to 1 replica momentarily after the HPA is created. As far as I understand, a new revision of the Deployment or StatefulSet needs to be created at the same time as the HPA, omitting the replicas field, to prevent this behavior, but I'm not 100% sure. I'll investigate more and see if this can be avoided another way, since changing the replicas won't trigger a new revision to be created.

@krajorama krajorama self-requested a review February 26, 2023 18:03
@krajorama krajorama added the helm label Feb 27, 2023
@hajowieland

Hi there, and thanks for your great work @jgutschon.
Any updates when this will be merged @krajorama @Logiraptor ?

@Logiraptor
Contributor

@hajowieland No ETA yet. Currently we need to find a solution to the migration process. It's not viable to release this without one - many users including paying customers of Grafana Enterprise Metrics use this chart in production. The expectation is that all upgrades can be done with zero downtime. Even a momentary scale to 1 replica will cause a complete outage.

I think we could merge it sooner if it includes only querier and distributor autoscaling, since, as I mentioned previously, these two are already in production use at scale internally; they're stateless and cheap to spin up / down.

Auto scaling everything without extensive testing is risky because while all components are designed to scale horizontally, that doesn't necessarily mean they can be scaled often. Just to take a random example: compactors tend to have spiky demand as new blocks are created. So you may think that compactors can autoscale up every two hours, then back down in between. This might work, but it needs to be tested at scale. Compactors only re-plan the job sharding every so often (~ every 1h), so they will not immediately distribute work if you add 10 compactors for example. More likely, they will all compact the same blocks at first, then after 1h they will notice that more compactors have joined. So if you autoscale up and down constantly, I don't think they will make efficient use of resources.

That's just one example - each component needs to be tested individually at scale.

@pstibrany
Member

The CHANGELOG has just been cut to prepare for the next Mimir release. Please rebase onto main and, if needed, move the CHANGELOG entry added / updated in this PR to the top of the CHANGELOG document. Thanks!

@pracucci
Collaborator

@jgutschon @krajorama Hi! I'm checking draft PRs. What's the state of this PR? Should we move forward or close it?

@jgutschon
Contributor Author

jgutschon commented Oct 11, 2023

Hi @pracucci, taking another look at this now, it seems that there are some k8s docs on the migration issues mentioned above: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#migrating-deployments-and-statefulsets-to-horizontal-autoscaling

I've just tried the steps in the client-side option after enabling the HPA and it worked successfully without scaling down the pods.
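The client-side approach in those docs boils down to re-applying the workload manifest with spec.replicas removed, so the HPA becomes the sole owner of the replica count. A sketch (the names below are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mimir-distributor
spec:
  # replicas: 3   <- deleted from the saved manifest before `kubectl apply`;
  #                  the HPA then manages the count without a scale-to-default dip
  selector:
    matchLabels:
      app: mimir-distributor
```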

Worth noting here as well that I had originally solved this with ArgoCD by ignoring the replicas field entirely, using the following config on the Application manifest; however, this isn't a generalized solution for Helm users:

```yaml
ignoreDifferences:
  - group: apps
    kind: Deployment
    managedFieldsManagers:
      - kube-controller-manager
```

I'll see if I have some time in the next few days to update the docs with the extra migration steps and address some of @Logiraptor's points, would be happy if this could still be merged.

@dimitarvdimitrov
Contributor

Worth noting that there's another PR attempting to introduce an HPA into the Helm chart: #4687. That one uses KEDA and the same autoscaling queries that the jsonnet library uses. I'm not sure if @davidspek is still available to work on it.

I think it will be easier to maintain and support if all the deployment tooling used similar autoscaling methods.

As @Logiraptor mentioned, autoscaling of different components can be very different. We're adding autoscaling of more components to jsonnet and internally testing it out extensively. Perhaps starting with some already tested components like the distributor, querier, query-frontend is a good starting point.

Regarding migrations: I think we might have some (internal?) docs for how to do the migration to KEDA autoscaling that we can share

@pracucci
Collaborator

> worth noting that there's another PR attempting to introduce HPA into the helm chart #4687

I'm closing this PR in favour of #4687, which looks to have seen more progress recently, but please feel free to re-open this PR if work resumes.

@pracucci pracucci closed this Jan 18, 2024
Successfully merging this pull request may close these issues.

Add support for HPA in mimir-distributed
9 participants