Mimir HPA/autoscaling #3379

Closed · 2 tasks · Tracked by #3039
QuentinBisson opened this issue Apr 2, 2024 · 6 comments
@QuentinBisson

QuentinBisson commented Apr 2, 2024

Motivation

One reason we switched to Mimir is that it is far better in terms of scalability and reliability. To take full advantage of that, though, we have to add autoscaling to our Mimir instances. So let's learn how to do this.

Todo

  • Introduce autoscaling for Mimir to scale horizontally
  • Check for feasible thresholds and limits, and test them

Outcome

  • We know how to set up autoscaling for Mimir, and what does and doesn't make sense for achieving Mimir high availability.
  • We have a working PoC of autoscaling Mimir that we can decide to keep or scrap.
@QuantumEnigmaa

As mentioned in this issue, the Mimir chart currently doesn't have an HPA for any component except the gateway (which is stateless). It does support KEDA, though.

Currently, the scale-down process must be done manually: remove the desired replica from the ingress so that it doesn't receive any new data, flush all of its data to object storage, and then finally delete it. But as with any manual process, this could take longer than expected, especially if we end up having to regularly scale the workload up and down on several installations.

Because of this, I would suggest we contribute upstream to add an HPA for the ingesters in the right way, following the approach taken for Loki: grafana/loki#8684
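For reference, here is a minimal sketch of what KEDA-based autoscaling for a stateless Mimir component such as the distributor could look like. It assumes KEDA is installed and a Prometheus endpoint is reachable; the target name, query, and threshold are illustrative, not values from the chart:

```yaml
# Hypothetical KEDA ScaledObject for the Mimir distributor (illustrative values only).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    name: mimir-distributor          # Deployment to scale (assumed name)
  minReplicaCount: 3
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090  # assumed Prometheus URL
        # Example query: CPU usage of the distributor containers
        query: sum(rate(container_cpu_usage_seconds_total{container="distributor"}[5m]))
        threshold: "4"               # target value per replica; to be tuned via load testing
```

KEDA creates and manages the underlying HPA itself, so no separate HPA manifest is needed for the component.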

@QuantumEnigmaa

PR for the distributor created: grafana/mimir#7839

Now heading for the ingester PR.

@QuantumEnigmaa QuantumEnigmaa self-assigned this Apr 8, 2024
@QuantumEnigmaa

PR for the ingester: grafana/mimir#7843

@QuantumEnigmaa

PR for the querier: grafana/mimir#7870

@QuantumEnigmaa

Updates concerning the upstream PRs:

  • Ingester: the PR is blocked because other contributors are still laying the foundations needed to perform horizontal autoscaling of the ingester. It is therefore not currently possible to add an HPA for the ingester, and we'll need to wait for that work to progress.

  • Querier: the PR was rejected because, according to the upstream maintainers, KEDA is already offered as a horizontal autoscaling solution, so adding yet another one in the form of the basic HPA is unnecessary and would only bring additional maintenance work upstream. They therefore prefer users to rely on KEDA for horizontal autoscaling of this component.

  • Distributor: no activity on this PR yet, but since the component is similar to the querier in terms of autoscaling (i.e. it already relies on KEDA), I expect the outcome will be the same.

Concerning the gateway component, though, since the basic HPA is already supported upstream, I enabled it on golem to check whether its deployment would go smoothly, and it did. However, if we want to be sure we can rely on it, we should run some load testing on the gateway while the HPA is deployed. @Rotfuks is taking care of creating a dedicated issue.
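For illustration, a basic autoscaling/v2 HPA of the kind the chart enables for the gateway might look roughly like this; the deployment name, replica bounds, and CPU target are assumptions to be validated by the load test, not values taken from the chart:

```yaml
# Hypothetical HPA for the stateless Mimir gateway, scaling on CPU (illustrative values only).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-gateway              # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75     # target to be confirmed by load testing
```

Being stateless, the gateway can scale down without the flush-and-remove steps the ingesters need, which is why the plain HPA is enough here.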

@QuantumEnigmaa

Since the upstream PRs have been rejected, we decided to create our own HPAs for the distributor and the querier and have those enabled by default.
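As a sketch of what such an in-house HPA could look like (shown for the distributor; the querier one would be analogous), assuming CPU- and memory-based targets; the names, bounds, and percentages are illustrative, not the values we shipped:

```yaml
# Hypothetical in-house HPA for the Mimir distributor, scaling on CPU and memory (illustrative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mimir-distributor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mimir-distributor          # assumed Deployment name
  minReplicas: 2
  maxReplicas: 15
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```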
