
helm: autoscaling migration procedure #7367

Open
Tracked by #7368
dimitarvdimitrov opened this issue Feb 12, 2024 · 9 comments · Fixed by #7431
Labels: helm, help wanted (Extra attention is needed), type/docs (Improvements or additions to documentation)

Comments

@dimitarvdimitrov
Contributor

dimitarvdimitrov commented Feb 12, 2024

#7282 added autoscaling to Helm as an experimental feature. This issue is about adding support in the Helm chart for a smooth migration and documenting the migration procedure.

Why do we need a migration?

Migrating to a Mimir cluster with autoscaling requires a few intermediate steps to ensure that there are no disruptions to traffic. The major risk is that enabling autoscaling also removes the replicas field from Deployments. If KEDA/HPA hasn't started autoscaling the Deployment yet, Kubernetes interprets the missing replicas field as 1 replica, which can cause an outage.

Migration in a nutshell

  1. Support a distributor.kedaAutoscaling.preserveReplicas: true field in the helm chart which keeps the replicas field in the rendered manifests (feat(helm): Adding preserveReplicas option for kedaAutoscaling to preserve replicas even if keda is enabled #7431)
  2. Set preserveReplicas: true and deploy the chart (see the sketch after this list).
  3. Wait a couple of minutes for the HPA to start running (perhaps there's a concrete way to make sure everything is set up properly?)
  4. Remove preserveReplicas from values.yaml and deploy the chart.
  5. Done
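
As a rough sketch of steps 2 and 4 (hedged: only the preserveReplicas field is named in this issue; the kedaAutoscaling.enabled key and the helm upgrade invocation are assumptions about the mimir-distributed chart, and only the distributor is shown):

# values.yaml during the migration (step 2); the same pattern would apply to
# the other autoscaled components (querier, query-frontend, ruler)
distributor:
  kedaAutoscaling:
    enabled: true
    # keep .spec.replicas in the rendered Deployment until the HPA owns it
    preserveReplicas: true

# deploy, wait for the HPA to start acting (step 3), then drop preserveReplicas
# from values.yaml (step 4) and deploy again, e.g.:
#   helm upgrade mimir grafana/mimir-distributed -f values.yaml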

Internal docs

I'm also pasting Grafana Labs-internal documentation that's specific to our deployment tooling with FluxCD. Perhaps it can be used by folks running FluxCD, or as a starting point for proper docs:

Autoscaling existing deployments

Who should use this and when

This tool and documentation describe how to transition from a manually set
number of replicas in a Kubernetes Deployment to using a Horizontal Pod
Autoscaler (HPA). They specifically highlight a pitfall of the migration and
how to work around it.

Not covered in this documentation is how to implement the HPA definition.
However, some examples can be found in existing jsonnet.

Problem

When autoscaling an existing Kubernetes Deployment there is a conflict between
what fluxcd wants to reconcile and what the HPA has set the replicas to.

Because flux continuously reconciles the desired state, we want to leave the
replicas undefined in the spec. Otherwise flux resets the number of replicas
back to the supplied amount, in conflict with whatever the autoscaler has set.
This causes a continuous loop: the HPA scales up or down, flux resets the
replicas, the HPA scales again, and so on.

Removing the replicas from the spec is straightforward enough. However, when
the updated spec is applied, Kubernetes notices the now-missing item in the
spec and therefore applies the default of 1 replica*.
Unless the HPA has already scaled the deployment, the next flux reconciliation
that removes .spec.replicas will scale the deployment down. Sometimes this is
not noticeable because the HPA then kicks in and scales back up, but in other
cases this can cause disruptions.

*This is not actually a bug despite the link; it is intended behaviour.

Solution

The goal is to be able to roll out autoscalers without the Deployment ever
being reset to 1 replica (even for a brief time). We also don't want to
maintain a state that is in conflict with reality (i.e. we still want our
jsonnet to be truthful and thus remove the replicas from the spec).

What is happening internally in Kubernetes has to do with server-side apply
and field ownership.
The TL;DR for us is to tell Kubernetes that the HPA owns the replicas field,
and to not perform any action when it sees the replicas disappear from the
spec applied by flux.

We have observed that if the HPA modifies the number of replicas, it will
take ownership of the .spec.replicas field; after which it is safe to remove
the replicas from our jsonnet definitions. However, if the HPA does not make
any modifications to the replicas (for example, when the correct number of
replicas already exists), or if you want to remove the jsonnet replicas
definition sooner, we need to tell Kubernetes who is managing the field.
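
One way to check this (an illustrative command, not part of the original doc;
the namespace and Deployment name are placeholders) is to list the current
field managers on the object:

kubectl --namespace mimir-dev-10 get deployment/distributor \
  -o json --show-managed-fields=true \
  | jq '.metadata.managedFields[] | {manager, operation}'

This shows whether the HPA controller has appeared as a manager alongside
flux's kustomize-controller.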

We do this by removing the replicas from flux's managedFields metadata on the
Kubernetes object. This makes Kubernetes think that the replicas were never a
part of the spec (at least from flux), and thus the next time flux is applied
without the replicas defined it does not default back to 1. (This is because
the HPA is now the only owner of the field).
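
For illustration (abridged and approximate; the exact entry varies per
object), flux's managedFields entry that the helper script edits looks
roughly like this:

metadata:
  managedFields:
    - manager: kustomize-controller   # flux
      operation: Apply
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          f:replicas: {}   # removing this hands ownership of replicas to the HPA
          f:template:
            # ...the other fields flux continues to own...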

The helper script remove_managed_replicas.sh assists with editing this
metadata.

Process for switching to autoscalers

The process for switching an existing Deployment to use an autoscaler is as
follows:

  1. Add a flux ignore to the object. For example:
    ./scripts/flux/ignore.sh --context dev-us-east-0 --namespace cortex-dev-05 --kind Deployment --reason "Transitioning to HPA".
  2. Open and merge a PR with the change that installs the HPA (potentially via
    KEDA), as well as removes the .spec.replicas from the object.
  3. Remove the replicas from flux's managedFields metadata using
    remove_managed_replicas.sh (see below). If the HPA has already scaled the
    object (in either direction), the field will already be correctly updated.
    Either way, it is still safe to run the script which will confirm this for you
    as a no-op.
  4. Tidy up by removing the flux ignore on the object.
  5. When flux reconciles afterwards it should be a no-op, as .spec.replicas is
    not defined and the spec is otherwise unchanged on the object.
  6. Verify that flux has run and has not defaulted the Deployment back to 1
    replica.
  7. Verify that the HPA is deployed and working as expected (see the example
    commands after this list).
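
A hedged sketch of the verification in steps 6 and 7 (context, namespace and
object names are placeholders reused from the example above):

# step 6: confirm the Deployment was not defaulted back to 1 replica
kubectl --context dev-us-east-0 --namespace cortex-dev-05 get deployment/distributor \
  -o jsonpath='{.spec.replicas}{"\n"}'

# step 7: confirm the HPA exists and reports current/desired replicas
kubectl --context dev-us-east-0 --namespace cortex-dev-05 get hpa

# optional: confirm flux (kustomize-controller) no longer owns .spec.replicas
kubectl --context dev-us-east-0 --namespace cortex-dev-05 get deployment/distributor \
  -o json --show-managed-fields=true \
  | jq '.metadata.managedFields[] | select(.fieldsV1."f:spec"."f:replicas") | .manager'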

How to use remove_managed_replicas.sh

Usage: ./remove_managed_replicas.sh [ -c | --context ]
                                    [ -n | --namespace ]
                                    [ -o | --object ]
                                    [ -d | --dry-run ]
                                    [ -h | --help  ]
Outputs a diff of changes made to the object.

Example:

./remove_managed_replicas.sh -c dev-eu-west-2 -n mimir-dev-10 -o Deployment/distributor --dry-run

The script will output a diff of the changes made to the object. (If --dry-run
is set, no actual changes are applied).

remove_managed_replicas.sh

#!/usr/bin/env bash

set -euo pipefail

help()
{
    echo "Usage: ./remove_managed_replicas.sh [ -c | --context ]
                                    [ -n | --namespace ]
                                    [ -o | --object ]
                                    [ -d | --dry-run ]
                                    [ -h | --help  ]
Outputs a diff of changes made to the object.
    "
    exit 2
}

VALID_ARGUMENTS=$#

if [ "$VALID_ARGUMENTS" -eq 0 ]; then
  help
fi

CONTEXT=""
NAMESPACE=""
OBJECT=""
DRY_RUN=false
DRY_RUN_ARG=""

while [ "$#" -gt 0 ]
do
  case "$1" in
    -c | --context )
      CONTEXT="--context=${2}"
      shift 2
      ;;
    -n | --namespace )
      NAMESPACE="$2"
      shift 2
      ;;
    -o | --object )
      OBJECT="$2"
      shift 2
      ;;
    -d | --dry-run )
      DRY_RUN=true
      DRY_RUN_ARG="--dry-run=server"
      shift 1
      ;;
    -h | --help)
      help
      ;;
    --)
      shift;
      break
      ;;
    *)
      echo "Unexpected option: ${1}"
      help
      ;;
  esac
done

if [ -z "${NAMESPACE}" ]
then
  echo "Must supply a namespace."
  exit 1
fi

if [ -z "${OBJECT}" ]
then
  echo "Must supply a kubernetes object (such as \`-o Deployment/example\`)."
  exit 1
fi

KC="kubectl ${CONTEXT} --namespace=${NAMESPACE}"

# Capture the object (including managedFields) before patching so we can diff afterwards.
BEFORE=$(${KC} get "${OBJECT}" -o yaml --show-managed-fields=true)
BEFORE_JSON=$(${KC} get "${OBJECT}" -o json --show-managed-fields=true)

# Find the index of flux's (kustomize-controller) entry in managedFields.
INDEX=$(echo "${BEFORE_JSON}" | jq '.metadata.managedFields | map(.manager == "kustomize-controller") | index(true)')

# Check we can find the position of flux's entry in managedFields:
if ! [[ $INDEX =~ ^[0-9]+$ ]]
then
  echo "Unable to find \`kustomize-controller\` (flux) in the managedFields metadata for ${OBJECT}."
  echo "Has flux not ran on this object before?"
  echo "This may happen if you have deployed the object manually."
  echo "It should be safe to continue removing the flux-ignore as the object was never managed by flux."
  exit 1
fi

# Check that `.spec.replicas` is set in the managedFields entry
CHECK=$(echo "${BEFORE_JSON}" | jq ".metadata.managedFields[${INDEX}].fieldsV1.\"f:spec\".\"f:replicas\"")

if [ "${CHECK}" = "null" ]
then
  echo "Unable to find \`.spec.replicas\` set in the managedFields metadata for \`kustomize-controller\`."
  echo "Has the field already been unset?"
  echo "This may happen if the HPA has already scaled the object."
  echo "It is safe to continue removing the flux-ignore on the object."
  exit 1
fi

# Remove flux's claim on .spec.replicas so the HPA becomes the sole owner of the field.
AFTER=$(${KC} patch "${OBJECT}" -o yaml --show-managed-fields=true ${DRY_RUN_ARG} --type='json' -p "[{'op': 'remove', 'path': '/metadata/managedFields/${INDEX}/fieldsV1/f:spec/f:replicas'}]")

diff -u <(echo "${BEFORE}") <(echo "${AFTER}") || :

if [ "${DRY_RUN}" = true ]
then
  echo ""
  echo "Dry run only. No changes have been applied."
fi

exit 0

@dimitarvdimitrov dimitarvdimitrov added type/docs Improvements or additions to documentation helm help wanted Extra attention is needed labels Feb 12, 2024
@dimitarvdimitrov
Contributor Author

dimitarvdimitrov commented Feb 13, 2024

@beatkind tagging you because I don't think I can assign you without you participating in the issue. Can you assign yourself by any chance? From the GitHub docs:

You can assign multiple people to each issue or pull request, including yourself, anyone who has commented on the issue or pull request, anyone with write permissions to the repository, and organization members with read permissions to the repository. For more information, see "Access permissions on GitHub."

@beatkind
Contributor

beatkind commented Feb 13, 2024

@dimitarvdimitrov And I need to actively write here :) to be participating - nope, I am not able to assign myself because I do not have any permissions inside the repo.

@dimitarvdimitrov
Contributor Author

@beatkind I saw you marked this as fixed by #7431. I'd like to keep this issue open until we also document the migration procedure. Migration is now technically possible, but it's not a very user-friendly process because users need to figure out the steps themselves.

@beatkind
Contributor

beatkind commented Apr 2, 2024

@dimitarvdimitrov thanks for reopening, this was simply a mistake :) - I will add some documentation with my next PR

@KyriosGN0
Contributor

hey @beatkind @dimitarvdimitrov, I just tried to follow the "Migration in a nutshell" guide and found 2 issues:

  1. after I removed preserveReplicas, my distributor, querier and query-frontend (we don't run rulers) all got scaled to 1
  2. after a couple of minutes they scaled back up to normal, but the distributor was messed up: out of 18 replicas only one got traffic. the k8s services did list all of the pods correctly and I didn't see errors in any of the pods; my guess is that the sudden scale-up messed up the ring/memberlist stuff

@dimitarvdimitrov
Contributor Author

1. after I removed preserveReplicas, my distributor, querier and query-frontend (we don't run rulers) all got scaled to 1

how long after deploying the HPA did this happen? Was the HPA already working?

2. after a couple of minutes they scaled back up to normal, but the distributor was messed up: out of 18 replicas only one got traffic. the k8s services did list all of the pods correctly and I didn't see errors in any of the pods; my guess is that the sudden scale-up messed up the ring/memberlist stuff

this looks like a problem with load balancing. The ring doesn't determine distributor load. Perhaps the reverse proxy in front of the distributors didn't get the update quickly enough (or DNS was cached, etc.). What reverse proxy are you using - is it the nginx from the chart? Did this resolve itself eventually?

@KyriosGN0
Contributor

  1. almost immediately (up to maybe a minute later), and yes, the HPA was working
  2. I used a Service of type LoadBalancer (running on GKE), so yeah, I guess the GKE LB didn't route this correctly; I solved it manually by the only pod that received traffic

@dimitarvdimitrov
Contributor Author

  1. almost immediately (up to maybe a minute later), and yes the HPA was working

I'd expect that doing it after the pollingInterval of 10s is enough for KEDA's HPA to take over, but I can't be sure. We should retest this procedure before publishing the docs.
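
A hedged way to check that KEDA's HPA is in place before removing preserveReplicas (an illustrative sketch; namespace and resource names are placeholders, and KEDA creates the HPA from the ScaledObject):

kubectl --namespace mimir-dev-10 get scaledobjects.keda.sh
kubectl --namespace mimir-dev-10 get hpa

Once the HPA reports targets and replicas for the Deployment, removing preserveReplicas should be safe, but as said above this needs to be retested.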

@beatkind
Contributor

beatkind commented Aug 2, 2024

  1. almost immediately (up to maybe a minute later), and yes the HPA was working

I'd expect that doing it after the pollingInterval of 10s is enough for KEDA's HPA to take over, but I can't be sure. We should retest this procedure before publishing the docs.

Good point, will validate it again with a test env
