
Add support for health check when draining nodes #1699

Closed
omerlh opened this issue Jan 6, 2020 · 10 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed), kind/feature (New feature or request), priority/backlog (Not staffed at the moment. Help wanted.)

Comments


omerlh commented Jan 6, 2020


Why do you want this feature?
When draining nodes on a production cluster, it might be safer to run a health check between node/node group draining loops, to ensure that everything has gone well so far. If the health check fails, the draining process should stop. An example health check: how many ready pods exist out of all pods. If that number is below a given threshold, wait a few seconds and check again before continuing.

What feature/behavior/change do you want?
Allow specifying a health check command or, as an MVP, a time to wait between nodes/node groups, maybe something like:

eksctl drain --wait-time 15s

Or, with a health check:

eksctl drain --health 'kubectl get pods --all-namespaces | grep -v Running'
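
In the meantime, something close to this can be approximated with a wrapper script around kubectl. A rough sketch only, not an eksctl feature: it assumes kubectl is configured for the cluster, that nodes carry the alpha.eksctl.io/nodegroup-name label, and the node group name, threshold and wait time below are placeholders:

#!/usr/bin/env bash
# Sketch only: drain one node at a time and run a ready-pod health check between nodes.
set -euo pipefail

NODEGROUP="ng-1"   # hypothetical node group name
THRESHOLD=90       # minimum percentage of Running pods required to continue
WAIT=15            # seconds to wait before re-checking a failed health check

health_ok() {
  local total running
  total=$(kubectl get pods --all-namespaces --no-headers | wc -l)
  running=$(kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Running | wc -l)
  [ "$total" -gt 0 ] && [ $((running * 100 / total)) -ge "$THRESHOLD" ]
}

for node in $(kubectl get nodes -l "alpha.eksctl.io/nodegroup-name=$NODEGROUP" -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
  # Pause here until the cluster looks healthy again before touching the next node.
  until health_ok; do
    echo "Health check failed, waiting ${WAIT}s before re-checking..."
    sleep "$WAIT"
  done
done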
omerlh added the kind/feature label on Jan 6, 2020

TreverW commented Jan 22, 2020

kubectl drain has a --grace-period=<n> option that is somewhat helpful in not killing the old pods before the new ones are up, but it is only based on the time you specify. I would definitely like a way to avoid deleting the old pods before the new ones are up.
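
For reference, that flag applies per evicted pod; for example (the node name here is made up):

kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --grace-period=120

This gives each evicted pod up to 120 seconds to terminate gracefully, but it does not wait for replacement pods to become Ready elsewhere, which is the gap described above.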

omerlh (Author) commented Jan 26, 2020

I was also aiming for the ability to specify some grace period between nodes... I ended up writing my own drain script (I can make it public if it's of interest), but I'd be happy to use official eksctl code instead.


andreamaruccia commented Apr 23, 2020

As a workaround I had to use this command:

kubectl drain -l 'alpha.eksctl.io/nodegroup-name in (ng3-sandbox-1b,ng3-sandbox-1c,ng3-sandbox-1a)' --ignore-daemonsets --delete-local-data --grace-period -1

My suggestion here would be that eksctl drain accept a switch --pod-grace-period, defaulting to -1, because that is the value that lets the drain process respect each pod's termination grace period. See https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain -> grace-period
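
Hypothetically (this flag does not exist in eksctl today), usage of the proposal could look like:

eksctl drain nodegroup --cluster=XXX --name=YYY --pod-grace-period=-1

where -1 means each pod's own terminationGracePeriodSeconds is honoured, matching kubectl drain's default behaviour.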

martina-if added the good first issue and help wanted labels on Apr 23, 2020
smrutiranjantripathy (Contributor) commented May 3, 2020

> I was also aiming for the ability to specify some grace period between nodes... I ended up writing my own drain script (I can make it public if it's of interest), but I'd be happy to use official eksctl code instead.

Are you looking for a feature that adds a time gap between draining subsequent nodes of a node group, or are you looking for the "grace-period" flag to be implemented on the eksctl side?

I would like to work on this issue. Please clarify the above questions.


pandvan commented Jun 18, 2020

> Are you looking for a feature that adds a time gap between draining subsequent nodes of a node group, or are you looking for the "grace-period" flag to be implemented on the eksctl side?

Sorry to reply to a question aimed at another user, but I'm running into a similar situation during the creation of a new node group followed by the deletion of the old one.
Using eksctl drain nodegroup --cluster=XXX --name=YYY immediately drains all nodes of the selected node group, which makes it slow to bring the recreated deployments back up and running on the new nodes (their Docker images also have to be downloaded, ...).
I'm looking for a way with eksctl to simulate this workflow:

  • cordon all nodes of the old node group, to be sure no pods will be scheduled there again;
  • drain one node at a time, waiting some amount of time between one node and the next (a sketch follows below).

This way it should be safer (and even faster) to move, for example, apps that need to form a cluster among themselves (1 app cluster node per pod).
Otherwise I'll need to use kubectl as suggested before to achieve the same.
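
A rough sketch of that workflow as a kubectl wrapper script (not an eksctl feature), assuming nodes carry the alpha.eksctl.io/nodegroup-name label; the node group name and pause length are placeholders:

#!/usr/bin/env bash
# Sketch only: cordon every node of the old node group first, then drain one node
# at a time with a fixed pause in between.
set -euo pipefail

NODEGROUP="ng-old"   # hypothetical old node group name
PAUSE=120            # seconds to wait between draining nodes

NODES=$(kubectl get nodes -l "alpha.eksctl.io/nodegroup-name=$NODEGROUP" -o name)

# Step 1: cordon everything so no new pods get scheduled on the old group.
for node in $NODES; do
  kubectl cordon "$node"
done

# Step 2: drain one node at a time, pausing between nodes.
for node in $NODES; do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data --grace-period=-1
  sleep "$PAUSE"
done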


martina-if added the priority/backlog label on Sep 11, 2020
Himangini (Collaborator) commented

Closing this due to lack of activity.

Skarlso (Contributor) commented Jan 6, 2022

Since 2020.10.19 we have been using proper eviction and cordoning, working on each node for as long as it still has pods on it. We are also doing the "health check" part by filtering for pods which can be evicted or deleted.

What I would like to understand here is whether that is now working properly, or whether there are still issues around it.
