
Add support for health check when draining nodes #1699

Closed
omerlh opened this issue Jan 6, 2020 · 10 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed), kind/feature (New feature or request), priority/backlog (Not staffed at the moment. Help wanted.)

Comments


omerlh commented Jan 6, 2020


Why do you want this feature?
When draining nodes on a production cluster, it might be safer to run a health check between node/node group draining loops, to ensure that everything has gone well so far. If the health check fails, the draining process should stop. An example health check: how many ready pods exist out of all pods. If that number is below a given threshold, wait a few seconds and check again before continuing.

What feature/behavior/change do you want?
Allow specifying a health check command or, as an MVP, a time to wait between nodes/node groups, maybe something like:

eksctl drain --wait-time 15s

Or, with a health check:

eksctl drain --health 'kubectl get pods --all-namespaces | grep -v Running'
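
In the meantime, something close to this can be approximated with a wrapper script around kubectl. A rough sketch only, not an eksctl feature: it assumes kubectl is configured for the cluster, that nodes carry the alpha.eksctl.io/nodegroup-name label, and the node group name, threshold and wait time below are placeholders:

#!/usr/bin/env bash
# Sketch only: drain one node at a time and run a ready-pod health check between nodes.
set -euo pipefail

NODEGROUP="ng-1"   # hypothetical node group name
THRESHOLD=90       # minimum percentage of Running pods required to continue
WAIT=15            # seconds to wait before re-checking a failed health check

health_ok() {
  local total running
  total=$(kubectl get pods --all-namespaces --no-headers | wc -l)
  running=$(kubectl get pods --all-namespaces --no-headers --field-selector=status.phase=Running | wc -l)
  [ "$total" -gt 0 ] && [ $((running * 100 / total)) -ge "$THRESHOLD" ]
}

for node in $(kubectl get nodes -l "alpha.eksctl.io/nodegroup-name=$NODEGROUP" -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
  # Pause here until the cluster looks healthy again before touching the next node.
  until health_ok; do
    echo "Health check failed, waiting ${WAIT}s before re-checking..."
    sleep "$WAIT"
  done
done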
omerlh added the kind/feature label on Jan 6, 2020

TreverW commented Jan 22, 2020

kubectl drain has a --grace-period=<n> option that is somewhat helpful in not killing the old pods before the new ones are up, but it is only based on the time you specify. I would definitely like a way to avoid deleting the old pods before the new ones are up.
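
For reference, that flag applies per evicted pod; for example (the node name here is made up):

kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --grace-period=120

This gives each evicted pod up to 120 seconds to terminate gracefully, but it does not wait for replacement pods to become Ready elsewhere, which is the gap described above.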

omerlh (Author) commented Jan 26, 2020

I was also aiming for the ability to specify some grace period between nodes... I ended up writing my own drain script (I can make it public if it's of interest), but I'd be happy to use official eksctl code instead.


andreamaruccia commented Apr 23, 2020

As a workaround I had to use this command:

kubectl drain -l 'alpha.eksctl.io/nodegroup-name in (ng3-sandbox-1b,ng3-sandbox-1c,ng3-sandbox-1a)' --ignore-daemonsets --delete-local-data --grace-period -1

My suggestion here would be that eksctl drain accept a switch --pod-grace-period, defaulting to -1, because that is the value that lets the drain process respect each pod's termination grace period. See https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#drain -> grace-period
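
Hypothetically (this flag does not exist in eksctl today), usage of the proposal could look like:

eksctl drain nodegroup --cluster=XXX --name=YYY --pod-grace-period=-1

where -1 means each pod's own terminationGracePeriodSeconds is honoured, matching kubectl drain's default behaviour.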

martina-if added the good first issue and help wanted labels on Apr 23, 2020
smrutiranjantripathy (Contributor) commented May 3, 2020

> I was also aiming for the ability to specify some grace period between nodes... I ended up writing my own drain script (I can make it public if it's of interest), but I'd be happy to use official eksctl code instead.

Are you looking for a feature that adds a time gap between draining subsequent nodes of a node group, or are you looking for the "grace-period" flag to be implemented on the eksctl side?

I would like to work on this issue. Please clarify the above questions.


pandvan commented Jun 18, 2020

> Are you looking for a feature that adds a time gap between draining subsequent nodes of a node group, or are you looking for the "grace-period" flag to be implemented on the eksctl side?

Sorry to reply to a question aimed at another user, but I'm running into a similar situation during the creation of a new node group followed by the deletion of the old one.
Using eksctl drain nodegroup --cluster=XXX --name=YYY immediately drains all nodes of the selected node group, which makes it slow to bring the recreated deployments back up and running on the new nodes (their Docker images also have to be downloaded, ...).
I'm looking for a way with eksctl to simulate this workflow:

  • cordon all nodes of the old node group, to be sure no pods will be scheduled there again;
  • drain one node at a time, waiting some amount of time between one node and the next (a sketch follows below).

This way it should be safer (and even faster) to move, for example, apps that need to form a cluster among themselves (1 app cluster node per pod).
Otherwise I'll need to use kubectl as suggested before to achieve the same.
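
A rough sketch of that workflow as a kubectl wrapper script (not an eksctl feature), assuming nodes carry the alpha.eksctl.io/nodegroup-name label; the node group name and pause length are placeholders:

#!/usr/bin/env bash
# Sketch only: cordon every node of the old node group first, then drain one node
# at a time with a fixed pause in between.
set -euo pipefail

NODEGROUP="ng-old"   # hypothetical old node group name
PAUSE=120            # seconds to wait between draining nodes

NODES=$(kubectl get nodes -l "alpha.eksctl.io/nodegroup-name=$NODEGROUP" -o name)

# Step 1: cordon everything so no new pods get scheduled on the old group.
for node in $NODES; do
  kubectl cordon "$node"
done

# Step 2: drain one node at a time, pausing between nodes.
for node in $NODES; do
  kubectl drain "$node" --ignore-daemonsets --delete-local-data --grace-period=-1
  sleep "$PAUSE"
done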


martina-if added the priority/backlog label on Sep 11, 2020
Himangini (Collaborator) commented

Closing this due to lack of activity.

Skarlso (Contributor) commented Jan 6, 2022

Since 2020.10.19 we have been using proper eviction and cordoning, working on each node for as long as it still has pods on it. We are also doing the "health check" part by filtering for pods which can be evicted or deleted.

What I would like to understand here is whether that is now working properly, or whether there are still issues around it.
