
Maintenance: user-scheduler close to end of life as currently implemented #1730

Closed
consideRatio opened this issue Jul 23, 2020 · 9 comments

@consideRatio (Member) commented Jul 23, 2020

The user-scheduler should keep working on various k8s versions, but it currently uses a kube-scheduler binary from k8s 1.13, which was the only version I got functioning with our current configuration. That configuration may have become deprecated in later versions of the kube-scheduler binary (#1483).

@yuvipanda also suspects it can cause performance issues for us, which is relevant when many users arrive at the same time and/or there are many pods.

I think there may be some newer k8s machinery we could make use of to resolve this. Perhaps the modern kube-scheduler binary can be configured through a ConfigMap and be adjusted to schedule pods with certain labels in certain ways, all by simply adding a certain kind of configuration resource? I don't know, but I think such options may have arrived with more recent k8s versions.
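To make that concrete, here is a minimal sketch (not our current setup) of the kind of configuration newer kube-scheduler releases accept through their KubeSchedulerConfiguration API, typically mounted from a ConfigMap and passed via --config. The plugin choices below are assumptions about how a pod-packing policy could be expressed, not a tested configuration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: user-scheduler
    plugins:
      score:
        # Prefer packing pods onto already-utilized nodes rather than
        # spreading them, so that other nodes can drain and scale down.
        disabled:
          - name: NodeResourcesLeastAllocated
          - name: NodeResourcesBalancedAllocation
        enabled:
          - name: NodeResourcesMostAllocated
```

If something like this works out, the chart would mostly need to template a ConfigMap rather than track the kube-scheduler binary's CLI flags as they change over time.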


Anyhow, this issue is opened to track the need to resolve this.


@consideRatio changed the title from "Maintenance: user-scheduler optimizations" to "Maintenance: user-scheduler close to end of life as currently implemented" on Jul 23, 2020
@betatim (Member) commented Jul 30, 2020

What do you think of using node (anti-)affinities to replace the scheduler? In BinderHub we use node affinity to try to have the same build pod scheduled on the same node. I was thinking we could extend this technique to user pods.

Idea: assign an anti-affinity for the "currently least used node" to a singleuser pod when it is launched. The normal scheduler would then try to assign it to a node that isn't "the least used node". This keeps the "least used node" free of pods, increasing its chances of becoming empty and getting culled. The anti-affinity would be soft (preferred, not required), so the scheduler would still use the "least used node" before requesting a new node to be started.
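To illustrate, a minimal sketch of what such a soft anti-affinity could look like in the singleuser pod spec, assuming the spawner knows the name of the currently least used node when it launches the pod (the node name below is a hypothetical placeholder):

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # Soft preference: avoid the node we want to keep empty, but still
        # allow it to be used before a new node has to be started.
        - weight: 100
          preference:
            matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                  - least-used-node  # hypothetical; filled in at spawn time
```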

WDYT?

@consideRatio (Member, Author) commented Jul 30, 2020

@betatim that's an interesting idea, thanks for brainstorming about this! The idea is more attractive now that the performance of scheduling with soft affinities has improved greatly (by two orders of magnitude) since we last considered these options. I like how it would make the addition independent of the kube-scheduler binary's API, which changes over time, and possibly also of the k8s API and the general requirements placed on a scheduler.

My interpretation of the idea

  1. We let our user pods have pre-defined preferred affinities and preferred anti-affinities for node labels, for example towards the labels hub.jupyter.org/user-attraction: positive and hub.jupyter.org/user-attraction: negative (see the sketch after this list).
  2. We could also make use of a required anti-affinity to forbid scheduling entirely if we want.
  3. We develop in-cluster software, running in some pod(s), that inspects whether a node should be given an attraction label based on arbitrary criteria.
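A rough sketch of point 1, assuming the hypothetical hub.jupyter.org/user-attraction node labels are maintained by such a labeller, could look like this in the user pod spec:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # Prefer nodes the labeller has marked as attractive for user pods
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/user-attraction
              operator: In
              values: ["positive"]
      # Prefer to stay away from nodes the labeller wants emptied out
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/user-attraction
              operator: NotIn
              values: ["negative"]
```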

I think the key here is to consider what criteria we would like to use to label nodes, and whether implementing such labeller logic makes more sense than, or complements, having a custom kube-scheduler configured in the cluster.

Brainstormed spin off - adjusted culling

We could make the culler act smarter, for example by culling pods more aggressively on nodes we want to see emptied.

Brainstormed spin off - informed culling through a user nudge API

We could develop an API through which JupyterHub can inform a Jupyter server that it is requested to restart. JupyterLab, being aware of this API, could for example present a popup asking the user to restart.

@betatim (Member) commented Jul 31, 2020

> We let our user pods have pre-defined preferred affinities and preferred anti-affinities for node labels, for example towards labels hub.jupyter.org/user-attraction: positive and hub.jupyter.org/user-attraction: negative

This isn't exactly what I had in mind. As I understand it and use it, the main job of the user-scheduler today is to keep one node as empty as possible, in order to help the cluster autoscaler scale down the cluster. For this use case it seems like we could do something like

https://github.com/jupyterhub/binderhub/blob/a168d069772012c52f9ac7056ec22d779927ae69/binderhub/build.py#L176-L193

but instead of an affinity for the "best node", use an anti-affinity for the "worst node". When spawning a new singleuser pod we'd determine the name of the "worst node" and set that in the anti-affinity. To improve performance we might want to cache the determination of the "worst node" for a few seconds or minutes.

@yuvipanda (Collaborator) commented

@consideRatio I had something like what you mentioned in https://github.com/berkeley-dsep-infra/datahub/tree/staging/images/rebalancer. I've turned it off; it was too flaky and did weird things when you had multiple hubs on the same cluster. Also, I figured I'd have to implement a bunch more logic for it to work. As a short example:

  1. Dealing with cordoning / unready nodes
  2. Respect Topologies, especially for volume provisioning / mounting
  3. Don't end up in weird loops when the node we pick is too small / doesn't have enough resources / etc.

@yuvipanda (Collaborator) commented

We're now on 1.16 I believe?

@consideRatio (Member, Author) commented

Yepp, the kube-scheduler binary was recently bumped to the release associated with k8s 1.16, and it works for k8s 1.16+. #1773 describes one fix needed for it to function on older versions, which is easy.

I thought it would break in more complicated ways when bumping it, but perhaps this is all that's needed for a while.

@yuvipanda (Collaborator) commented

I am on master of z2jh (of course :D), and I think it might be causing other issues wrt scheduling. I'll keep an eye out and report back.

@consideRatio (Member, Author) commented

Regarding the original post about unpinning the old version of kube-scheduler used by the user-scheduler:

I've worked on #1778. In that PR, we now use kube-scheduler 1.19 in k8s clusters at version 1.17 or potentially even lower, which is automatically detected. Otherwise we fall back to a 1.16 version of kube-scheduler, which is also configured to fix a bug on k8s 1.15 and lower cluster versions.
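For reference, a rough sketch of how a Helm template can branch on the detected cluster version, assuming Helm 3's standard .Capabilities.KubeVersion and semverCompare built-ins; the version boundary and image tags below are illustrative and not necessarily what #1778 ended up with:

```yaml
# sketch of image tag selection in the user-scheduler deployment template
{{- if semverCompare ">=1.17-0" .Capabilities.KubeVersion.Version }}
image: k8s.gcr.io/kube-scheduler:v1.19.1
{{- else }}
image: k8s.gcr.io/kube-scheduler:v1.16.15
{{- end }}
```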

With a PR like that merged and tested I consider this issue to be handled.

@consideRatio (Member, Author) commented

We now use kube-scheduler v1.19.1 when the k8s cluster is modern enough to allow for it. The scheduling allocates pods properly as before, and I've made it less prone to issues with future maintenance. I consider the original issue resolved with this!

@betatim let's try to remember that there were some interesting ideas here regarding BinderHub's scheduling needs that one could want to revisit, hmm...
