
Maintenance: user-scheduler close to end of life as currently implemented #1730

Closed
consideRatio opened this issue Jul 23, 2020 · 9 comments

@consideRatio (Member) commented Jul 23, 2020

The user-scheduler should keep working on various k8s versions, but it currently uses a kube-scheduler binary from k8s 1.13, which was the only version I got functioning with our current configuration. That configuration may have become deprecated in later versions of the kube-scheduler binary (#1483).

@yuvipanda also suspects it can cause performance issues for us, which is relevant when many users arrive at the same time and/or there are many pods.

I think there may be some newer k8s machinery we could make use of to resolve this. Perhaps the modern kube-scheduler binary can be configured through a ConfigMap and be adjusted to schedule pods with certain labels in certain ways, all by simply adding a certain kind of configuration resource? I don't know, but I think such options may have arrived with more recent k8s versions.
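To make that concrete, here is a minimal sketch (not our current setup) of the kind of configuration newer kube-scheduler releases accept through their KubeSchedulerConfiguration API, typically mounted from a ConfigMap and passed via --config. The plugin choices below are assumptions about how a pod-packing policy could be expressed, not a tested configuration:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: user-scheduler
    plugins:
      score:
        # Prefer packing pods onto already-utilized nodes rather than
        # spreading them, so that other nodes can drain and scale down.
        disabled:
          - name: NodeResourcesLeastAllocated
          - name: NodeResourcesBalancedAllocation
        enabled:
          - name: NodeResourcesMostAllocated
```

If something like this works out, the chart would mostly need to template a ConfigMap rather than track the kube-scheduler binary's CLI flags as they change over time.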


Anyhow, this issue is opened to track the need to resolve this.


@consideRatio changed the title from "Maintenance: user-scheduler optimizations" to "Maintenance: user-scheduler close to end of life as currently implemented" on Jul 23, 2020
@betatim (Member) commented Jul 30, 2020

What do you think of using node (anti-)affinities to replace the scheduler? In BinderHub we use node affinity to try to have the same build pod scheduled on the same node. I was thinking we could extend this technique to user pods.

Idea: assign an anti-affinity for the "currently least used node" to a singleuser pod when it is launched. The normal scheduler would then try to assign it to a node that isn't "the least used node". This keeps the "least used node" free of pods, increasing its chances of becoming empty and getting culled. The anti-affinity would be soft (preferred, not required), so the scheduler would still use the "least used node" before requesting a new node to be started.
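To illustrate, a minimal sketch of what such a soft anti-affinity could look like in the singleuser pod spec, assuming the spawner knows the name of the currently least used node when it launches the pod (the node name below is a hypothetical placeholder):

```yaml
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # Soft preference: avoid the node we want to keep empty, but still
        # allow it to be used before a new node has to be started.
        - weight: 100
          preference:
            matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                  - least-used-node  # hypothetical; filled in at spawn time
```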

WDYT?

@consideRatio (Member, Author) commented Jul 30, 2020

@betatim that's an interesting idea, thanks for brainstorming about this! The idea is more attractive now that the performance of scheduling with soft affinities has improved greatly (by two orders of magnitude) since we last considered these options. I like how it would make the addition independent of the kube-scheduler binary's API, which changes over time, and possibly also of the k8s API and the general requirements placed on a scheduler.

My interpretation of the idea

  1. We let our user pods have pre-defined preferred affinities and preferred anti-affinities for node labels, for example towards the labels hub.jupyter.org/user-attraction: positive and hub.jupyter.org/user-attraction: negative (see the sketch after this list).
  2. We could also make use of a required anti-affinity to forbid scheduling entirely if we want.
  3. We develop in-cluster software, running in some pod(s), that inspects whether a node should be given an attraction label based on arbitrary criteria.
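A rough sketch of point 1, assuming the hypothetical hub.jupyter.org/user-attraction node labels are maintained by such a labeller, could look like this in the user pod spec:

```yaml
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # Prefer nodes the labeller has marked as attractive for user pods
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/user-attraction
              operator: In
              values: ["positive"]
      # Prefer to stay away from nodes the labeller wants emptied out
      - weight: 100
        preference:
          matchExpressions:
            - key: hub.jupyter.org/user-attraction
              operator: NotIn
              values: ["negative"]
```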

I think the key here is to consider what criteria we would like to use to label nodes, and whether implementing such labeller logic makes more sense than, or complements, having a custom kube-scheduler configured in the cluster.

Brainstormed spin off - adjusted culling

We could make the culler act smarter, for example by culling pods more aggressively on nodes we want to see emptied.

Brainstormed spin off - informed culling through a user nudge API

We could develop an API through which JupyterHub can inform a Jupyter server that it is requested to restart. JupyterLab, being aware of this API, could for example present a popup asking the user to restart.

@betatim (Member) commented Jul 31, 2020

> We let our user pods have pre-defined preferred affinities and preferred anti-affinities for node labels, for example towards labels hub.jupyter.org/user-attraction: positive and hub.jupyter.org/user-attraction: negative

This isn't exactly what I had in mind. As I understand it and use it, the main job of the user-scheduler today is to keep one node as empty as possible, in order to help the cluster autoscaler scale down the cluster. For this use case it seems like we could do something like

https://github.com/jupyterhub/binderhub/blob/a168d069772012c52f9ac7056ec22d779927ae69/binderhub/build.py#L176-L193

but instead of an affinity for the "best node", use an anti-affinity for the "worst node". When spawning a new singleuser pod we'd determine the name of the "worst node" and set that in the anti-affinity. To improve performance we might want to cache the determination of the "worst node" for a few seconds or minutes.

@yuvipanda (Collaborator) commented

@consideRatio I had something like what you mentioned in https://github.com/berkeley-dsep-infra/datahub/tree/staging/images/rebalancer. I've turned it off; it was too flaky and did weird things when you had multiple hubs on the same cluster. Also, I figured I'd have to implement a bunch more logic for it to work. As a short example:

  1. Dealing with cordoning / unready nodes
  2. Respect Topologies, especially for volume provisioning / mounting
  3. Don't end up in weird loops when the node we pick is too small / doesn't have enough resources / etc.

@yuvipanda (Collaborator) commented

We're now on 1.16 I believe?

@consideRatio (Member, Author) commented

Yepp, the kube-scheduler binary was recently bumped to the release associated with k8s 1.16, and it works for k8s 1.16+. #1773 describes one fix needed for it to function on older versions, which is easy.

I thought it would break in more complicated ways when bumping it, but perhaps this is all that's needed for a while.

@yuvipanda (Collaborator) commented

I am on master of z2jh (of course :D), and I think it might be causing other issues wrt scheduling. I'll keep an eye out and report back.

@consideRatio (Member, Author) commented

Regarding the original post about unpinning the old version of kube-scheduler used by the user-scheduler:

I've worked on #1778. In that PR, we now use kube-scheduler 1.19 in k8s clusters at version 1.17 or potentially even lower, which is automatically detected. Otherwise we fall back to a 1.16 version of kube-scheduler, which is also configured to fix a bug on k8s 1.15 and lower cluster versions.
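For reference, a rough sketch of how a Helm template can branch on the detected cluster version, assuming Helm 3's standard .Capabilities.KubeVersion and semverCompare built-ins; the version boundary and image tags below are illustrative and not necessarily what #1778 ended up with:

```yaml
# sketch of image tag selection in the user-scheduler deployment template
{{- if semverCompare ">=1.17-0" .Capabilities.KubeVersion.Version }}
image: k8s.gcr.io/kube-scheduler:v1.19.1
{{- else }}
image: k8s.gcr.io/kube-scheduler:v1.16.15
{{- end }}
```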

With a PR like that merged and tested I consider this issue to be handled.

@consideRatio (Member, Author) commented

We now use kube-scheduler v1.19.1 when the k8s cluster is modern enough to allow for it. The scheduling allocates pods properly as before, and I've made it less prone to issues with future maintenance. I consider the original issue resolved with this!

@betatim let's try to remember that there were some interesting ideas here regarding BinderHub's scheduling needs that one could want to revisit, hmm...
