
Updates to user-scheduler's coupling to the kube-scheduler binary #1778

Merged: 7 commits merged into jupyterhub:master on Sep 21, 2020

Conversation

@consideRatio (Member) commented Sep 12, 2020

Closes #1773, which also contains the findings that led to the changes I summarize below.

Important review

  • We should review this on a modern k8s cluster (1.17+) and verify that the user-scheduler does indeed schedule pods on the busiest node even with the new configuration style. I suggest this is done after the merge though.

Summary of changes

  • Fixes a bug, introduced by bumping to version 1.16 of the kube-scheduler binary, relating to missing permission to look up CSINode resources.
  • Bumps the kube-scheduler binary to the latest k8s version (1.19.1), with a fallback to 1.16.15 if the CSINode resource is not registered as a resource type in the k8s cluster.
  • Configures the modern kube-scheduler binary using its new configuration format (see the sketch after this list).
  • Provides some notes on when we can remove the fallback (when 1.17+ is assumed).
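
For reference, here is a minimal sketch of what the new configuration format looks like. The API version, schedulerName, and plugin list are illustrative and may differ from what the chart actually ships:

    # Sketch of the kube-scheduler component config format (k8s 1.19, v1beta1).
    # The schedulerName and the plugins enabled/disabled here are illustrative.
    apiVersion: kubescheduler.config.k8s.io/v1beta1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: user-scheduler
        plugins:
          score:
            disabled:
              # default scorers that spread pods out, working against packing
              - name: NodeResourcesLeastAllocated
              - name: NodeResourcesBalancedAllocation
            enabled:
              # prefer the node that is already the most allocated (packing)
              - name: NodeResourcesMostAllocated
                weight: 1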

More detailed summary of changes

  • I made us mount the configuration as a file, which is required by the modern configuration format; it also means we no longer need to provide RBAC rules for inspecting ConfigMaps.
  • I made the new plugins configurable for the advanced user wanting to try another scheduling policy (see the values sketch after this list).
  • I opted to fully specify the RBAC permissions rather than complement what could end up missing by augmenting the system:kube-scheduler and system:volume-scheduler ClusterRoles, as doing so felt easier to maintain.
  • I removed a Helm template function that used the current k8s version as a fallback for the kube-scheduler image tag; it would have been unsuitable to use and has not been used.
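
As a sketch of what such an override could look like for an advanced user, assuming the plugins are exposed under scheduling.userScheduler.plugins (the exact key and defaults may differ from the chart's actual values):

    # Hypothetical values.yaml override switching to a spreading policy;
    # the key path and structure are assumptions based on this PR's description.
    scheduling:
      userScheduler:
        enabled: true
        plugins:
          score:
            disabled:
              - name: NodeResourcesMostAllocated
            enabled:
              - name: NodeResourcesLeastAllocated
                weight: 1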

@consideRatio changed the title from y to [WIP] Updates to user-scheduler's coupling to the kube-scheduler binary Sep 12, 2020
@consideRatio marked this pull request as draft September 12, 2020 18:54
@consideRatio changed the title from [WIP] Updates to user-scheduler's coupling to the kube-scheduler binary to Updates to user-scheduler's coupling to the kube-scheduler binary Sep 12, 2020
@consideRatio marked this pull request as ready for review September 13, 2020 02:13
@consideRatio (Member Author) commented

I have the time to test this both on k8s 1.15 and on k8s 1.17 today if I merge this now, so I'll go ahead and do it.

@consideRatio merged commit 3d6e757 into jupyterhub:master Sep 21, 2020
@consideRatio (Member Author) commented Sep 21, 2020

I've now reviewed this PR in a k8s 1.17 cluster, where it works but is not yet good enough. It tends to schedule on one node over another, but I need to disable one more plugin that skews it away from scheduling only on the most resource-intensive node.

Example scheduling logic - a pod is scored

Here is the scoring, as indicated by the logs from the user-scheduler pod, when scheduling a pod while one node had more pods scheduled on it than the other.

SelectorSpread             scores on pod1 => [{node1 27     } {node2 0      }]
NodeResourcesMostAllocated scores on pod1 => [{node1 18     } {node2 65     }]
ImageLocality              scores on pod1 => [{node1 0      } {node2 0      }]
InterPodAffinity           scores on pod1 => [{node1 0      } {node2 0      }]
NodeAffinity               scores on pod1 => [{node1 0      } {node2 0      }]
NodePreferAvoidPods        scores on pod1 => [{node1 1000000} {node2 1000000}]
PodTopologySpread          scores on pod1 => [{node1 200    } {node2 200    }]
TaintToleration            scores on pod1 => [{node1 100    } {node2 100    }]
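
For context, the scheduler sums the weight-adjusted plugin scores per node and picks the node with the highest total. Assuming all plugin weights are 1 here, that gives:

    node1: 27 + 18 + 0 + 0 + 0 + 1000000 + 200 + 100 = 1000345
    node2:  0 + 65 + 0 + 0 + 0 + 1000000 + 200 + 100 = 1000365

So node2 (the more allocated node) still wins in this example, but SelectorSpread's 27 points for node1 shrink the margin from 47 to 20.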

To debug this yourself as I did, I suggest:

  1. Configure the chart to use only one user-scheduler replica.

    scheduling:
      userScheduler:
        enabled: true
        replicas: 1
    
  2. Scale up to 10 user placeholders.

    kubectl scale sts user-placeholder --replicas 10
    
  3. Inspect the distribution of pods.

    kubectl describe nodes
    
  4. Delete a pod and let it be recreated and rescheduled.

    kubectl delete pod user-placeholder-0
    
  5. Inspect the logs and grep for mentions of the specific user-placeholder pod.

    kubectl logs -l component=user-scheduler | grep user-placeholder-0
    

SelectorSpread - disabling it

As we can see, the SelectorSpread plugin's scoring works against that of NodeResourcesMostAllocated, which we don't want. I'll disable it in another PR.

ImageLocality - disabling it

For BinderHubs, it could potentially be nice to have ImageLocality - to schedule on nodes that already have an image available - but that comes at the price of spreading pods out... I would argue it's not worth it for BinderHubs either, as it would block downscaling. I'll disable it by default as well (see the sketch below). Note that in the old kube-scheduler binary configuration, ImageLocality was not given priority over the equivalent of NodeResourcesMostAllocated either.
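
A sketch of what disabling both plugins in the scheduler profile could look like - the surrounding configuration is abbreviated and the details may differ in the follow-up PR:

    # Sketch only: disables the two score plugins discussed above so that
    # NodeResourcesMostAllocated dominates the scoring.
    profiles:
      - schedulerName: user-scheduler
        plugins:
          score:
            disabled:
              - name: SelectorSpread
              - name: ImageLocality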

PodTopologySpread - leaving it enabled

This is a new feature in k8s 1.18, and I think that if anyone uses it, it should be respected. See https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/.
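
For reference, a minimal sketch of a pod that opts into this feature via topologySpreadConstraints - the pod name, labels, and image are placeholders:

    # Illustrative pod spec using topologySpreadConstraints (k8s 1.18+).
    apiVersion: v1
    kind: Pod
    metadata:
      name: example
      labels:
        app: example
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.2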
