
Updates to user-scheduler's coupling to the kube-scheduler binary #1778

Merged: 7 commits merged into jupyterhub:master on Sep 21, 2020

Conversation

@consideRatio (Member) commented Sep 12, 2020

Closes #1773, which also contains the findings that led to the changes I summarize below.

Important review

  • We should review this on a modern k8s cluster (1.17+) and verify that the user-scheduler does indeed schedule pods on the busiest node even with the new configuration style. I suggest this is done after the merge though.

Summary of changes

  • Fixes a bug, introduced by bumping to version 1.16 of the kube-scheduler binary, relating to missing permission to look up CSINode resources.
  • Bumps the kube-scheduler binary to the latest k8s version (1.19.1), with a fallback to 1.16.15 if the CSINode resource is not registered as a resource type in the k8s cluster.
  • Configures the modern kube-scheduler binary using its new configuration format (see the sketch after this list).
  • Provides some notes on when we can remove the fallback (when 1.17+ is assumed).
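
For reference, here is a minimal sketch of what the new configuration format looks like. The API version, schedulerName, and plugin list are illustrative and may differ from what the chart actually ships:

    # Sketch of the kube-scheduler component config format (k8s 1.19, v1beta1).
    # The schedulerName and the plugins enabled/disabled here are illustrative.
    apiVersion: kubescheduler.config.k8s.io/v1beta1
    kind: KubeSchedulerConfiguration
    profiles:
      - schedulerName: user-scheduler
        plugins:
          score:
            disabled:
              # default scorers that spread pods out, working against packing
              - name: NodeResourcesLeastAllocated
              - name: NodeResourcesBalancedAllocation
            enabled:
              # prefer the node that is already the most allocated (packing)
              - name: NodeResourcesMostAllocated
                weight: 1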

More detailed summary of changes

  • I made us mount the configuration as a file, which is required by the modern configuration format; it also means we no longer need to provide RBAC rules for inspecting ConfigMaps.
  • I made the new plugins configurable for the advanced user wanting to try another scheduling policy (see the values sketch after this list).
  • I opted to fully specify the RBAC permissions rather than complement what could end up missing by augmenting the system:kube-scheduler and system:volume-scheduler ClusterRoles, as doing so felt easier to maintain.
  • I removed a Helm template function that used the current k8s version as a fallback for the kube-scheduler image tag; it would have been unsuitable to use and has not been used.
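
As a sketch of what such an override could look like for an advanced user, assuming the plugins are exposed under scheduling.userScheduler.plugins (the exact key and defaults may differ from the chart's actual values):

    # Hypothetical values.yaml override switching to a spreading policy;
    # the key path and structure are assumptions based on this PR's description.
    scheduling:
      userScheduler:
        enabled: true
        plugins:
          score:
            disabled:
              - name: NodeResourcesMostAllocated
            enabled:
              - name: NodeResourcesLeastAllocated
                weight: 1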

@consideRatio changed the title from y to [WIP] Updates to user-scheduler's coupling to the kube-scheduler binary Sep 12, 2020
@consideRatio marked this pull request as draft September 12, 2020 18:54
@consideRatio changed the title from [WIP] Updates to user-scheduler's coupling to the kube-scheduler binary to Updates to user-scheduler's coupling to the kube-scheduler binary Sep 12, 2020
@consideRatio marked this pull request as ready for review September 13, 2020 02:13
@consideRatio (Member Author) commented

I have the time to test this both on k8s 1.15 and on k8s 1.17 today if I merge this now, so I'll go ahead and do it.

@consideRatio merged commit 3d6e757 into jupyterhub:master Sep 21, 2020
@consideRatio (Member Author) commented Sep 21, 2020

I've now reviewed this PR in a k8s 1.17 cluster, where it works but is not yet good enough. It tends to schedule on one node over another, but I need to disable one more plugin that skews it away from scheduling only on the most resource-intensive node.

Example scheduling logic - a pod is scored

Here is the scoring, as indicated by the logs from the user-scheduler pod, when scheduling a pod while one node had more pods scheduled on it than the other.

SelectorSpread             scores on pod1 => [{node1 27     } {node2 0      }]
NodeResourcesMostAllocated scores on pod1 => [{node1 18     } {node2 65     }]
ImageLocality              scores on pod1 => [{node1 0      } {node2 0      }]
InterPodAffinity           scores on pod1 => [{node1 0      } {node2 0      }]
NodeAffinity               scores on pod1 => [{node1 0      } {node2 0      }]
NodePreferAvoidPods        scores on pod1 => [{node1 1000000} {node2 1000000}]
PodTopologySpread          scores on pod1 => [{node1 200    } {node2 200    }]
TaintToleration            scores on pod1 => [{node1 100    } {node2 100    }]
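
For context, the scheduler sums the weight-adjusted plugin scores per node and picks the node with the highest total. Assuming all plugin weights are 1 here, that gives:

    node1: 27 + 18 + 0 + 0 + 0 + 1000000 + 200 + 100 = 1000345
    node2:  0 + 65 + 0 + 0 + 0 + 1000000 + 200 + 100 = 1000365

So node2 (the more allocated node) still wins in this example, but SelectorSpread's 27 points for node1 shrink the margin from 47 to 20.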

To debug this yourself as I did, I suggest:

  1. Configure the chart to use only one user-scheduler replica.

    scheduling:
      userScheduler:
        enabled: true
        replicas: 1
    
  2. Scale up to 10 user placeholders.

    kubectl scale sts user-placeholder --replicas 10
    
  3. Inspect the distribution of pods.

    kubectl describe nodes
    
  4. Delete a pod and let it be recreated and rescheduled.

    kubectl delete pod user-placeholder-0
    
  5. Inspect the logs and grep for mentions of the specific user-placeholder pod.

    kubectl logs -l component=user-scheduler | grep user-placeholder-0
    

SelectorSpread - disabling it

As we can see, the SelectorSpread plugin's scoring works against that of NodeResourcesMostAllocated, which we don't want. I'll disable it in another PR.

ImageLocality - disabling it

For BinderHubs, it could potentially be nice to have ImageLocality - to schedule on nodes that already have an image available - but that comes at the price of spreading pods out... I would argue it's not worth it for BinderHubs either, as it would block downscaling. I'll disable it by default as well (see the sketch below). Note that in the old kube-scheduler binary configuration, ImageLocality was not given priority over the equivalent of NodeResourcesMostAllocated either.
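
A sketch of what disabling both plugins in the scheduler profile could look like - the surrounding configuration is abbreviated and the details may differ in the follow-up PR:

    # Sketch only: disables the two score plugins discussed above so that
    # NodeResourcesMostAllocated dominates the scoring.
    profiles:
      - schedulerName: user-scheduler
        plugins:
          score:
            disabled:
              - name: SelectorSpread
              - name: ImageLocality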

PodTopologySpread - leaving it enabled

This is a new feature in k8s 1.18, and I think that if anyone uses it, it should be respected. See https://kubernetes.io/blog/2020/05/introducing-podtopologyspread/.
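
For reference, a minimal sketch of a pod that opts into this feature via topologySpreadConstraints - the pod name, labels, and image are placeholders:

    # Illustrative pod spec using topologySpreadConstraints (k8s 1.18+).
    apiVersion: v1
    kind: Pod
    metadata:
      name: example
      labels:
        app: example
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: example
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.2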
