This repository has been archived by the owner on Dec 2, 2021. It is now read-only.

Priority in Kubernetes API

@bsalamat

May 2017

Objective

  • How to specify priority for workloads in Kubernetes API.
  • Define how the order of these priorities is specified.
  • Define how new priority levels are added.
  • Effect of priority on scheduling and preemption.

Non-Goals

  • How preemption works in Kubernetes.
  • How quota allocation and accounting works for each priority.

Background

It is fairly common for clusters to have more tasks than their resources can handle. Often the workload is a mix of high priority critical tasks and non-urgent tasks that can wait. Cluster management should be able to distinguish these workloads in order to decide which ones should acquire resources sooner and which ones can wait. The priority of a workload is one of the key pieces of information that the cluster uses to make this decision. This document is a more detailed design proposal for part of the high-level architecture described in Resource sharing architecture for batch and serving workloads in Kubernetes.

Overview

This design doc introduces the concept of priorities for pods in Kubernetes and describes how priority impacts scheduling and preemption of pods when the cluster runs out of resources. A pod can specify a priority at creation time. The priority must be one of the valid values, and there is a total order on the values. The priority of a pod is independent of its workload type. The priority is global and not specific to a particular namespace.

Detailed Design

Effect of priority on scheduling

One could generally expect a pod with higher priority to have a higher chance of getting scheduled than the same pod with lower priority. However, many other parameters affect scheduling decisions, so a high priority pod may or may not be scheduled before lower priority pods. The details of what determines the order in which pods are scheduled are beyond the scope of this document.

Effect of priority on preemption

Generally, lower priority pods are more likely to get preempted by higher priority pods when the cluster has reached a threshold. In such a case, the scheduler may decide to preempt lower priority pods to release enough resources for higher priority pending pods. As mentioned before, many other parameters affect scheduling decisions, such as affinity and anti-affinity. If the scheduler determines that a high priority pod cannot be scheduled even if lower priority pods are preempted, it will not preempt lower priority pods. The scheduler may have other restrictions on preempting pods; for example, it may refuse to preempt a pod if doing so would violate a PodDisruptionBudget. The details of scheduling and preemption decisions are beyond the scope of this document.

Priority in PodSpec

Pods may have priority in their pod spec. PodSpec will have two new fields: "PriorityClassName", which is specified by the user, and "Priority", which is populated by Kubernetes. The user-specified priority (PriorityClassName) is a string, and all of the valid priority classes are defined by a system-wide mapping that maps each string to an integer. The PriorityClassName specified in a pod spec must be found in this map or the pod creation request will be rejected. If PriorityClassName is empty, it will resolve to the default priority (see below for more info on name resolution). Once the PriorityClassName is resolved to an integer, it is placed in the "Priority" field of PodSpec.

type PodSpec struct {
  ...
  PriorityClassName string
  Priority          *int32  // Populated by Admission Controller. Users are not allowed to set it directly.
}

Priority Classes

The cluster may have many user-defined priority classes for various use cases. The following list is an example of what the priorities and their values may look like. Kubernetes will also have special priority class names reserved for critical system pods. Please see System Priority Class Names for more information. Any priority value above 1 billion is reserved for system use. Aside from those system priority classes, Kubernetes is not shipped with predefined priority classes usable by user pods. The main goal of having no built-in priority classes for user pods is to avoid creating de facto standard names, which may be hard to change in the future.

system  2147483647 (int_max)
tier1   4000
tier2   2000
tier3   1000

The following shows a list of example workloads in a Kubernetes cluster in decreasing order of priority:

  • Kubernetes system daemons (per-node like fluentd, and cluster-level like Heapster)
  • Critical user infrastructure (e.g. storage servers, monitoring system like Prometheus, etc.)
  • Components that are in the user-facing request serving path and must be able to scale up arbitrarily in response to load spikes (web servers, middleware, etc.)
  • Important interruptible workloads that need a strong guarantee of schedulability and of not being interrupted
  • Less important interruptible workloads that need a less strong guarantee of schedulability and of not being interrupted
  • Best effort / opportunistic

Resolving priority class names

User requests sent to Kubernetes may have PriorityClassName in their PodSpec. The admission controller resolves a PriorityClassName to its corresponding number and populates the "Priority" field of the pod spec. The rest of the Kubernetes components look at the "Priority" field of the pod spec and work with the integer value. In other words, PriorityClassName will be ignored by the rest of the system.

We are going to add a new API object called PriorityClass. The priority class defines the mapping between the priority name and its value. It can have an optional description, which is an arbitrary string provided only as a guideline for users.

A priority class can be marked as "Global Default" by setting its GlobalDefault field to true. If a pod does not specify any PriorityClassName, the system resolves it to the value of the global default priority class, if one exists. If there is no global default, the pod's priority will be resolved to zero. The priority admission controller ensures that there is only one global default priority class.

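type PriorityClass struct {
  metav1.TypeMeta
  // +optional
  metav1.ObjectMeta

  // The value of this priority class. This is the actual priority that pods
  // receive when they have the name of this class in their pod spec.
  Value         int32
  // Whether this class is the default for pods with no PriorityClassName.
  GlobalDefault bool
  // Optional free-form text provided only as a guideline for users.
  Description   string
}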
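The resolution rules above can be sketched as a small function. This is only an illustration under stated assumptions, not the admission controller's actual code: the trimmed-down PriorityClass type, the resolvePriority helper, and the ErrUnknownPriorityClass error are hypothetical names introduced for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// PriorityClass is a trimmed-down stand-in for the API object in this
// proposal; Name replaces the name carried by ObjectMeta.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
}

// ErrUnknownPriorityClass is a hypothetical error for names that are not
// present in the system-wide mapping.
var ErrUnknownPriorityClass = errors.New("unknown priority class")

// resolvePriority is a hypothetical helper that mirrors the rules in the
// text: an empty name resolves to the global default (or to zero if there
// is none), and an unknown name causes the request to be rejected.
func resolvePriority(name string, classes []PriorityClass) (int32, error) {
	if name == "" {
		for _, pc := range classes {
			if pc.GlobalDefault {
				return pc.Value, nil
			}
		}
		return 0, nil
	}
	for _, pc := range classes {
		if pc.Name == name {
			return pc.Value, nil
		}
	}
	return 0, fmt.Errorf("%w: %q", ErrUnknownPriorityClass, name)
}

func main() {
	classes := []PriorityClass{
		{Name: "tier1", Value: 4000},
		{Name: "tier2", Value: 2000, GlobalDefault: true},
		{Name: "tier3", Value: 1000},
	}
	for _, name := range []string{"tier1", "", "missing"} {
		value, err := resolvePriority(name, classes)
		fmt.Printf("name=%q value=%d err=%v\n", name, value, err)
	}
}
```

In this sketch the resolved integer is the only thing a caller gets back, which matches the intent that the rest of the system never looks at the class name.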

Ordering of priorities

As mentioned earlier, a PriorityClassName is resolved by the admission controller to its integral value and Kubernetes components use the integral value. The higher the value, the higher the priority.
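As a minimal sketch of this ordering, the example classes listed earlier can be sorted by their integer values alone. The sortByPriority helper and the reduced PriorityClass type are assumptions made for illustration; the sketch says nothing about the order in which pods are actually scheduled.

```go
package main

import (
	"fmt"
	"sort"
)

// PriorityClass keeps only the fields needed to illustrate ordering.
type PriorityClass struct {
	Name  string
	Value int32
}

// sortByPriority is a hypothetical helper that orders classes from highest
// to lowest priority by comparing their integer values only.
func sortByPriority(classes []PriorityClass) {
	sort.SliceStable(classes, func(i, j int) bool {
		return classes[i].Value > classes[j].Value
	})
}

func main() {
	classes := []PriorityClass{
		{Name: "tier3", Value: 1000},
		{Name: "system", Value: 2147483647},
		{Name: "tier2", Value: 2000},
		{Name: "tier1", Value: 4000},
	}
	sortByPriority(classes)
	for _, pc := range classes {
		fmt.Printf("%-6s %d\n", pc.Name, pc.Value)
	}
}
```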

System Priority Class Names

There will be special priority class names reserved for system use only. These classes have a value larger than one billion. The priority admission controller ensures that new priority classes will not be created with those names. They are used for critical system pods that must not be preempted. We set default policies that deny creation of pods with PriorityClassNames corresponding to these priorities. Cluster admins can authorize users or service accounts to create pods with these priorities. When non-authorized users set PriorityClassName to one of these priority classes in their pod spec, their pod creation request will be rejected. For pods created by controllers, the service account must be authorized by cluster admins.
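The two checks described above might look like the following sketch. This proposal does not list the actual reserved names, so the "system" entry is borrowed from the example table, and systemClassNames, validateNewClassName, and admitPod are hypothetical names. Authorization is reduced to a boolean here, whereas a real cluster would consult its configured policies.

```go
package main

import (
	"errors"
	"fmt"
)

// systemClassNames is a hypothetical set of reserved names. The proposal
// does not enumerate them; "system" comes from the example table.
var systemClassNames = map[string]bool{"system": true}

// validateNewClassName rejects attempts to create a priority class that
// uses a reserved name.
func validateNewClassName(name string) error {
	if systemClassNames[name] {
		return fmt.Errorf("priority class name %q is reserved for system use", name)
	}
	return nil
}

// admitPod is a hypothetical check: a pod that references a system class is
// admitted only if the requesting user or service account is authorized.
func admitPod(priorityClassName string, authorized bool) error {
	if systemClassNames[priorityClassName] && !authorized {
		return errors.New("not authorized to use a system priority class")
	}
	return nil
}

func main() {
	fmt.Println(validateNewClassName("system"))
	fmt.Println(validateNewClassName("tier1"))
	fmt.Println(admitPod("system", false))
	fmt.Println(admitPod("system", true))
}
```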

Modifying priority classes

Priority classes can be added or removed, but their name and value cannot be updated. We allow updating GlobalDefault and Description as long as there is a maximum of one global default. While Kubernetes can work fine if priority classes are changed at run-time, the change can be confusing to users, because pods that reference a priority class and were created before the change will have a different priority value than those created after the change. Deletion of priority classes is allowed, despite the fact that there may be existing pods that have specified such priority class names in their pod spec. In other words, there will be no referential integrity for priority classes. This is another reason that all system components should only work with the integer value of the priority and not with the PriorityClassName.

One could delete an existing priority class and create another one with the same name and a different value. By doing so, they can achieve the same effect as updating a priority class, but we still do not allow updating priority classes to prevent accidental changes.

Newly added priority classes cannot have a value higher than what is reserved for "system". The reason for this restriction is that Kubernetes critical system pods will have one of the "system" priorities and no pod should be able to preempt them.
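The rules in this section can be sketched as validation functions. This is an illustration under stated assumptions: validateCreate, validateUpdate, checkSingleDefault, and the maxUserPriority constant are hypothetical names, and the one billion limit is taken from the statement that values above it are reserved for system use.

```go
package main

import (
	"errors"
	"fmt"
)

// maxUserPriority is an assumed constant reflecting the statement that
// values above one billion are reserved for system use.
const maxUserPriority int32 = 1000000000

// PriorityClass is a trimmed-down stand-in for the API object.
type PriorityClass struct {
	Name          string
	Value         int32
	GlobalDefault bool
	Description   string
}

// validateCreate rejects user classes whose value falls in the reserved
// range or that would introduce a second global default.
func validateCreate(pc PriorityClass, existing []PriorityClass) error {
	if pc.Value > maxUserPriority {
		return errors.New("value is reserved for system priority classes")
	}
	return checkSingleDefault(pc, existing)
}

// validateUpdate allows changes to GlobalDefault and Description only.
func validateUpdate(old, updated PriorityClass, existing []PriorityClass) error {
	if old.Name != updated.Name || old.Value != updated.Value {
		return errors.New("name and value of a priority class cannot be updated")
	}
	return checkSingleDefault(updated, existing)
}

// checkSingleDefault ensures that no class other than pc is already marked
// as the global default when pc asks to be the default.
func checkSingleDefault(pc PriorityClass, existing []PriorityClass) error {
	if !pc.GlobalDefault {
		return nil
	}
	for _, other := range existing {
		if other.Name != pc.Name && other.GlobalDefault {
			return fmt.Errorf("%q is already the global default", other.Name)
		}
	}
	return nil
}

func main() {
	existing := []PriorityClass{{Name: "tier1", Value: 4000, GlobalDefault: true}}
	changed := existing[0]
	changed.Value = 5000
	fmt.Println(validateUpdate(existing[0], changed, existing))
	fmt.Println(validateCreate(PriorityClass{Name: "tier2", Value: 2000}, existing))
}
```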

Drawbacks of changing priority classes

While Kubernetes effectively allows changing priority classes (by deleting and adding them with a different value), it should be done only when absolutely needed. Changing priority classes has the following disadvantages:

  • May remove config portability: pod specs written for one cluster are no longer guaranteed to work on a different cluster if the same priority classes do not exist in the second cluster.
  • If quota is specified for existing priority classes (at the time of this writing, we don't have this feature in Kubernetes), adding or deleting priority classes will require reconfiguration of quota allocations.
  • Existing pods may have an integer priority value that does not reflect the current value of their PriorityClass.
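The last drawback can be seen in a small simulation. This is a sketch that assumes priority is resolved once, at admission time, as described earlier; the Pod type, the admit helper, and the map standing in for the set of priority classes are all hypothetical.

```go
package main

import "fmt"

// Pod keeps the two fields from this proposal that matter here.
type Pod struct {
	Name              string
	PriorityClassName string
	Priority          int32 // fixed at admission time
}

// admit is a hypothetical stand-in for the admission controller: it copies
// the current value of the named class into the pod.
func admit(name, className string, classes map[string]int32) (Pod, error) {
	value, ok := classes[className]
	if !ok {
		return Pod{}, fmt.Errorf("unknown priority class %q", className)
	}
	return Pod{Name: name, PriorityClassName: className, Priority: value}, nil
}

func main() {
	classes := map[string]int32{"tier2": 2000}
	before, _ := admit("before", "tier2", classes)

	// Delete the class and recreate it under the same name with a new value.
	delete(classes, "tier2")
	classes["tier2"] = 3000
	after, _ := admit("after", "tier2", classes)

	// Both pods name the same class but carry different integer priorities.
	fmt.Println(before.Priority, after.Priority) // prints "2000 3000"
}
```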

Priority and QoS classes

Kubernetes has three QoS classes, which are derived from the requests and limits of pods. Priority is introduced as an independent concept, meaning that any QoS class may have any valid priority. When a node is out of resources and pods need to be preempted, we give priority a higher weight than QoS class. In other words, we preempt the lowest priority pod and break ties with some other metrics, such as QoS class, usage above request, etc. This is not finalized yet. We will discuss and finalize preemption in a separate doc.
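Since preemption is not finalized, the following is only a rough sketch of the tentative ordering described above: priority is compared first, and QoS class is used only to break ties. The tie-break order (BestEffort first, then Burstable, then Guaranteed) is an assumption of this example, as are the Pod type and the sortVictims helper. Other metrics such as usage above request are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// QoSClass lists the three QoS classes. The numeric order is an assumption
// of this sketch: lower values are preempted first among equal priorities.
type QoSClass int

const (
	BestEffort QoSClass = iota
	Burstable
	Guaranteed
)

// Pod keeps only the fields used for ordering preemption candidates.
type Pod struct {
	Name     string
	Priority int32
	QoS      QoSClass
}

// sortVictims is a hypothetical helper that places the most preemptible
// pod first: lowest priority wins, and QoS class only breaks ties.
func sortVictims(pods []Pod) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Priority != pods[j].Priority {
			return pods[i].Priority < pods[j].Priority
		}
		return pods[i].QoS < pods[j].QoS
	})
}

func main() {
	pods := []Pod{
		{Name: "a", Priority: 2000, QoS: BestEffort},
		{Name: "b", Priority: 1000, QoS: Guaranteed},
		{Name: "c", Priority: 1000, QoS: BestEffort},
	}
	sortVictims(pods)
	for _, p := range pods {
		fmt.Println(p.Name, p.Priority, p.QoS)
	}
}
```

Note that pod "b" sorts ahead of pod "a" even though "b" is Guaranteed and "a" is BestEffort, because priority carries more weight than QoS class.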