Set temporary single CPU affinity before cgroup cpuset transition. #3923

Merged Apr 16, 2024 (1 commit)
125 changes: 125 additions & 0 deletions docs/isolated-cpu-affinity-transition.md
@@ -0,0 +1,125 @@
## Isolated CPU affinity transition

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in 5.7,
changed a deterministic scheduling behavior by distributing tasks across CPU
cores within a cgroup cpuset. As a result, `runc exec` might be impacted under
some circumstances, for example when a container has been created within a
cgroup cpuset entirely composed of isolated CPU cores, usually set with the
`nohz_full` and/or `isolcpus` kernel boot parameters.

Some containerized real-time applications rely on this deterministic behavior
and use the first CPU core to run a slow thread while the other CPU cores are
fully used by real-time threads with the SCHED_FIFO policy. Such applications
can prevent the runc process from joining a container when the runc process is
randomly scheduled on a CPU core owned by a real-time thread.

Runc introduces a way to restore this behavior by adding the following
annotation to the container runtime spec (`config.json`):

`org.opencontainers.runc.exec.isolated-cpu-affinity-transition`

This annotation can take one of the following values:

* `temporary`: temporarily set the runc process CPU affinity to the first
  isolated CPU core of the container cgroup cpuset.
* `definitive`: definitively set the runc process CPU affinity to the first
  isolated CPU core of the container cgroup cpuset.

For example:

```json
"annotations": {
    "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"
}
```

__WARNING:__ `definitive` requires kernel 6.2 or newer; it also works on
RHEL 9 and above.

### How does it work?

When enabled, during `runc exec` runc looks for the `nohz_full` kernel boot
parameter value and considers the CPUs in that list as isolated; it doesn't
look at the `isolcpus` boot parameter, it just assumes that the `isolcpus`
value is identical to `nohz_full` when specified. If the `nohz_full` parameter
is not found, runc also attempts to read the list from
`/sys/devices/system/cpu/nohz_full`.

Once it gets the isolated CPU list, it returns an eligible CPU core within the
container cgroup cpuset based on the following heuristics:

* when there are no cpuset cores: no eligible CPU.
* when there are no isolated cores: no eligible CPU.
* when cpuset cores are not in the isolated core list: no eligible CPU.
* when cpuset cores are all isolated cores: return the first CPU of the cpuset.
* when cpuset cores are a mix of housekeeping and isolated cores: return the
  first housekeeping CPU not in the isolated CPU list.

The returned CPU core is then used to set the `runc init` CPU affinity before
the container cgroup cpuset transition.
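The heuristics above can be sketched as follows; `eligibleCPU` is a
hypothetical helper for illustration, not runc's actual implementation:

```go
package main

import "fmt"

// eligibleCPU applies the heuristics described above: given the container
// cgroup cpuset and the isolated CPU list, it returns an eligible CPU and
// true, or false when there is none. Hypothetical sketch only.
func eligibleCPU(cpuset, isolated []int) (int, bool) {
	if len(cpuset) == 0 || len(isolated) == 0 {
		return 0, false // no cpuset cores, or no isolated cores
	}
	isolatedSet := make(map[int]bool, len(isolated))
	for _, cpu := range isolated {
		isolatedSet[cpu] = true
	}
	housekeeping := -1
	anyIsolated := false
	for _, cpu := range cpuset {
		if isolatedSet[cpu] {
			anyIsolated = true
		} else if housekeeping < 0 {
			housekeeping = cpu // first housekeeping CPU in the cpuset
		}
	}
	switch {
	case !anyIsolated:
		// cpuset cores are not in the isolated core list.
		return 0, false
	case housekeeping < 0:
		// cpuset cores are all isolated: first CPU of the cpuset.
		return cpuset[0], true
	default:
		// mixed cpuset: first housekeeping CPU.
		return housekeeping, true
	}
}

func main() {
	fmt.Println(eligibleCPU([]int{4, 5, 6, 7}, []int{4, 5, 6, 7})) // 4 true
	fmt.Println(eligibleCPU([]int{2, 3, 4, 5}, []int{4, 5, 6, 7})) // 2 true
}
```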

#### Transition example

`nohz_full` has the isolated cores `4-7`. A container has been created with
the cgroup cpuset `4-7` to only run on the isolated CPU cores 4 to 7.
`runc exec` is called by a process with CPU affinity set to `0-3`.

* with `temporary` transition:

runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7)

* with `definitive` transition:

runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4)

The difference between `temporary` and `definitive` is the container process
affinity: `definitive` constrains the container process to run on the first
isolated CPU core of the cgroup cpuset, while `temporary` restores the CPU
affinity to match the container cgroup cpuset.

A `definitive` transition might be helpful when `nohz_full` is used without
`isolcpus`, to prevent runc and the container process from being noisy
neighbours for real-time applications.

### How to use it with Kubernetes?

Kubernetes doesn't manage containers directly; instead it uses the Container
Runtime Interface (CRI) to communicate with software implementing this
interface that is responsible for managing the lifecycle of containers. There
are popular CRI implementations like containerd and CRI-O. Those
implementations allow passing pod annotations to the container runtime via the
container runtime spec. Currently runc is the default runtime for both.

#### Containerd configuration

Containerd CRI uses runc by default but requires an extra step to pass the
annotation to runc. You have to add
`org.opencontainers.runc.exec.isolated-cpu-affinity-transition` to the list of
pod annotations allowed to be passed to the container runtime in
`/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      base_runtime_spec = "/etc/containerd/cri-base.json"
      pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```

Note: from the author's tests, only `pod_annotations` passes the annotation to
both the sandbox and the container; `container_annotations` come from device
plugins and Kubernetes internals and are not controlled by users, which would
prevent enabling the feature on a per-pod basis.

#### CRI-O configuration

CRI-O doesn't require any extra step, however some annotations could be excluded by
configuration.
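As a sketch, if your CRI-O configuration restricts annotations, the annotation
can be allowed explicitly via `allowed_annotations` on the runtime entry in
`/etc/crio/crio.conf`; the runtime name and path below are illustrative:

```toml
# Hypothetical example: allow the annotation for the runc runtime entry.
[crio.runtime.runtimes.runc]
runtime_path = "/usr/bin/runc"
allowed_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```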

#### Pod deployment example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary"
spec:
  containers:
  - name: demo
    image: registry.com/demo:latest
```
1 change: 1 addition & 0 deletions features.go
@@ -68,6 +68,7 @@ var featuresCommand = cli.Command{
"bundle",
"org.systemd.property.", // prefix form
"org.criu.config",
"org.opencontainers.runc.exec.isolated-cpu-affinity-transition",
},
}

4 changes: 4 additions & 0 deletions libcontainer/cgroups/cgroups.go
@@ -71,4 +71,8 @@ type Manager interface {

// OOMKillCount reports OOM kill count for the cgroup.
OOMKillCount() (uint64, error)

// GetEffectiveCPUs returns the effective CPUs of the cgroup, an empty
// value means that the cgroups cpuset subsystem/controller is not enabled.
GetEffectiveCPUs() string
}
27 changes: 27 additions & 0 deletions libcontainer/cgroups/fs/fs.go
@@ -4,6 +4,8 @@ import (
"errors"
"fmt"
"os"
"path/filepath"
"strings"
"sync"

"golang.org/x/sys/unix"
@@ -263,3 +265,28 @@ func (m *Manager) OOMKillCount() (uint64, error) {

return c, err
}

func (m *Manager) GetEffectiveCPUs() string {
return GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
}

func GetEffectiveCPUs(cpusetPath string, cgroups *configs.Cgroup) string {
// Fast path.
if cgroups.CpusetCpus != "" {
return cgroups.CpusetCpus
} else if !strings.HasPrefix(cpusetPath, defaultCgroupRoot) {
return ""
}

// Iterate up the hierarchy until the cgroup root path is reached.
// This is required for containers in which the cpuset controller
// is not enabled; in that case a parent cgroup is used.
for path := cpusetPath; path != defaultCgroupRoot; path = filepath.Dir(path) {
cpus, err := fscommon.GetCgroupParamString(path, "cpuset.effective_cpus")
if err == nil {
return cpus
}
}

return ""
}
28 changes: 28 additions & 0 deletions libcontainer/cgroups/fs2/fs2.go
Expand Up @@ -4,11 +4,13 @@ import (
"errors"
"fmt"
"os"
"path/filepath"
"strings"

"github.com/opencontainers/runc/libcontainer/cgroups"
"github.com/opencontainers/runc/libcontainer/cgroups/fscommon"
"github.com/opencontainers/runc/libcontainer/configs"
"github.com/opencontainers/runc/libcontainer/utils"
)

type parseError = fscommon.ParseError
@@ -32,6 +34,9 @@ func NewManager(config *configs.Cgroup, dirPath string) (*Manager, error) {
if err != nil {
return nil, err
}
} else {
// Clean path for safety.
dirPath = utils.CleanPath(dirPath)
}

m := &Manager{
@@ -316,3 +321,26 @@ func CheckMemoryUsage(dirPath string, r *configs.Resources) error {

return nil
}

func (m *Manager) GetEffectiveCPUs() string {
// Fast path.
if m.config.CpusetCpus != "" {
return m.config.CpusetCpus
} else if !strings.HasPrefix(m.dirPath, UnifiedMountpoint) {
return ""
}

// Iterate up the hierarchy until just outside the cgroup root path.
// This is required for containers in which the cpuset controller
// is not enabled; in that case a parent cgroup is used.
outsidePath := filepath.Dir(UnifiedMountpoint)

for path := m.dirPath; path != outsidePath; path = filepath.Dir(path) {
cpus, err := fscommon.GetCgroupParamString(path, "cpuset.cpus.effective")
if err == nil {
return cpus
}
}

return ""
}
4 changes: 4 additions & 0 deletions libcontainer/cgroups/systemd/v1.go
@@ -411,3 +411,7 @@ func (m *LegacyManager) Exists() bool {
func (m *LegacyManager) OOMKillCount() (uint64, error) {
return fs.OOMKillCount(m.Path("memory"))
}

func (m *LegacyManager) GetEffectiveCPUs() string {
return fs.GetEffectiveCPUs(m.Path("cpuset"), m.cgroups)
}
4 changes: 4 additions & 0 deletions libcontainer/cgroups/systemd/v2.go
@@ -514,3 +514,7 @@ func (m *UnifiedManager) Exists() bool {
func (m *UnifiedManager) OOMKillCount() (uint64, error) {
return m.fsMgr.OOMKillCount()
}

func (m *UnifiedManager) GetEffectiveCPUs() string {
return m.fsMgr.GetEffectiveCPUs()
}
4 changes: 4 additions & 0 deletions libcontainer/container_linux_test.go
@@ -69,6 +69,10 @@ func (m *mockCgroupManager) GetFreezerState() (configs.FreezerState, error) {
return configs.Thawed, nil
}

func (m *mockCgroupManager) GetEffectiveCPUs() string {
return ""
}

type mockProcess struct {
_pid int
started uint64