[1.1] join the cgroup after the initial setup finished #4439

Open
lifubang wants to merge 2 commits into base: release-1.1

Conversation

@lifubang (Member) commented Oct 11, 2024

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (#4427)

This is because we should try our best to reduce the influence of
memory cgroup accounting from all the runc init processes
before we start the container init process.

With this patch, the impact of memory accounting from
ensure_clone_binary is eliminated.
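
To illustrate the intended ordering, here is a minimal Go sketch; the names (start, waitInit, applyCgroup, sendBootstrapData) are illustrative stand-ins for the real pieces (the initWaiter channel, the cgroup manager Apply call, and the io.Copy of bootstrapData), not the actual implementation in process_linux.go.

package main

// A minimal sketch (not the actual runc code) of the ordering this change
// aims for inside (*initProcess).start. The names waitInit, applyCgroup and
// sendBootstrapData are illustrative stand-ins only.

import "fmt"

func start(waitInit <-chan error, applyCgroup, sendBootstrapData func() error) error {
	// 1. Wait until runc init's initial setup (including the binary clone
	//    done by ensure_clone_binary) has finished, so that its memory use
	//    is not charged to the container's memory cgroup.
	if err := <-waitInit; err != nil {
		return err
	}
	// 2. Join the cgroup before runc init clones any child process, so no
	//    child can escape the cgroup.
	if err := applyCgroup(); err != nil {
		return err
	}
	// 3. Only then send the bootstrap data, which unblocks nsexec and lets
	//    it clone inside the cgroup.
	return sendBootstrapData()
}

func main() {
	waitInit := make(chan error, 1)
	waitInit <- nil // pretend the initial setup already finished
	err := start(waitInit,
		func() error { fmt.Println("joined cgroup"); return nil },
		func() error { fmt.Println("sent bootstrap data"); return nil })
	fmt.Println("err:", err)
}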

akhilerm added a commit to akhilerm/containerd that referenced this pull request Oct 11, 2024
@@ -407,6 +407,13 @@ func (p *initProcess) start() (retErr error) {
}
}()

// We should join the cgroup after the initial setup finished,
// but before runc init clone new children processes. (#4427)
err = <-waitInit
Contributor commented:

A noob question: will this have any performance impact, as the cgroup join and init are now not parallel?

@lifubang (Member, Author) commented Oct 11, 2024

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

So we need finer-grained coordination between runc init and the main process.

Contributor commented:

Good question. I ran a test starting 100 containers:
runc-1.1.15: 3.025s
With this patch: 4.123s

^^ Is this degradation within acceptable limits?

@lifubang (Member, Author) commented:

@opencontainers/runc-maintainers PTAL

@rata (Member) commented Oct 14, 2024

Yes, I don't see why it wouldn't be. 10ms more to start a container seems acceptable; I wouldn't be surprised if we have more noise from the Go GC or other code changes. Am I missing something?

@kolyshkin (Contributor) commented:

IMO this should not be a backport, but rather an original PR targeted for release-1.1 specifically (see #4438 (comment)). In other words, it does not make sense for the main branch to have this code.

For 1.1, this commit makes sense. OR, we can port 0e9a335 and remove initWaiter entirely.

@kolyshkin (Contributor) commented:

OR, we can port commit 0e9a335

Well, that would be hard to do, I guess. So maybe this makes sense to have as a hotfix. But please don't make it a backport, @lifubang.

@lifubang (Member, Author) commented:

Well, that would be hard to do, I guess. So maybe this makes sense to have as a hotfix. But please don't make it a backport

OK, please see my comments in #4438. If you still think it's not worth having in the main branch, I'll open it as a hotfix.

We should join the cgroup after the initial setup has finished,
but before runc init clones new child processes. (opencontainers#4427)
This is because we should try our best to reduce the influence of
memory cgroup accounting from all the runc init processes
before we start the container init process.

Signed-off-by: lifubang <lifubang@acmcoder.com>
@lifubang lifubang changed the title [backport 1.1] join the cgroup after the initial setup finished [1.1] join the cgroup after the initial setup finished Oct 13, 2024
@lifubang lifubang added this to the 1.1.16 milestone Oct 13, 2024
@rata (Member) left a comment

@lifubang LGTM, thanks again for this fix!

I see this is no longer marked as a backport, so the change @kolyshkin requested has been done too :)

@rata (Member) commented Oct 14, 2024

Oh, I just realized this while checking my TODO list. Can you revert #4423 here too? Seeing that it is not needed would help us gain more confidence.

@kolyshkin (Contributor) left a comment

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

diff --git a/libcontainer/process_linux.go b/libcontainer/process_linux.go
index ac3b104e..da4db4e2 100644
--- a/libcontainer/process_linux.go
+++ b/libcontainer/process_linux.go
@@ -407,6 +407,14 @@ func (p *initProcess) start() (retErr error) {
                }
        }()
 
+       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
+               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
+       }
+       err = <-waitInit
+       if err != nil {
+               return err
+       }
+
        // Do this before syncing with child so that no children can escape the
        // cgroup. We don't need to worry about not doing this and not being root
        // because we'd be using the rootless cgroup manager in that case.
@@ -418,14 +426,6 @@ func (p *initProcess) start() (retErr error) {
                        return fmt.Errorf("unable to apply Intel RDT configuration: %w", err)
                }
        }
-       if _, err := io.Copy(p.messageSockPair.parent, p.bootstrapData); err != nil {
-               return fmt.Errorf("can't copy bootstrap data to pipe: %w", err)
-       }
-       err = <-waitInit
-       if err != nil {
-               return err
-       }
-
        childPid, err := p.getChildPid()
        if err != nil {
                return fmt.Errorf("can't get final child's PID from pipe: %w", err)

@kolyshkin (Contributor) commented:

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit.

I did some simple testing (similar to what @lifubang did). Compared to the current runc (1.1.15), I'm seeing ~40% higher times without the suggested change, and ~25% higher with it. So, yes, it helps a bit.

@lifubang (Member, Author) commented Oct 15, 2024

Oh, I just realized this while checking my TODO list. Can you revert #4423 here too? Seeing that it is not needed would help us gain more confidence.

How about keeping this?
I think #4020 can also be reverted in the main branch, but that needs more testing; we should do that revert in the main branch first.
Keep in mind that we needed #4020 because of #3931; at that time we had not yet moved the binary clone from runc init to the runc parent process.

@lifubang (Member, Author) commented Oct 15, 2024

I think we can gain some of the lost speed back by moving io.Copy of the bootstrapData together with waitInit. IOW:

Maybe the suggested change would cause a new race?
In fact, besides using bootstrapData to transfer data between runc init and the runc parent process, we also use it as a synchronization mechanism to ensure that the runc parent process has put the runc init process into the cgroup before runc init clones new child processes. Please see:

// Do this before syncing with child so that no children can escape the
// cgroup. We don't need to worry about not doing this and not being root
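
For illustration, here is a small, self-contained Go sketch of this pipe-as-a-barrier pattern (simplified and not the actual runc code, where the synchronization is split between process_linux.go and nsexec):

package main

// Simplified illustration (not the actual runc code) of using the bootstrap
// data pipe as a barrier: the "child" blocks reading the pipe and only
// proceeds to "clone" after the parent has joined it to the cgroup and then
// written the bootstrap data.

import (
	"fmt"
	"io"
	"os"
)

func main() {
	r, w, err := os.Pipe()
	if err != nil {
		panic(err)
	}

	done := make(chan struct{})
	go func() { // stands in for nsexec in the runc init process
		defer close(done)
		data, _ := io.ReadAll(r) // blocks until the parent writes and closes
		fmt.Printf("child: got %d bytes of bootstrap data, cloning now\n", len(data))
	}()

	// Parent: join the cgroup first...
	fmt.Println("parent: put the child into the cgroup")
	// ...and only then send the bootstrap data, which releases the child.
	w.Write([]byte("bootstrap"))
	w.Close()
	<-done
}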

@lifubang (Member, Author) commented Oct 15, 2024

Maybe we can add another wait before ‘runc init’ clones a new child process, but I don’t know whether it’s worth doing.

EDIT: I tested it; there is no benefit, so it’s not worth doing.

@lifubang (Member, Author) commented Oct 15, 2024

Furthermore, maybe we could put only the last ‘runc init’ process into the cgroup, because it is the real container process. But the kernel returns an error when putting a process into a cgroup after it has joined a new pid/cgroup namespace.

EDIT: There seems to be no good way to work around this error, so it’s also not worth doing.

@lifubang (Member, Author) commented:

It seems there is no other way to speed this up, and it now has the same speed as the main branch.
I think we should focus on how to improve the speed in the main branch.

@lifubang (Member, Author) commented:

I think #4020 can also be reverted in the main branch, but that needs more testing; we should do that revert in the main branch first.

The revert PR for the main branch is #4446; it seems to work fine.

As we will fix the race between the binary clone and the cgroup join, we can eliminate the
impact of memory accounting from ensure_clone_binary, so runc will keep its memory
usage as low as before.

This reverts commit 719e2bc.

Signed-off-by: lifubang <lifubang@acmcoder.com>
@lifubang (Member, Author) commented:

Can you revert #4423 here too?

I have reverted it.

@rata (Member) commented Oct 15, 2024

@kolyshkin I agree with @lifubang here. If we move it as you proposed, then nsexec has all the info it needs to parse the netlink data and might fork before we add it to the cgroup. Or is there some other point that would stop this race from happening?

I really don't see why this perf impact is a big issue. Isn't it about 10ms per container, and therefore within the noise of the Go GC or other things we do?

@cyphar (Member) commented Oct 16, 2024

Maybe we can just backport the code that moved the copying logic rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x? Or we can revert the removal of bindfd from 1.1.x.

(Sorry for not catching this when doing the original patch or the backport...)

@lifubang (Member, Author) commented Oct 16, 2024

rather than trying to come up with a bespoke fix for 1.1.x that isn't in 1.2.x?

In fact, this should also have been fixed in the main branch after #3931 was merged, instead of with #4020, and then backported to release-1.1.
But I'm sorry that when I wrote #4020, I didn't catch this race.
From this perspective, it's reasonable to fix it in release-1.1.

As you say, there are two other ways to fix this race:

  1. Backport the code that moved the copying logic from runc init to the runc parent process;
  2. Revert "[1.1] nsenter: cloned_binary: remove bindfd logic entirely" (#4392).

I think either is fine; it's just a matter of choice.

muicoder pushed commits to muicoder/runc that referenced this pull request Oct 16, 2024
@rata (Member) commented Oct 16, 2024

@cyphar bindfd is causing a lot of real issues for us in 1.1. That is not a good way forward :(

Those other changes are huge to backport; it is quite a risky backport IMHO, and maybe time-consuming too. Why don't we go with this trivial PR, which fixes exactly what we need? containerd at least (and maybe other projects too) is waiting for this fix. And we can aim to release 1.2 soon.

If we fail to release 1.2 soon, we can consider that backport, but without the rush of having users waiting for a fix.

What do you think?
