Support for ID map mounts without userns #3943

eiffel-fl · 2023-07-21T12:23:42Z

Hi.

This PR improves the initial support for ID map mounts which was merged in #3717.
Now, it is possible to use ID map mounts without requiring a user namespace.

Regarding the design, I added two new attributes to send the UID and GID mappings corresponding to the mount sources.
I would like to have your opinion on this.

Best regards and thank you in advance.

Commit fda12ab ("Support idmap mounts on volumes") introduced support for idmap mounts on volumes. At that time, it required userns to be used, and the idmap mount had to have the same UID/GID mappings as the userns. This commit removes this requirement, so it is possible to use idmap mounts without userns. You can, of course, still use idmap mounts with userns, but the UID/GID mappings can now be different.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>

eiffel-fl · 2023-07-21T16:21:24Z

I added some tests locally, particularly for the case where several UID/GID mappings are used, and the whole thing hangs.
I think I did something wrong with the messaging, so I will need to investigate this!
Anyway, reviews are still welcome :D!

AkihiroSuda · 2023-07-23T04:12:05Z

cc @rata

AkihiroSuda · 2023-07-23T04:14:02Z

Could you update:

runc/docs/spec-conformance.md, lines 21 to 25 in b4f3891:

    The following features are implemented with some limitations:

    Spec version | Feature                                  | Limitation
    -------------|------------------------------------------|----------------------------------------------------------
    v1.1.0-rc.1  | `.[]mounts.uidMappings`                  | Requires using UserNS with identical uidMappings
    v1.1.0-rc.1  | `.[]mounts.gidMappings`                  | Requires using UserNS with identical gidMappings

lifubang · 2023-07-23T12:28:49Z

libcontainer/nsenter/nsexec.c

+		int gidmap_len = strlen(gidmap_src);
+
+		/* Update child mappings from the parent. */
+		update_uidmap(config->uidmappath, child_pid, uidmap_src, uidmap_len);


Because we don't check for conflicts between idmap entries when len(UIDMappings) > 1, update_uidmap will fail when the uid map data is invalid. However, it uses bail to print the error and call exit, so child_pid is not killed at that point and the whole thing gets stuck. So we should make two changes:

1. validate whether the idmap entries conflict;

2. don't use bail to print the error message in update_uidmap; return an error instead. (

runc/libcontainer/nsenter/nsexec.c, lines 288 to 301 in a7afcd3:

	static void update_uidmap(const char *path, int pid, char *map, size_t map_len)
	{
		if (map == NULL || map_len == 0)
			return;

		write_log(DEBUG, "update /proc/%d/uid_map to '%s'", pid, map);
		if (write_file(map, map_len, "/proc/%d/uid_map", pid) < 0) {
			if (errno != EPERM)
				bail("failed to update /proc/%d/uid_map", pid);
			write_log(DEBUG, "update /proc/%d/uid_map got -EPERM (trying %s)", pid, path);
			if (try_mapping_tool(path, pid, map, map_len))
				bail("failed to use newuid map on %d", pid);
		}
	}

)

I think there are the same problems in update_setgroups and update_gidmap.
So, once we have cloned a new process, we should double-check whether any function we call afterwards can bail.
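For illustration, the conflict check suggested above could look roughly like the following sketch. The `struct id_mapping` type and both function names are hypothetical and are not taken from runc. The sketch assumes that a mapping set is invalid when any two entries overlap on either the container side or the host side, or when an entry has size zero, which is what the kernel is expected to reject when the map is written.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one line of /proc/<pid>/uid_map (or gid_map). */
struct id_mapping {
	uint32_t container_id;
	uint32_t host_id;
	uint32_t size;
};

/*
 * True if [a, a + alen) and [b, b + blen) intersect.
 * The ends are computed in 64 bits so that they cannot wrap around.
 */
static bool ranges_overlap(uint32_t a, uint32_t alen, uint32_t b, uint32_t blen)
{
	uint64_t a_end = (uint64_t)a + alen;
	uint64_t b_end = (uint64_t)b + blen;

	return a < b_end && b < a_end;
}

/*
 * Returns true when no entry is empty and no two entries overlap,
 * neither in the container ID ranges nor in the host ID ranges.
 */
static bool id_mappings_valid(const struct id_mapping *m, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (m[i].size == 0)
			return false;
		for (size_t j = i + 1; j < n; j++) {
			if (ranges_overlap(m[i].container_id, m[i].size,
					   m[j].container_id, m[j].size) ||
			    ranges_overlap(m[i].host_id, m[i].size,
					   m[j].host_id, m[j].size))
				return false;
		}
	}
	return true;
}
```

Running such a check before any process is cloned would let an invalid configuration be reported early, instead of failing while the map is being written.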

Good catch!
I will wait a bit to see the results of the below discussion, but if we decide to go with the current solution I will for sure add a commit to change update_*() to return an error code and bail in the caller.

I noticed this as well. I'm playing around with using PR_SET_PDEATHSIG to avoid us having to remember to kill children...

cyphar

Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to nsexec.c for this feature. We have two other options for how to implement this:

1. Use syscall.ForkExec to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd to rootfs_linux.go.
2. Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than kill(getpid(), SIGSTOP) there should be no async signal-safety issues. This would be faster than option (1).

And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

eiffel-fl

> Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to nsexec.c for this feature. We have two other options for how to implement this:

Can you share some source code regarding your implementation?

> 1. Use `syscall.ForkExec` to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd to `rootfs_linux.go`.
>
> 2. Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than `kill(getpid(), SIGSTOP)` there should be no async signal-safety issues. This would be faster than option (1).

I do not think either of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then do the ID map mount with the child's user namespace.
To do so, using clone() seems to be the only solution we have.

> And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

More generally, I do not understand your comment about adding code in nsexec.c.
The first PR on this topic was done by adding code in this file and no one complained about it because, as I pointed out above, there is no other solution.

cyphar · 2023-07-24T08:43:28Z

> Can you share some source code regarding your implementation?

I'll post a PR once I have it working. I'm also doing some cleanups of nsexec.c, which involves moving the logic for mountfds and idmapped file descriptors out of nsexec.c. The reason for my NACK is not that this one change is somehow bad, but simply that I want us to implement all of these features in Go.

> I do not think either of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then do the ID map mount with the child's user namespace. To do so, using clone() seems to be the only solution we have.

Both options I listed would result in a child process in a new user namespace.

> More generally, I do not understand your comment about adding code in nsexec.c.
> The first PR on this topic was done by adding code in this file and no one complained about it because, as I pointed out above, there is no other solution.

The first version of this PR was already implemented and was piggy-backing on the mountfd code that already existed, so rejecting it for that reason would've been unreasonable. At the time I thought there wasn't another way of doing the mountfd handling, but I'd forgotten that open_tree(OPEN_TREE_CLONE) bypasses the check_mnt() handling that required the opening of the file descriptors be done in the mount namespace -- meaning that we can do mountfd in the Go code.
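To make the open_tree(OPEN_TREE_CLONE) point more concrete, here is a minimal C sketch of the new mount API sequence. It is an illustration under stated assumptions, not the rework itself. The function name `open_idmapped_tree` and the local `struct idmap_mount_attr` (a copy of the kernel's `struct mount_attr` layout, redefined only to avoid depending on recent headers) are hypothetical. The fallback syscall numbers are the generic ones and may differ on some architectures.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers (assumed generic values from the Linux UAPI). */
#ifndef SYS_open_tree
#define SYS_open_tree 428
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1
#endif
#ifndef OPEN_TREE_CLOEXEC
#define OPEN_TREE_CLOEXEC O_CLOEXEC
#endif
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif

/* Local copy of the kernel's struct mount_attr layout (first version). */
struct idmap_mount_attr {
	uint64_t attr_set;
	uint64_t attr_clr;
	uint64_t propagation;
	uint64_t userns_fd;
};

/*
 * Clone `source` into a detached mount and apply the ID mapping of the user
 * namespace referred to by `userns_fd`. Returns a mount fd, or -1 on error.
 */
int open_idmapped_tree(const char *source, int userns_fd)
{
	int fd = (int)syscall(SYS_open_tree, AT_FDCWD, source,
			      OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (fd < 0)
		return -1;

	struct idmap_mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = (uint64_t)userns_fd,
	};

	/* The detached mount is addressed through fd itself (empty path). */
	if (syscall(SYS_mount_setattr, fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The returned descriptor refers to a detached mount, so it would still have to be attached at the target, for example with move_mount(2), by whichever process sets up the container rootfs. The `userns_fd` would typically come from opening /proc/<pid>/ns/user of a process that already has the desired mappings.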

If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

eiffel-fl · 2023-07-24T09:41:44Z

> > Can you share some source code regarding your implementation?
>
> I'll post a PR once I have it working. I'm also doing some cleanups of nsexec.c, which involves moving the logic for mountfds and idmapped file descriptors out of nsexec.c. The reason for my NACK is not that this one change is somehow bad, but simply that I want us to implement all of these features in Go.

No problem, I agree that implementing most of this in Go would be better.

> > I do not think either of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then do the ID map mount with the child's user namespace. To do so, using clone() seems to be the only solution we have.
>
> Both options I listed would result in a child process in a new user namespace.

OK, I think I lack some background there.

> > More generally, I do not understand your comment about adding code in nsexec.c.
> > The first PR on this topic was done by adding code in this file and no one complained about it because, as I pointed out above, there is no other solution.
>
> The first version of this PR was already implemented and was piggy-backing on the mountfd code that already existed, so rejecting it for that reason would've been unreasonable. At the time I thought there wasn't another way of doing the mountfd handling, but I'd forgotten that open_tree(OPEN_TREE_CLONE) bypasses the check_mnt() handling that required the opening of the file descriptors be done in the mount namespace -- meaning that we can do mountfd in the Go code.
>
> If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

Do you have any ETA for your work?
If it should arrive soon, I think it would be better to just close this PR and review/test yours.

rata · 2023-07-24T10:45:52Z

@cyphar

> And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

Oh, cool. While you are at it, maybe we can avoid the fd passing mechanism (if we open the fds before forking/clone/whatever, maybe we can just keep the fd table)? That is something that was on my list since I saw the code, but I'm not sure it will be simple to remove it, crun has it too...

> If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

From a consumer POV (we did this for Kubernetes), we don't need to lift this limitation. So, I'm ok if you plan to work on this other implementation and lift the limitation there. Can you please cc me in the PR? :)

kolyshkin · 2023-08-01T00:37:40Z

Agree with @cyphar -- if we can do it in Go, we should do it in Go.

Overall I very much hope we'll eventually be able to do all of it in Go. For example, with cgroupfd support in the kernel (since v5.7) and golang stdlib (since 1.20), we can enter cgroups way easier.

cyphar · 2023-08-01T01:00:03Z

@rata We can avoid the fd passing mechanism, but I'm a little bit concerned about O_CLOEXEC issues. But then again, by definition these file descriptors are Totally Fine™ to pass through to the container because OPEN_TREE_CLONE descriptors are in a different namespace and thus cannot be used to escape to the host (and they are the mount root of bind-mounts in the container).

@kolyshkin Never say never, and I would love to remove all the C code from our codebase, but I'm not sure if even on newer kernels and with the newest stdlib we will be able to do that (at the very least I don't think Go has handling for newuidmap -- though this could of course be added). I can come up with a list of things to do in a separate issue if you want to have a chat about the problem. For one thing, I think that (for performance and security reasons) we almost certainly want to implement the runc userns creation for mount_setattr(2) in CGO as a (slightly unsafe) fork. CLONE_INTO_CGROUP is something we might want but as I mentioned in #3931, cgroupv2 doesn't migrate memory usage when moving cgroups, so if we use CLONE_INTO_CGROUP we will need to also move the ensure_cloned_binary() logic out of runc init -- though we can always implement in Go so this is probably not that big of a deal.

@eiffel-fl I should have a PoC together this week. I'm doing some other cleanups at the same time, so it might take me a little longer...

eiffel-fl · 2023-08-01T11:19:43Z

> I should have a PoC together this week. I'm doing some other cleanups at the same time, so it might take me a little longer..

OK, just ping me in the PR once you open it, so I can take a look (but there is no rush, take the time you need to polish everything).

cyphar · 2023-08-06T01:59:54Z

#3953 has a working implementation of this, along with several other cleanups to nsexec. There is a single test failure related to criu (which I can't reproduce locally) that I'm debugging, but the code clearly works.

eiffel-fl · 2023-08-07T07:58:53Z

> #3953 has a working implementation of this, along with several other cleanups to nsexec. There is a single test failure related to criu (which I can't reproduce locally) that I'm debugging, but the code clearly works.

I will take a look at it this week!

eiffel-fl · 2023-08-07T13:09:11Z

Closing as #3953 supersedes it.

cyphar · 2023-08-20T14:03:26Z

Reopening to track this properly. This will be fixed by #3985.

cyphar · 2023-08-20T14:03:41Z

Ah, this is a PR. Oops.

eiffel-fl force-pushed the francis/no-userns-idmap branch 5 times, most recently from 6306e71 to 8c3234a Compare July 21, 2023 15:08

eiffel-fl added 2 commits July 21, 2023 18:17

tests/integration: Add tests for idmap mounts with userns

a7afcd3

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>

eiffel-fl force-pushed the francis/no-userns-idmap branch from 8c3234a to a7afcd3 Compare July 21, 2023 16:19

lifubang requested changes Jul 23, 2023


cyphar requested changes Jul 23, 2023


eiffel-fl commented Jul 24, 2023


This was referenced Aug 1, 2023

nsexec: moving as much as we can to Go #3951

Open

nsexec: spring cleaning #3953

Closed

eiffel-fl closed this Aug 7, 2023

cyphar reopened this Aug 20, 2023

cyphar closed this Aug 20, 2023
