Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for ID map mounts without userns #3943

Closed

Conversation

eiffel-fl
Copy link
Contributor

Hi.

This PR improves the initial support for ID map mounts which was merged in #3717.
Now, it is possible to use ID map mounts without the need of having a user namespace.

Regarding the design, I added two new attribute to send the UID and GID mappings corresponding to the mount sources.
I would like to have your opinion on this.

Best regards and thank you in advance.

@eiffel-fl eiffel-fl force-pushed the francis/no-userns-idmap branch 5 times, most recently from 6306e71 to 8c3234a Compare July 21, 2023 15:08
Commit fda12ab ("Support idmap mounts on volumes") introduces support for
idmap mount on volumes.
At this time, it required userns to be used and idmap mount must have the same
UID/GID mappings than userns.
This commit removes this requirement, so it is possible to use idmap mount
without userns.
You can, of course, use id map mount with userns, but the UID/GID mappings can
be different.

Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
@eiffel-fl
Copy link
Contributor Author

I added some tests locally particularly when several UID/GID mappings are used and the whole thing hang.
I think I did something wrong with the messaging, so I will need to investigate this!
Anyway, reviews are still welcome :D!

@AkihiroSuda
Copy link
Member

cc @rata

@AkihiroSuda
Copy link
Member

Could you update:

The following features are implemented with some limitations:
Spec version | Feature | Limitation
-------------|------------------------------------------|----------------------------------------------------------
v1.1.0-rc.1 | `.[]mounts.uidMappings` | Requires using UserNS with identical uidMappings
v1.1.0-rc.1 | `.[]mounts.gidMappings` | Requires using UserNS with identical gidMappings

int gidmap_len = strlen(gidmap_src);

/* Update child mappings from the parent. */
update_uidmap(config->uidmappath, child_pid, uidmap_src, uidmap_len);
Copy link
Member

@lifubang lifubang Jul 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't check the conflict of idmap items when len(UIDMappings) > 1, so will get an error in update_uidmap when the uid map data is invalid, but it uses bail to print the error and call exit, so it will not kill child_pid at that time and will got stuck. So we should make two changes:

  1. validate there is a conflict of idmap items or not;
  2. don't use bail to print error msg and return an error in update_uidmap. (
    static void update_uidmap(const char *path, int pid, char *map, size_t map_len)
    {
    if (map == NULL || map_len == 0)
    return;
    write_log(DEBUG, "update /proc/%d/uid_map to '%s'", pid, map);
    if (write_file(map, map_len, "/proc/%d/uid_map", pid) < 0) {
    if (errno != EPERM)
    bail("failed to update /proc/%d/uid_map", pid);
    write_log(DEBUG, "update /proc/%d/uid_map got -EPERM (trying %s)", pid, path);
    if (try_mapping_tool(path, pid, map, map_len))
    bail("failed to use newuid map on %d", pid);
    }
    }
    )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are the same problems in update_setgroups and update_gidmap.
So, when we clone a new process, we should double check whether there is a bail or not when we call a function.

Copy link
Contributor Author

@eiffel-fl eiffel-fl Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!
I will wait a bit to see the results of the below discussion, but if we decide to go with the current solution I will for sure add a commit to change update_*() to return an error code and bail in the caller.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed this as well. I'm playing around with using PR_SET_PDEATHSIG to avoid us having to remember to kill children...

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to nsexec.c for this feature. We have two other options for how to implement this:

  1. Use syscall.ForkExec to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd to rootfs_linux.go.
  2. Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than kill(getpid(), SIGSTOP) there should be no async signal-safety issues. This would be faster than option (1).

And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

Copy link
Contributor Author

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to nsexec.c for this feature. We have two other options for how to implement this:

Can you share some source code regarding your implementation?

1. Use `syscall.ForkExec` to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd to `rootfs_linux.go`.

2. Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than `kill(getpid(), SIGSTOP)` there should be no async signal-safety issues. This would be faster than option (1).

I do not think any of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then doing do the ID map mount with the children user namespace.
To do so, using clone() seems to be the only solution we have.

And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

More generally, I do not understand your comment about adding code in nsexec.c.
The first PR of this topic was done by adding code in this file and no one complained about it as, as I pointed above, there are no other solution.

@cyphar
Copy link
Member

cyphar commented Jul 24, 2023

Can you share some source code regarding your implementation?

I'll post a PR once I have it working. I'm also doing some cleanups of nsexec.c, which involves moving the logic for mountfds and idmapped file descriptors out of nsexec.c. The reason for my NACK is not that this one change is somehow bad, but simply that I want us to implement all of these features in Go.

I do not think any of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then doing do the ID map mount with the children user namespace. To do so, using clone() seems to be the only solution we have.

Both options I listed would result in a child process in a new user namespace.

More generally, I do not understand your comment about adding code in nsexec.c.
The first PR of this topic was done by adding code in this file and no one complained about it as, as I pointed above, there are no other solution.

The first version of this PR was already implemented and was piggy-backing on the mountfd code that already existed, so rejecting it for that reason would've been unreasonable. At the time I thought there wasn't another way of doing the mountfd handling, but I'd forgotten that open_tree(OPEN_TREE_CLONE) bypasses the check_mnt() handling that required the opening of the file descriptors be done in the mount namespace -- meaning that we can do mountfd in the Go code.

If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

@eiffel-fl
Copy link
Contributor Author

Can you share some source code regarding your implementation?

I'll post a PR once I have it working. I'm also doing some cleanups of nsexec.c, which involves moving the logic for mountfds and idmapped file descriptors out of nsexec.c. The reason for my NACK is not that this one change is somehow bad, but simply that I want us to implement all of these features in Go.

No problem, I share your mind regarding coding the most of thing in Golang would be better.

I do not think any of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then doing do the ID map mount with the children user namespace. To do so, using clone() seems to be the only solution we have.

Both options I listed would result in a child process in a new user namespace.

OK, I think I lack some background there.

More generally, I do not understand your comment about adding code in nsexec.c.
The first PR of this topic was done by adding code in this file and no one complained about it as, as I pointed above, there are no other solution.

The first version of this PR was already implemented and was piggy-backing on the mountfd code that already existed, so rejecting it for that reason would've been unreasonable. At the time I thought there wasn't another way of doing the mountfd handling, but I'd forgotten that open_tree(OPEN_TREE_CLONE) bypasses the check_mnt() handling that required the opening of the file descriptors be done in the mount namespace -- meaning that we can do mountfd in the Go code.

If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

Do you have any ETA for your work?
If it should arrive soon, I then think it would just be better to close this PR and review/test yours.

@rata
Copy link
Member

rata commented Jul 24, 2023

@cyphar

And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c as well.

Oh, cool. While you are at it, maybe we can avoid the fd passing mechanism (if we open the fds before forking/clone/whatever, maybe we can just keep the fd table)? That is something that was on my list since I saw the code, but I'm not sure it will be simple to remove it, crun has it too...

If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed.

From a consumer POV (we did this for Kubernetes), we don't need to lift this limitation. So, I'm ok if you plan to work on this other implementation and lift the limitation there. Can you please cc me in the PR? :)

@kolyshkin
Copy link
Contributor

Agree with @cyphar -- if we can do it in Go, we should do it in Go.

Overall I very much hope we'll eventually be able to do all of it in Go. For example, with cgroupfd support in the kernel (since v5.7) and golang stdlib (since 1.20), we can enter cgroups way easier.

@cyphar
Copy link
Member

cyphar commented Aug 1, 2023

@rata We can avoid the fd passing mechanism, but I'm a little bit concerned about O_CLOEXEC issues. But then again, by definition these file descriptors are Totally Fine:tm: to pass through to the container because OPEN_TREE_CLONE descriptors are in a different namespace and thus cannot be used to escape to the host (and they are the mount root of bind-mounts in the container).

@kolyshkin Never say never, and I would love to remove all the C code from our codebase, but I'm not sure if even on newer kernels and with the newest stdlib we will be able to do that (at the very least I don't think Go has handling for newuidmap -- though this could of course be added). I can come up with a list of things to do in a separate issue if you want to have a chat about the problem. For one thing, I think that (for performance and security reasons) we almost certainly want to implement the runc userns creation for mount_setattr(2) in CGO as a (slightly unsafe) fork. CLONE_INTO_CGROUP is something we might want but as I mentioned in #3931, cgroupv2 doesn't migrate memory usage when moving cgroups, so if we use CLONE_INTO_CGROUP we will need to also move the ensure_cloned_binary() logic out of runc init -- though we can always implement in Go so this is probably not that big of a deal.

@eiffel-fl I should have a PoC together this week. I'm doing some other cleanups at the same time, so it might take me a little longer...

@eiffel-fl
Copy link
Contributor Author

I should have a PoC together this week. I'm doing some other cleanups at the same time, so it might take me a little longer..

OK, just ping me in the PR once you open it, so I can take a look (but no emergency, take the time you need to polish everything).

@cyphar
Copy link
Member

cyphar commented Aug 6, 2023

#3953 has a working implementation of this, along with several other cleanups to nsexec. There is a single test failure related to criu (which I can't reproduce locally) that I'm debugging, but the code clearly works.

@eiffel-fl
Copy link
Contributor Author

#3953 has a working implementation of this, along with several other cleanups to nsexec. There is a single test failure related to criu (which I can't reproduce locally) that I'm debugging, but the code clearly works.

I will take a look at it this week!

@eiffel-fl
Copy link
Contributor Author

Closing as #3953 supersedes it.

@eiffel-fl eiffel-fl closed this Aug 7, 2023
@cyphar
Copy link
Member

cyphar commented Aug 20, 2023

Reopening to track this properly. This will be fixed by #3985.

@cyphar cyphar reopened this Aug 20, 2023
@cyphar
Copy link
Member

cyphar commented Aug 20, 2023

Ah, this is a PR. Oops.

@cyphar cyphar closed this Aug 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants