Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --overlay and related options #547

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

rhendric
Copy link

@rhendric rhendric commented Jan 6, 2023

This commit adds --overlay, --tmp-overlay, --ro-overlay, and --overlay-src options to enable bubblewrap to create overlay mounts. These options are only permitted when bubblewrap is not installed setuid.


This is a continuation/partial rewrite of #167, addressing the feedback given there.

Other improvements of note:

  • Tests!
  • Characters treated specially by overlayfs are escaped; there are now no (known?) restrictions on the characters that can be in an overlay path.
  • --tmp-overlay is a new contribution for a use case I frequently have: I want to mount an overlay that is writable but I don't want to persist the writes between runs.

Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Signed-off-by to your commit(s) to indicate that you agree to the Developer Certificate of Origin terms.

@@ -332,6 +340,11 @@ usage (int ecode, FILE *out)
" --ro-bind SRC DEST Bind mount the host path SRC readonly on DEST\n"
" --ro-bind-try SRC DEST Equal to --ro-bind but ignores non-existent SRC\n"
" --remount-ro DEST Remount DEST as readonly; does not recursively remount\n"
" --overlay-src SRC Read files from SRC in the following overlay\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlayfs terminology for this seems to be that it is the "lower directory". I think it might be better to stick to the overlayfs terms instead of inventing our own, perhaps like this:

--overlay-lower LOWEST [--overlay-lower LOWER...] --overlay UPPER WORKDIR DEST
--overlay-lower LOWEST [--overlay-lower LOWER...] --ro-overlay DEST
--overlay-lower LOWEST [--overlay-lower LOWER...] --tmp-overlay DEST

Alternatively, if we are going to swap the order so that it's the same as lowerdir (uppermost first, lowest last), perhaps it could look like this, with the first argument "opening a transaction" and --overlay-mount "committing" it?

# Read-write
--overlay-upper UPPER WORKDIR [--overlay-lower LOWER...] --overlay-lower LOWEST --overlay-mount DEST
# Read-only
--ro-overlay [--overlay-lower LOWER...] --overlay-lower LOWEST --overlay-mount DEST
# Shortcut for tmpfs upper
--tmpfs-overlay [--overlay-lower LOWER...] --overlay-lower LOWEST --overlay-mount DEST

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, so the decision to order bottom-first and the decision to change the terms used were linked.

I started with something very similar to what you're suggesting here, reusing overlayfs terminology and ordering. Moving away from the overlayfs syntax (colon-separated paths, as used in #167) prompted me to rethink this. Everything else in the bubblewrap CLI is more consistent with bottom-first:

  • The top of the documentation states that later options win out over earlier ones unless otherwise specified.
  • Filesystem operations proceed from left to right, so first you bind / and then you bind top-level directories and then directories inside that and so on.

For people not already used to thinking in overlayfs option strings, I thought that starting at the bottom would feel more of a piece with the rest of bubblewrap.

And having chosen that, I didn't want to have to explain in the documentation that while the kernel expects ‘lower directories’ to be provided top-first, we expect ‘lower directories’ bottom-first—the inexperienced reader would be learning about two things, only one of which is relevant to their use of bubblewrap. I thought—and I feel less confident in this choice than the other one—that changing the terminology would cut that knot. I figured it would suggest to people that know overlayfs that we aren't directly exposing the overlayfs lowerdir property that they're used to, without adding a distraction for the people who don't know overlayfs and just want to learn how to use bubblewrap. I went with src/source because that's the term already consistently used in the bubblewrap documentation to refer to host paths that are being mounted somehow into the container.

This is all pretty superficial and I don't want to fight you too hard on it; I'd rather have either of these designs implemented than none of them. Let me know if you disagree with my reasoning and I'll revise it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reasoning does make sense, I'll think some more about this.

bubblewrap.c Outdated
Comment on lines 3327 to 3328
char dirname[32];
snprintf (dirname, 32, "tmp-overlay-upper-%d", i);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer this to use xasprintf so we don't have to think about whether the magic number 32 is sufficient.

Comment on lines +305 to +313
This option can be used multiple times to provide multiple sources.
The sources are overlaid from bottom to top: if a given path to be
read exists in more than one source, the file is read from the last
such source specified.
Copy link
Collaborator

@smcv smcv Jan 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right to think that this is the opposite of the order seen in mount -o lowerdir? If it is, then I think the documentation should say so, perhaps something like this:

Suggested change
This option can be used multiple times to provide multiple sources.
The sources are overlaid from bottom to top: if a given path to be
read exists in more than one source, the file is read from the last
such source specified.
This option can be used multiple times to provide multiple lower
directories, with the lowest directory specified first and the
uppermost directory last. If the same path exists in more than
one of the directories, the file is read from the last directory
specified. Note that this is the reverse of the order used in
the <literal>lowerdir</literal> mount option.

But I think it might be less confusing if we order the arguments from uppermost to lowest, consistent with the order used in mount options.

bubblewrap.c Outdated
@@ -1159,6 +1178,87 @@ privileged_op (int privileged_op_socket,
}
}

struct _StringBuilder
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better in utils.c so it can be unit-tested.

bubblewrap.c Outdated
int len;

va_start (args, fmt);
len = vsnprintf (dest->str + dest->offset, dest->size - dest->offset, fmt, args);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
len = vsnprintf (dest->str + dest->offset, dest->size - dest->offset, fmt, args);
len = vsnprintf (dest->str + dest->offset, dest->size - dest->offset, fmt, args);
if (len < 0)
die_with_error ("vsnprintf");

@@ -523,4 +523,100 @@ echo "PWD=$(pwd -P)" > reference
assert_files_equal stdout reference
echo "ok - environment manipulation"

if test -n "${bwrap_is_suid:-}"; then
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part will also need to be skipped on older kernels where unprivileged mounting of overlayfs is not allowed. I'm not immediately sure what the best way to detect that is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me neither. It seems like different distributions may have started supporting this at different times—I've seen some posts that suggest Ubuntu was an early adopter—so I don't think a simple uname -r version check would suffice. But I don't know a good way of detecting this capability other than trying it, which is a real pain without a tool like, well, bubblewrap. 😐

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added what I hope is the right check for this. I checked that the ‘no kernel support’ branch is reached on a VM installed from alpine-virt-3.14.9-x86_64.

Copy link
Collaborator

@smcv smcv Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks plausible, at least. I should probably check this in my Debian CI environment (which involves an assortment of different kernel versions and container technologies) before merging, to make sure we don't have a bug similar to the one fixed in #554.

@smcv
Copy link
Collaborator

smcv commented Jan 6, 2023

The commit message should have Resolves: #412 or similar, to link it up to the feature request that you're implementing here.

@rhendric rhendric force-pushed the rhendric/overlayfs branch 2 times, most recently from 742d3af to f995ca0 Compare January 6, 2023 20:14
@rhendric rhendric force-pushed the rhendric/overlayfs branch 2 times, most recently from bc8044a to 23ff0f8 Compare March 5, 2023 00:24
@rhendric
Copy link
Author

rhendric commented Mar 5, 2023

The only outstanding thread here should be getting feedback on my reasoning for option ordering and naming. Please let me know what you think and if there's anything I can do to move this forward.

Copy link
Collaborator

@smcv smcv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating this. A re-review is on my list, but the PR is non-trivial and my list is not short.

@@ -332,6 +340,11 @@ usage (int ecode, FILE *out)
" --ro-bind SRC DEST Bind mount the host path SRC readonly on DEST\n"
" --ro-bind-try SRC DEST Equal to --ro-bind but ignores non-existent SRC\n"
" --remount-ro DEST Remount DEST as readonly; does not recursively remount\n"
" --overlay-src SRC Read files from SRC in the following overlay\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That reasoning does make sense, I'll think some more about this.

@@ -523,4 +523,100 @@ echo "PWD=$(pwd -P)" > reference
assert_files_equal stdout reference
echo "ok - environment manipulation"

if test -n "${bwrap_is_suid:-}"; then
Copy link
Collaborator

@smcv smcv Mar 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks plausible, at least. I should probably check this in my Debian CI environment (which involves an assortment of different kernel versions and container technologies) before merging, to make sure we don't have a bug similar to the one fixed in #554.

@joanbm
Copy link

joanbm commented Apr 10, 2023

I ran into a race when running two bwrap processes consecutively using the same overlay, and the kernel has CONFIG_OVERLAY_FS_INDEX=y (this is the default on Arch Linux).

On my Arch Linux system, this example:

#!/usr/bin/env sh
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

Fails with this error:

bwrap: Can't make overlay mount on /newroot/home/zealcharm/bwrap_ofs_cleanup_test with options upperdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/rw,workdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test/wrk,lowerdir=/oldroot/home/zealcharm/bwrap_ofs_cleanup_test: Device or resource busy

And the error on dmesg is:

overlayfs: upperdir is in-use as upperdir/workdir of another mount, mount with '-o index=off' to override exclusive upperdir protection.

Inserting a short sleep or a sync in between thebwrap fixes the race.


Cause

I have taken a look and I believe that the cause is: Since I'm using --unshare-pid (without --as-pid-1), bwrap launches a reaper process as PID 1. The monitor bwrap process quits immediately after PID 2 (dd in the example above) terminates, but the PID 1 takes a while longer to terminate (it is visible in ps aux), so the overlayfs mount still lingers a little longer.

Trying --sync-fd

I tried using --sync-fd (I think that's the right way to use it, apologies if not) but it doesn't seem to fix it, the race still happens:

#!/usr/bin/env bash
set -eu

rm -rf $HOME/bwrap_ofs_cleanup_test
mkdir $HOME/bwrap_ofs_cleanup_test && cd $HOME/bwrap_ofs_cleanup_test
mkdir -p rw wrk

syncfile="$(mktemp)"
exec {syncfd_for_bwrap}<> "$syncfile"
exec {syncfd_for_wait}<> "$syncfile"
rm -f "$syncfile"
flock -x "$syncfd_for_bwrap"
bwrap --unshare-user --unshare-pid --sync-fd "$syncfd_for_bwrap" --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  dd if=/dev/zero of=dummy bs=1M count=10 status=none
exec {syncfd_for_bwrap}>&-
flock -x "$syncfd_for_wait"

bwrap --unshare-user --unshare-pid --dev-bind / / --overlay-src "$PWD" --overlay "$PWD"/rw "$PWD"/wrk "$PWD" \
  echo SUCCESS

I think the problem is the same, there's nothing that guarantees that the sync FD is closed after all mounts are torn down.

Waiting for PID 1 to terminate

I tried waiting for PID 1 to terminate, and this seems to work:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..e53f04b 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -560,6 +560,7 @@ monitor_child (int event_fd, pid_t child_pid, int setup_finished_fd)
             {
               exitc = (int) val - 1;
               report_child_exit_status (exitc, setup_finished_fd);
+              waitpid (child_pid, NULL, 0);
               return exitc;
             }
         }

However, I'm not sure if this is a bulletproof fix, and it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Workaround

For now I'm explicitly disabling the inode index to avoid running into the issue:

diff --git a/bubblewrap.c b/bubblewrap.c
index 5e90e5e..a177c75 100644
--- a/bubblewrap.c
+++ b/bubblewrap.c
@@ -1268,6 +1268,7 @@ setup_newroot (bool unshare_pid,
             if (mkdir (dest, 0755) != 0 && errno != EEXIST)
               die_with_error ("Can't mkdir %s", op->dest);
 
+            strappend (&sb, "index=off,");
             if (op->source != NULL)
               {
                 strappend (&sb, "upperdir=/oldroot");

@rhendric
Copy link
Author

it is not a proper fix since this will also cause bwrap to wait until all backgrounded processes finish instead of just the main process.

Honestly, this is what I thought bwrap already did. Is there a reason this isn't an improvement?

Maybe it could be enabled by a separate flag? I would imagine that waiting for everything in the sandbox to be well and truly finished is a feature that would be useful in other cases.

@joanbm
Copy link

joanbm commented Apr 10, 2023

TBH I thought the same, but you can see that that's not the case by running bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' then ps aux.

I guess it could be useful if you are launching something that daemonizes, but it's going to be trouble for anything that locks a resource like the overlayfs inode index.

So I also think that having flag(s) to wait or kill all processes instead of just for the main process could be useful for those scenarios.

Also, I suppose the waitpid solution above can be implemented without breaking compatibility by doing it only if just PID 1 remains after the main process exits.

@rusty-snake
Copy link
Contributor

doing it only if just PID 1 remains after the main process exits.

FWIW: if pid1 exists (no matter how) everything in the pid namespace will get a SIGKILL. I.e. there is no way pid1 has exited and deamons are still running.

Of course you only have a pid1 when you use unshare-pid.

@smcv
Copy link
Collaborator

smcv commented Apr 11, 2023

if pid1 exists

You mean if pid 1 exits, I think (unfortunately either word makes sense here, but only one is correct).

Yes, if we are unsharing the pid namespace for the sandbox, when pid 1 within that namespace exits (that's the user-specified process if you used --as-pid1, or bubblewrap's reaper process otherwise), everything else in the sandbox's pid namespace is killed (see pid_namespaces(7)).

However, I don't know whether they are killed synchronously before pid 1 is allowed to exit, or whether pid 1 can exit while the overlayfs resource is still in use...

Maybe [waiting for pid 1 to exit] could be enabled by a separate flag?

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it, analogous to docker run somecontainer sleep infinity followed by one or more docker exec to poke commands into the container.

@rhendric
Copy link
Author

If there is any situation in which the reaper process would not exit promptly, then that would make sense as an opt-in thing. Or if the reaper process would always exit promptly in any case, then it might as well be on by default, in all cases where the reaper process gets run. (In situations where the reaper is not run, of course we must not wait for it.)

If I understand the relevant terms correctly, we know that the reaper process doesn't exit until all children have terminated (the bwrap --unshare-all --dev-bind / / sh -c 'sleep 50 &' example demonstrates this). So having an opt-in flag for waiting would be beneficial, and it sounds like it would resolve the issue at hand without being coupled to the rest of the --overlay work. I propose tracking that as a separate issue/PR.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it. A user would have similar problems if background processes in the sandbox held any other sort of host resource past the lifetime of the parent sandbox command—I don't think we'd want to tell such users to disable any features their sandboxed programs are using that lock resources, as a long-term solution.

@joanbm
Copy link

joanbm commented Apr 28, 2023

Or, you might find that your use-case is better served by creating a long-running sandbox containing an IPC server, and then poking commands into it

I found the problem while trying to create a sandbox wrapper over an existing command line tool (makepkg, Arch Linux's package builder), with the idea that it works as a drop-in replacement for the original tool. Then, I have some scripts that run makepkg a few times consecutively that I switched to the wrapper, so that's how I ran into the issue. And sure, creating and tearing down the sandbox every time is suboptimal, but it's simple and works fast enough.

Perhaps my specific use case is a bit obscure, but I think the idea of being able to use overlays in drop-in tool wrappers that are used from scripts is something that ought to work.

So having an opt-in flag for waiting would be beneficial, [...]. I propose tracking that as a separate issue/PR.

I agree that creating a new issue/PR is the right approach, ultimately the cause of the problem I ran into is not related to this PR, but rather a more general issue with the exit/cleanup code and locked resources, and the conditions to run into it are pretty specific.

If this is likely to be a practical problem, it might make sense to have an --overlay-index [on|off] option to overrule the kernel default?

It might indeed make sense to have this sort of option (as well as for the other options overlayfs exposes—there are a few of them), but I'm reluctant to design and implement them without an actual need. Turning off the index for this issue is more of a workaround for not having --wait-for-pid-1 functionality (or whatever it should be called), at least as I understand it.

Agreed, I think bwrap should strive for simplicity and turning off the overlayfs index is just a workaround. If someone needs to tune their mounts for some specific use case, this can be done like in #412 (comment).

utils.c Show resolved Hide resolved
utils.c Show resolved Hide resolved
va_start (args, fmt);
len = vsnprintf (dest->str + dest->offset, dest->size - dest->offset, fmt, args);
va_end (args);
if (len < 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could even add

Suggested change
if (len < 0)
if (len < 0 || dest->offset + len >= dest->size)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One could, but why? The added test will always be true if nothing mucks around with the internals. This die_with_error is only here in case vsnprintf blows up.

bubblewrap.c Show resolved Hide resolved
@swick
Copy link
Contributor

swick commented Sep 22, 2023

A few nits but this looks great!

Just from a simplicity perspective I think it would be easier to follow what's going on if the ops were added directly and then processed in setup_newroot by, e.g. prending them to a list and then walk the list.

The other big thing is how a user like flatpak should test for availability of this feature. How is this done with other features?

@rhendric
Copy link
Author

@smcv, in #596 (comment) you said ‘solving its remaining issues’. As far as I'm aware the only outstanding issue here is your decision on whether the design choices we discussed in #547 (comment) are acceptable. Is there anything else for anyone to do? Let me know and I'll be happy to work on it.

@swick
Copy link
Contributor

swick commented Sep 23, 2023

FWIW, the order as implemented right now makes a lot of sense when you build the command line for it from a program like flatpak.

@swick
Copy link
Contributor

swick commented Sep 24, 2023

I have a flatpak branch now which is basically doing --overlay-src $path-to-runtime-src --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc and any subsequent attempt to mount something onto existing directories in /etc fails with a permission denied error because the directories are owned by nobody.

Letting bwrap chown the directory doesn't work and fails with EINVAL.

For me --tmp-overlay is not useful. What works here is to either add the file directly in a new upper layer:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --tmp-overlay /etc --file 14 /etc/existing-dir/file

or just bind mounting it without the tmp overlay like this:

--overlay-src $path-to-runtime-src --overlay-src $masking-dir --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --bind-data 14 /etc/existing-dir/file

In both cases $masking-dir contains a the file /etc/existing-dir/file with workable uid:gid.

Unfortunately the masking dir layer is on the oldroot. What I want, really is a way to add a layer from the newroot:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --file 14 /run/etc-overlay/existing-dir/file

This also works for bind mounting:

--overlay-src $path-to-runtime-src --overlay-src-newroot /run/etc-overlay --overlay-src $path-to-flatpak-helper-monitor --ro-overlay /etc --dir /run/etc-overlay/mount-point --bind $some-mount /etc/mount-point

Long story short, I think an option to create an overlayfs layer from a directory from the newroot would be really helpful.

@rhendric
Copy link
Author

rhendric commented Oct 6, 2023

Honestly, if the rules have changed between kernels, we should just reject strings that need escaping (and therefore would potentially have different meanings under different kernels). If we do that, we have the option of allowing them later (after the new semantics are widespread), and it would not be an incompatible change to do so.

Fair enough. The overlayfs driver handles : (because it's the layer separator) and \ (escape character) specially, and that hasn't changed across kernel versions (to date); if we escape those then , is the only character we need to reject, if we continue to use mount(2). Is that good, or do you want to reject : and \ as well and do no escaping whatsoever?

@swick
Copy link
Contributor

swick commented Oct 6, 2023

if we escape those then , is the only character we need to reject

Seems fine.

I don't think it's worth it to argue that the kernel should fix the regression then. If someone else runs into it and makes them fix it then good for us and we can lift the restriction but I don't think I'll spend time on arguing here.

@smcv
Copy link
Collaborator

smcv commented Oct 6, 2023

I'd prefer to reject : and \ until/unless we have clarity on whether the rules are stable.

@swick
Copy link
Contributor

swick commented Oct 6, 2023

Alright, pushing them a bit on the mailing list then.

@rhendric
Copy link
Author

rhendric commented Oct 7, 2023

I'd prefer to reject : and \ until/unless we have clarity on whether the rules are stable.

The response we got for those characters specifically is:

It has been this way for a long time. I see no reason for it to change in the future.

Is that enough assurance to proceed with escaping those two?

@swick
Copy link
Contributor

swick commented Oct 12, 2023

Seems like the regression is getting fixed after all. https://lore.kernel.org/linux-unionfs/20231012134428.1874373-1-amir73il@gmail.com/

@rhendric
Copy link
Author

Good news in the long-ish term (whatever window of time, if any, exists between ‘6.5 is too old for bwrap to care about supporting’ and ‘bwrap is ported to use the new mount API, in order to support filesystem user mapping or whatever’), but unfortunately that doesn't change anything for this PR.

@swick
Copy link
Contributor

swick commented Oct 16, 2023

6.5 is supposedly getting fixed as well. If the fix lands in 6.5 it should be fine to go ahead with your original escaping code.

@patlefort
Copy link

I'm running into this issue: bwrap --overlay-src root/files/root --overlay-src layer1/files/root --overlay layer2/files/root layer2/files/work / -- pwd result in bwrap: Can't make overlay mount on /newroot/ [fsconfig(name=lowerdir, value=:/oldroot/tmp/root/files/root)]: Invalid argument. Specifying more than one overlay-src always result in this error. I'm testing on Arch Linux.

@swick
Copy link
Contributor

swick commented Oct 23, 2023

@rhendric it landed in 6.5.8.

Revert the escaping changes and maybe pull in my two commits and then try for another round of review?

@joanbm
Copy link

joanbm commented Oct 23, 2023

The issue @patlefort mentions appears to be due to https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-6.5.y&id=56f16bda27aabc5be062e80d20fe78c2417f392a (included in Linux ≥6.5.8).

So it looks like the branch in overlay_mount to split colon-separated paths into multiple fsconfig calls needs to go for now.

@rhendric
Copy link
Author

@patlefort, I'll be reverting most of the changes in the second commit of this PR after I have a Linux 6.5.8 VM on which to test. Until then, if you use just 2b1f37e, you shouldn't see that issue.

@patlefort
Copy link

It works unless I use / as an overlay source. Is there something special about using the root of my file system as an overlay source?
Example:
bwrap --overlay-src / --overlay-src ~/Software/PhotoRT --ro-overlay / -- pwd
bwrap: Can't make overlay mount on /newroot/ with options lowerdir=/oldroot/home/elrick/Software/PhotoRT:/oldroot/: Invalid argument

@rhendric
Copy link
Author

‘Overlapping overlay layers are not supported and can cause unexpected behavior’

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=146d62e5a5867fbf84490d82455718bfb10fe824

So using / as an overlay-src is always incorrect, because that overlaps everything. Is it worth implementing a specific check for this case in bubblewrap, so that a more helpful error can be presented?

@swick
Copy link
Contributor

swick commented Oct 25, 2023

Figuring out when two layers are overlapping is a hard problem that adds a ton of complexity. There are also lots of other restrictions with overlayfs. In general bwrap doesn't try to validate that the kernel op won't fail but rather just tries and if something goes wrong aborts.

@rhendric
Copy link
Author

By ‘this case’ I meant someone literally providing /, which would be easy to check. The generic ‘Invalid argument’ error that you get if you screw this up is pretty unhelpful. (For other cases of overlapping, the kernel seems to give back an ELOOP instead, which is associated with an awful error message by default but it should be easy to test for that result and die with a better one.)

@swick
Copy link
Contributor

swick commented Oct 25, 2023

By ‘this case’ I meant someone literally providing /, which would be easy to check.

Why this error exactly and not any others? This is a low-level tool not meant to be used by end-users.

@rhendric
Copy link
Author

<shrug> Being a low-level tool doesn't mean paying zero attention to usability. I don't know whether this project would want to add a few lines of code to cover this particular gotcha that came up in testing, which is why I'm asking the question of whether it's worth it and not just pushing code that does it.

@patlefort
Copy link

It did take me by surprise so maybe adding a word about it in the manual could help.

@rhendric rhendric force-pushed the rhendric/overlayfs branch 2 times, most recently from 6d85ee1 to 6dddda5 Compare November 18, 2023 01:58
@joanbm
Copy link

joanbm commented Jan 27, 2024

We should probably set the userxattr overlay mount flag? Without it, some warnings are logged in dmesg, and some scenarios are not supported (source):

$ mkdir lower upper work merged
$ mkdir lower/directory
$ bwrap --dev-bind / / --overlay-src "$PWD"/lower/ --overlay "$PWD"/{upper,work,merged}/ \
    sh -c 'rmdir merged/directory && mkdir merged/directory'
mkdir: cannot create directory 'merged/directory': Input/output error

Podman also does it as well for unprivileged/rootless containers, see: https://github.com/containers/storage/blob/a0eeb0781129973c70d2cb101b00e312631a5702/drivers/overlay/overlay.go#L1723

rhendric and others added 4 commits April 21, 2024 23:42
This commit adds --overlay, --tmp-overlay, --ro-overlay, and
--overlay-src options to enable bubblewrap to create overlay mounts.
These options are only permitted when bubblewrap is not installed
setuid.

Resolves: containers#412
Co-authored-by: William Manley <will@williammanley.net>
Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
@rhendric
Copy link
Author

I've rebased onto the 0.9.0 release and implemented the userxattr flag; thanks for the suggestion.

@smcv, any chance this can get your eyes again?

bubblewrap.c Outdated
die ("Can't make overlay mount on %s with options %s: "
"Overlay directories may not overlap",
arg2, arg1);
die_with_error ("Can't make overlay mount on %s with options %s",
Copy link

@joanbm joanbm Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this die_with_error should be die_with_mount_error (you probably missed this when rebasing to 0.9.0).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks!

multi_src = TRUE;
}

strappend (&sb, ",userxattr");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW Podman (https://github.com/containers/storage/blob/a0eeb0781129973c70d2cb101b00e312631a5702/drivers/overlay/overlay.go#L1723) does not add this flag if running as the real root (or rather: in the initial user namespace + has CAP_SYS_ADMIN). I don't think this matters here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't; this whole feature is disabled when running setuid.

Copy link

@joanbm joanbm Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I had in mind is the case where bubblewrap is run as root directly (without being elevated through setuid), at first glance it looks like it's supported, there's even someone using it and in this case is_privileged = FALSE so features which are disabled when running setuid, including overlay, still work.

Though IMO unless someone comes with a good reason why userxattr should not be used in this case, I'd rather just keep userxattr and not handle it specially, specially since the implementing the initial user namespace check is a bit of a PITA.

Signed-off-by: Ryan Hendrickson <ryan.hendrickson@alum.mit.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants