Migrate away from rootfs / DOCKER_RAMDISK #3512
Comments
The security issues should be addressed now, right ? With opencontainers/runc@28a697c |
Mitigated but cc @cyphar |
Yes, I absolutely recommend against using |
OK. Moving away from Buildroot (already from Boot2Docker) probably means some work, though... From what we have seen so far, it will also make the footprint (in terms of the various disk images) larger. So it depends on how important it is to make the local VM secure - we don't use i.e. the default user has access to |
In theory the fix could be as simple as just having a |
Interesting! Today we have "everything" on the rootfs (besides the usual runtime suspects), and keep the docker/containers on a mounted disk (for persistence), using the overlay2/overlay storage drivers. Something like this (with the cgroup/bind-mounts/overlay/shm excluded):
i.e. docker lives in
What new file system would be required, for it to be able to use
|
The issue is that
In fact (if you read |
But if we try to keep all the images on tmpfs, we would have <2G available (RAM) instead of >20G ? |
I'm not sure I understand -- |
Probably not, but it seems you are saying that we need to keep everything in memory or we need to keep everything on disk - the current mix with booting from memory and storing images on disk won't work ? |
Sorry, you don't need to keep everything in memory -- I misunderstood and thought you were doing that already. The only thing you need is for |
@cyphar : I'm not sure if Buildroot has such an option, or if anyone in minikube has the skills to do it... So the most likely approach is switching to a more common distro such as Debian or CentOS (like minishift). However, doing so makes minikube even more similar to using some other approach to provision a VM... And like I mentioned earlier, the attempts so far have also increased the footprint (by 2-3x). If someone can make tmpfs work, then please post it here. |
Here is the Fedora/CentOS method, if that helps: https://fedoraproject.org/wiki/LiveOS_image It uses device-mapper snapshots:
So that the root file system is "normal":
So there is no need to run with |
@afbjorklund The way you would have to do this is that you mount a The "easiest" fix would be to do a I can take a look at what changes would need to be done with Buildroot, but that would be the first thing I'd check. As I said above, you don't need to use a loopback filesystem or anything like that -- you just need to mount |
@cyphar : thanks for the help, we are currently using buildroot version 2018.05 (with systemd): https://git.busybox.net/buildroot/tree/?h=2018.05.x |
Finally I noticed which "bootcode" that Boot2Docker is using for tiny core linux, to use a tmpfs instead:
if mount -t tmpfs -o size=90% tmpfs /mnt; then
    if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
        mkdir /mnt/mnt
        exec /sbin/switch_root mnt /sbin/init
    fi
fi
exec /sbin/init |
If someone could help out with converting this from init to systemd, that would be appreciated. |
To check if If I understand this properly: putting the equivalent of the code snippet above in a systemd unit file in And it should also fix #4143 (as a side effect)? |
@massimiliano-mantione I believe the correct way of doing this under systemd is with some Every distribution I know of already does this, so we just need to copy how they do it. In particular, on my system there is a
# SPDX-License-Identifier: LGPL-2.1+
#
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=Switch Root
DefaultDependencies=no
ConditionPathExists=/etc/initrd-release
OnFailure=emergency.target
OnFailureJobMode=replace-irreversibly
AllowIsolate=yes
[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --no-block switch-root /sysroot
So we just need to use All of this work is done by |
Bonus points are awarded for integrating this option into the Buildroot distribution, similar to how the "noembed" boot code (above) works for Tiny Core Linux... |
Apparently there is a magic "sysroot.mount" unit that does it:
                               :
                               v
                         basic.target
                               |
        ______________________/|
       /                       |
       |            initrd-root-device.target
       |                       |
       |                       v
       |                  sysroot.mount
       |                       |
       |                       v
       |             initrd-root-fs.target
       |                       |
       |                       v
       v            initrd-parse-etc.service
(custom initrd                 |
 services...)                  v
       |            (sysroot-usr.mount and
       |             various mounts marked
       |               with fstab option
       |              x-initrd.mount...)
       |                       |
       |                       v
       |                initrd-fs.target
       \______________________ |
                              \|
                               v
                          initrd.target
                               |
                               v
                     initrd-cleanup.service
                          isolates to
                    initrd-switch-root.target
                               |
                               v
        ______________________/|
       /                       v
       |        initrd-udevadm-cleanup-db.service
       v                       |
(custom initrd                 |
 services...)                  |
       \______________________ |
                              \|
                               v
                   initrd-switch-root.target
                               |
                               v
                   initrd-switch-root.service
                               |
                               v
                     Transition to Host OS
No idea how you implement that, though. It should copy from |
@kfox1111 : never mind, found the original source of the
#!/bin/sh
# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"
Thank you for the suggestion, this will work out just fine. Before:
After:
Now need to change the rest of the configuration etc, but this should be doable. |
And it worked fine with |
If you can think of any reason why shared mounts break in minikube when they are first used, I'd really appreciate it. I'm struggling a bit trying to figure it out in #4072. I really thought it was this issue but seems unrelated. Thanks. |
No real ideas, sorry. Sounds unrelated? |
I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs? |
I can confirm DOCKER_RAMDISK is there so that rootfs works. With the init pivot to tmpfs from above, it is no longer required and allows shared mounts to work. We should remove it as part of this fix. |
Yes, that is related to NoPivotRoot: os.Getenv("DOCKER_RAMDISK") != "" https://github.com/moby/moby/blob/master/libcontainerd/remote/client.go#L205 |
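The Go check quoted above compares the variable against the empty string, so it appears that any non-empty value disables pivot_root, including one that reads like "no". A minimal shell sketch of that logic follows; the helper name `uses_no_pivot_root` is hypothetical and only mirrors the quoted expression, it is not part of moby or minikube.

```shell
# Hypothetical helper mirroring the Go expression quoted in the thread:
#   NoPivotRoot: os.Getenv("DOCKER_RAMDISK") != ""
# $1 is the value of DOCKER_RAMDISK (empty string when unset).
uses_no_pivot_root() {
    [ -n "$1" ]
}

# Show how a few candidate values would be interpreted.
for value in "" yes no; do
    if uses_no_pivot_root "$value"; then
        echo "DOCKER_RAMDISK='$value' -> pivot_root disabled"
    else
        echo "DOCKER_RAMDISK='$value' -> pivot_root used"
    fi
done
```

If that reading is right, re-enabling pivot_root means dropping the `Environment=DOCKER_RAMDISK=yes` line from docker.service entirely rather than changing its value.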
Here is the same setting in crio.conf:
# If true, the runtime will not use pivot_root, but instead use MS_MOVE.
no_pivot = true
containerd:
no_pivot = true
buildah:
export BUILDAH_NOPIVOT=true |
At least in the docker case, it looks like minikube may be injecting the docker.service file? It only seems to show up after minikube start gets to a certain point. |
All of them, actually. The default is false. (i.e. use https://github.com/kubernetes/minikube/blob/v1.3.1/pkg/provision/buildroot.go#L98_L99 |
OK, so it's a fix to both the ISO and the minikube program. |
Yeah, theoretically we could have minikube look at the mounted file system and adjust appropriately... That might be appreciated by people who are using older or forked version of the ISO for some reason. |
Does minikube do any templating on the files, or does it just copy them right in? If it's a straight copy, maybe we put the files inside the ISO. If they exist, then copy them from the disabled dir to the final destination. If not, inject them. This would also let users customize them more easily. |
hmm... is it gotl? maybe the raw templates could be copied from the iso, templated out to the final version, then injected back to the final location? |
Here is the code for the dynamic runtime configuration: 5afa5a2
It will detect a non-rootfs partition, and avoid DOCKER_RAMDISK.
OOPS: we cannot use this code, since it needs to run over ssh.
Anyway, same thing as the Go code, but in shell instead :-) |
Easiest is using
$ minikube ssh "df --output=fstype / | sed 1d"
rootfs
So run that from Go, instead of gopsutil, and adjust the Go template accordingly. |
Is there a pr for the tmpfs init? |
There will be, eventually. Basically same as above, just tweaked it a bit. We could make it dynamic, but I think that is overkill. Used |
I just verified my csi driver is working with the fixes in place. So excited to get a release with this in place so everyone can csi. :) |
These two, merged together.
buildroot: https://github.com/buildroot/buildroot/blob/master/fs/cpio/init
# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"
tinycore: https://github.com/tinycorelinux/Core-scripts/blob/master/init
if mount -t tmpfs -o size=90% tmpfs /mnt; then
    if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
        mkdir /mnt/mnt
        exec /sbin/switch_root mnt /sbin/init
    fi
fi
exec /sbin/init
Probably |
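For readers following along, here is a rough sketch of what a merge of those two scripts could look like. It is not the script that ended up in the ISO. The function names `copy_rootfs` and `boot_from_tmpfs`, the use of /mnt as the staging mount point, and mounting devtmpfs on the new root only after the copy are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: one way the buildroot and tinycore init scripts might be merged.
# Function names, the /mnt staging path and the step ordering are assumptions.

# Copy the tree rooted at $1 into $2, skipping every entry named "mnt"
# (the tar pattern is unanchored, as in the tinycore script).
copy_rootfs() {
    tar -C "$1" --exclude=mnt -cf - . | tar -C "$2" -xf -
}

# Intended to run as /init (PID 1) inside the initramfs.
boot_from_tmpfs() {
    if mount -t tmpfs -o size=90% tmpfs /mnt && copy_rootfs / /mnt; then
        # Recreate the mount point that the copy skipped.
        mkdir /mnt/mnt
        # devtmpfs does not get automounted for initramfs
        mount -t devtmpfs devtmpfs /mnt/dev
        exec 0</mnt/dev/console 1>/mnt/dev/console 2>/mnt/dev/console
        # The tmpfs becomes the real root, so pivot_root can work later.
        exec /sbin/switch_root /mnt /sbin/init "$@"
    fi
    # Fallback: keep booting from the initramfs rootfs as before.
    mount -t devtmpfs devtmpfs /dev
    exec 0</dev/console 1>/dev/console 2>/dev/console
    exec /sbin/init "$@"
}
```

When installed as /init, the script would end with a call to `boot_from_tmpfs "$@"`. The call is left out here so that sourcing the file to try `copy_rootfs` on scratch directories has no side effects.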
Why not use |
That works too, thanks for the tip! Now looks like:
$ minikube ssh -- stat --file-system --format '%T' /
tmpfs |
Unfortunately I forgot to check that it still worked for the old ISO (it didn't):
$ minikube ssh -- stat --file-system --format '%T' /
ramfs |
Huh. It looks like "ramfs" is what |
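Since the old ISO reports "rootfs" through df but "ramfs" through stat, a check that decides whether DOCKER_RAMDISK is still needed probably has to accept both names. A small sketch of that decision follows; the function name `docker_ramdisk_env` is hypothetical, and treating "ramfs" the same as "rootfs" is an assumption drawn from the two outputs shown in this thread.

```shell
# Hypothetical helper: print the docker.service environment line that is
# still needed for a given root file system type, or nothing at all.
# Treating "ramfs" like "rootfs" is an assumption based on this thread.
docker_ramdisk_env() {
    case "$1" in
        rootfs|ramfs) echo "Environment=DOCKER_RAMDISK=yes" ;;
        *) ;;  # tmpfs or a real disk: pivot_root works, emit nothing
    esac
}

# Example: classify the local root file system (needs GNU stat).
if fstype=$(stat --file-system --format '%T' / 2>/dev/null); then
    echo "root file system type: $fstype"
    docker_ramdisk_env "$fstype"
fi
```

In minikube the type would come from the `minikube ssh` call shown above rather than from the local machine, and the result would feed the Go template instead of being echoed.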
This appears to be fixed at head. Please re-open if I am mistaken:
|
This will allow us to use pivot_root and mitigate security issues which involve escaping containers.