Migrate away from rootfs / DOCKER_RAMDISK #3512

Closed
tstromberg opened this issue Jan 8, 2019 · 69 comments
Labels: area/guest-vm, help wanted, kind/cleanup, kind/security, priority/important-longterm, r/2019q2

Comments

@tstromberg
Contributor

This will allow us to use pivot_root and mitigate security issues which involve escaping containers.

@tstromberg added the kind/security and area/guest-vm labels Jan 8, 2019
@tstromberg added the help wanted, priority/important-longterm and kind/cleanup labels Jan 23, 2019
@tstromberg changed the title from "Migrate away from DOCKER_RAMDISK" to "Migrate away from rootfs / DOCKER_RAMDISK" Jan 29, 2019
@afbjorklund
Collaborator

The security issues should be addressed now, right ? With opencontainers/runc@28a697c

@afbjorklund added the kind/design label Feb 22, 2019
@AkihiroSuda
Member

Mitigated but chroot is still unrecommended

cc @cyphar

@cyphar

cyphar commented Feb 23, 2019

Yes, I absolutely recommend against using chroot. It's simply not secure, and continuing to use it is a bad idea -- the security you get from chroot is incredibly minimal compared to the security you get from pivot_root. I'm actually of half a mind to remove chroot support from runc entirely, that's how bad of an idea it is to use it.

@afbjorklund
Collaborator

afbjorklund commented Feb 23, 2019

OK. Moving away from Buildroot (already from Boot2Docker) probably means some work, though... From what we have seen so far, it will also make the footprint (in terms of the various disk images) larger. So it depends on how important it is to make the local VM secure - we don't use no_pivot_root for remote.

i.e. the default user has access to sudo (through %wheel) and the default user has a known password...

@cyphar

cyphar commented Feb 23, 2019

In theory the fix could be as simple as just having a tmpfs for Docker. I don't know how the images are stored (and if they're baked into images that might be difficult) but since initramfs is already in-memory there isn't a difference with tmpfs other than the fact you need to load it on-boot.

@afbjorklund
Collaborator

Interesting! Today we have "everything" on the rootfs (besides the usual runtime suspects), and then have the docker/containers directories on a mounted disk (for persistence), using the overlay2/overlay storage drivers.

Something like this: (with the cgroup/bind-mounts/overlay/shm excluded)

Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
devtmpfs        906M     0  906M   0% /dev
sysfs              0     0     0    - /sys
proc               0     0     0    - /proc
tmpfs           996M     0  996M   0% /dev/shm
devpts             0     0     0    - /dev/pts
tmpfs           996M  102M  895M  11% /run
tmpfs           996M     0  996M   0% /sys/fs/cgroup
hugetlbfs          0     0     0    - /dev/hugepages
nfsd               0     0     0    - /proc/fs/nfsd
mqueue             0     0     0    - /dev/mqueue
fusectl            0     0     0    - /sys/fs/fuse/connections
debugfs            0     0     0    - /sys/kernel/debug
tmpfs           996M   28K  996M   1% /tmp
/dev/sda1        17G  1.5G   14G  10% /mnt/sda1

i.e. docker lives in /var/lib/docker and crio lives in /var/lib/containers

TARGET              SOURCE                         FSTYPE OPTIONS
/var/lib/docker     /dev/sda1[/var/lib/docker]     ext4   rw,relatime,data=ordered
/var/lib/containers /dev/sda1[/var/lib/containers] ext4   rw,relatime,data=ordered

What new file system would be required, for it to be able to use pivot_root ?

container create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\""

@cyphar

cyphar commented Feb 23, 2019

The issue is that / is of type rootfs. It would go away if you added a tmpfs mount for /var/lib/docker, or if you switched / to be a full root filesystem (which is the case where the image size would increase). To quote the comment from the pivot_root source:

 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
 * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives
 * in this situation.

In fact (if you read Documentation/filesystems/ramfs-rootfs-initramfs.txt), rootfs is precisely identical to tmpfs -- it's just a special case which cannot be unmounted. This is actually the reason you can't pivot_root with rootfs -- because you cannot move the mount by design (it would be like killing pid1).
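A minimal sketch of how this condition could be checked from a shell, assuming the usual /proc/mounts layout (device, mount point, type, options). The function names root_fstype and can_pivot_root are made up for illustration, and the last entry mounted at / is taken to be the visible root:

```shell
# Print the filesystem type of the mount currently visible at "/".
# $1: a file in /proc/mounts format (defaults to /proc/mounts).
root_fstype() {
    awk '$2 == "/" { t = $3 } END { if (t != "") print t }' "${1:-/proc/mounts}"
}

# Succeed unless "/" is still the initial rootfs, which pivot_root rejects.
can_pivot_root() {
    [ "$(root_fstype "$1")" != "rootfs" ]
}
```

On the ISO described above, `can_pivot_root` would be expected to fail, since the mount table lists rootfs at /.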

@afbjorklund
Collaborator

But if we try to keep all the images on tmpfs, we would have <2G available (RAM) instead of >20G ?
And it would die on reboot, and all the other fun stuff. So I am not sure that approach is doable.

@cyphar

cyphar commented Feb 23, 2019

I'm not sure I understand -- rootfs is a tmpfs. It's all in-memory in either case and /var is on rootfs so there isn't a difference -- or am I misunderstanding something?

@afbjorklund
Collaborator

Probably not, but it seems you are saying that we need to keep everything in memory or we need to keep everything on disk - the current mix with booting from memory and storing images on disk won't work ?

@cyphar

cyphar commented Feb 23, 2019

Sorry, you don't need to keep everything in memory -- I misunderstood and thought you were doing that already. The only thing you need is for / to not be rootfs. This can be done by just mounting tmpfs on top of rootfs in very early boot, or any of the other ideas mentioned in Documentation/filesystems/ramfs-rootfs-initramfs.txt.

@afbjorklund
Collaborator

@cyphar : I'm not sure if Buildroot has such an option, or if anyone in minikube has the skills to do it... So the most likely approach is switching to a more common distro such as Debian or CentOS (like minishift).

However, doing so makes minikube even more similar to using some other approach to provision a VM... And like I mentioned earlier, the attempts to do it have so far also increased the footprint (by 2-3x) ?

If someone can make tmpfs work, then please post it here.

@afbjorklund
Collaborator

Here is the Fedora/CentOS method, if that helps: https://fedoraproject.org/wiki/LiveOS_image

It uses device-mapper snapshots:

live-base: 0 20971520 linear 
live-osimg-min: 0 20971520 snapshot 8608/8608 48
live-rw: 0 20971520 snapshot 383648/67108864 1504

So that the root file system is "normal":

TARGET              SOURCE                         FSTYPE OPTIONS
/                   /dev/mapper/live-rw            ext4   rw,noatime,seclabel
/var/lib/containers /dev/sda1[/var/lib/containers] xfs    rw,relatime,seclabel,attr2,inode64,noquota

So there is no need to run with no_pivot_root, but the ISO images are bigger than with rootfs.

@cyphar

cyphar commented Feb 24, 2019

@afbjorklund The way you would have to do this is to mount a tmpfs, fill it with your rootfs (rather than doing that with /), and then do an MS_MOVE over / so that you can use it. There is a helper program called switch_root which is installed on most systems that does this for you (it's also a library function in a few things).

The "easiest" fix would be to do a cp -R of the important things on / into a new tmpfs and then switch_root to it (switch_root recursively deletes everything on the old filesystem). But you have to do this before anything else is mounted -- so you'd need to adjust your init system to do this (if you're using systemd I think it has some way of specifying that you want it to do this and it'll do it for you).

I can take a look at what changes would need to be done with Buildroot, but that would be the first thing I'd check.

As I said above, you don't need to use a loopback filesystem or anything like that -- you just need to mount tmpfs over the rootfs before anything happens on the system and it will work, because tmpfs and rootfs are completely identical except that rootfs doesn't have a parent mount (which makes pivot_root deny you from switching).

@afbjorklund
Collaborator

@cyphar : thanks for the help, we are currently using buildroot version 2018.05 (with systemd):

https://git.busybox.net/buildroot/tree/?h=2018.05.x
deploy/iso/minikube-iso/configs/minikube_defconfig

@tstromberg added the r/2019q2 label Apr 4, 2019
@afbjorklund
Collaborator

afbjorklund commented Apr 24, 2019

I finally noticed which "bootcode" Boot2Docker uses with Tiny Core Linux to get a tmpfs instead:

# noembed: put / on a tmpfs instead of the kernel "rootfs" (ramdisk);

10.32. noembed - use a separate tmpfs

This is an advanced option that changes where in RAM Core is run
from. By default, Core uses the tmpfs setup by the kernel; with this
bootcode, Core will setup a new tmpfs file system, and use that
instead.

Using this bootcode temporarily doubles the RAM use, as both
copies are kept in RAM at once during boot. As an extra copy is
made, it also slows the boot time. It allows GNU df to detect the
free space in /, used by some proprietary software installers.

Code: https://github.com/tinycorelinux/Core-scripts/blob/3013492508569a36fbb05a8a00cd90f38619f414/init#L13:L19

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init

@afbjorklund
Collaborator

If someone could help out with converting this from init to systemd, that would be appreciated.

@massimiliano-mantione

To check if I understand this properly: putting the equivalent of the code snippet above in a systemd unit file in deploy/iso/minikube-iso/board/coreos/minikube/rootfs-overlay/etc/systemd, which needs to be executed "before anything else", will fix this?

And it should also fix #4143 (as a side effect)?

@cyphar

cyphar commented Apr 26, 2019

@massimiliano-mantione I believe the correct way of doing this under systemd is with some initrd configuration magic (at least that's my understanding of this page). There is also already a systemctl switch-root command which you could use without needing to write the mount code yourself.

Every distribution I know of already does this, so we just need to copy how they do it. In particular, on my system there is a /usr/lib/systemd/system/initrd-switch-root.service (which is included as part of a systemd install) which does this:

#  SPDX-License-Identifier: LGPL-2.1+
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Switch Root
DefaultDependencies=no
ConditionPathExists=/etc/initrd-release
OnFailure=emergency.target
OnFailureJobMode=replace-irreversibly
AllowIsolate=yes

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --no-block switch-root /sysroot

So we just need to use initrd-switch-root.service (there's also initrd-switch-root.target but I'm not sure how to configure the bootup target transitions).

All of this work is done by dracut on systemd-based distributions, so maybe we should look at using that for building our initramfs?

@afbjorklund
Collaborator

Bonus points are awarded for integrating this option into the Buildroot distribution, similar to how the "noembed" boot code (above) works for Tiny Core Linux...

@afbjorklund
Collaborator

afbjorklund commented Apr 27, 2019

Apparently there is a magic "sysroot.mount" unit that does it:
https://www.freedesktop.org/software/systemd/man/bootup.html

                                               :
                                               v
                                         basic.target
                                               |
                        ______________________/|
                       /                       |
                       |            initrd-root-device.target
                       |                       |
                       |                       v
                       |                  sysroot.mount
                       |                       |
                       |                       v
                       |             initrd-root-fs.target
                       |                       |
                       |                       v
                       v            initrd-parse-etc.service
                (custom initrd                 |
                 services...)                  v
                       |            (sysroot-usr.mount and
                       |             various mounts marked
                       |               with fstab option
                       |              x-initrd.mount...)
                       |                       |
                       |                       v
                       |                initrd-fs.target
                       \______________________ |
                                              \|
                                               v
                                          initrd.target
                                               |
                                               v
                                     initrd-cleanup.service
                                          isolates to
                                    initrd-switch-root.target
                                               |
                                               v
                        ______________________/|
                       /                       v
                       |        initrd-udevadm-cleanup-db.service
                       v                       |
                (custom initrd                 |
                 services...)                  |
                       \______________________ |
                                              \|
                                               v
                                   initrd-switch-root.target
                                               |
                                               v
                                   initrd-switch-root.service
                                               |
                                               v
                                     Transition to Host OS

No idea how you implement that, though. It should copy from / to /sysroot.
It seems that this file is "normally" being generated by dracut at boot time:

https://github.com/dracutdevs/dracut/blob/bca1967c90967d5453d8b215ff28552776e4fcb3/modules.d/98dracut-systemd/rootfs-generator.sh

@afbjorklund
Collaborator

@kfox1111 : never mind, found the original source of the /dev/console code.

#!/bin/sh
# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

Thank you for the suggestion, this will work out just fine.

Before:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         425          48          16        1469        1382
Swap:             0           0           0

After:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.8G  567M  1.2G  33% /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         516          69         584        1357         830
Swap:             0           0           0

Now need to change the rest of the configuration etc, but this should be doable.

@afbjorklund
Collaborator

And it worked fine with /sbin/switch_root, no need to build util-linux switch_root.

@kfox1111

If you can think of any reason why shared mounts break in minikube when they are first used, I'd really appreciate it. I'm struggling a bit trying to figure it out in #4072. I really thought it was this issue but seems unrelated. Thanks.

@afbjorklund
Collaborator

No real ideas, sorry. Sounds unrelated?

@kfox1111

kfox1111 commented Aug 14, 2019

I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

@kfox1111

I can confirm DOCKER_RAMDISK is there so that rootfs works. With the init pivot to tmpfs from above, it is no longer required and allows shared mounts to work. We should remove it as part of this fix.

@afbjorklund
Collaborator

afbjorklund commented Aug 14, 2019

> I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

Yes, that is related to --no-pivot (same as no_pivot_root = true in podman)

NoPivotRoot: os.Getenv("DOCKER_RAMDISK") != ""

https://github.com/moby/moby/blob/master/libcontainerd/remote/client.go#L205

@afbjorklund
Collaborator

afbjorklund commented Aug 14, 2019

Here is the same setting in crio.conf:

# If true, the runtime will not use pivot_root, but instead use MS_MOVE.
no_pivot = true

containerd:

      no_pivot = true

buildah:

export BUILDAH_NOPIVOT=true
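To keep the per-runtime settings quoted in this thread in one place, here is a small lookup sketch. The function name no_pivot_setting is hypothetical, and the strings are only the ones mentioned above (including the docker.service environment line and podman's no_pivot_root); where each line belongs in its config file is not covered here.

```shell
# Print the "do not use pivot_root" setting mentioned in this thread
# for the given container runtime or tool.
no_pivot_setting() {
    case "$1" in
        docker)     echo 'Environment=DOCKER_RAMDISK=yes' ;;  # docker.service
        crio)       echo 'no_pivot = true' ;;                 # crio.conf
        containerd) echo 'no_pivot = true' ;;
        podman)     echo 'no_pivot_root = true' ;;
        buildah)    echo 'export BUILDAH_NOPIVOT=true' ;;
        *)          echo "unknown runtime: $1" >&2; return 1 ;;
    esac
}
```

Once / is no longer rootfs, none of these lines should be needed, since pivot_root is the default.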

@kfox1111

kfox1111 commented Aug 14, 2019

At least in the docker case, it looks like minikube may be injecting the docker.service file?

It only seems to show up after minikube start gets to a certain point.

@afbjorklund
Collaborator

afbjorklund commented Aug 14, 2019

> At least in the docker case, it looks like minikube may be injecting the docker.service file?

All of them, actually. The default is false. (i.e. use pivot_root)

https://github.com/kubernetes/minikube/blob/v1.3.1/pkg/provision/buildroot.go#L98_L99

@kfox1111

OK, so it's a fix to the ISO and to the minikube program.

@afbjorklund
Collaborator

Yeah, theoretically we could have minikube look at the mounted file system and adjust appropriately...

That might be appreciated by people who are using older or forked version of the ISO for some reason.

@kfox1111

Does minikube do any templating on the files or just copy them right in?

If it's a straight copy, maybe we put the files inside the ISO. If they exist, then copy them from the disabled dir to the final destination. If not, inject them. This would allow users to more easily customize them too.

@afbjorklund
Collaborator

afbjorklund commented Aug 14, 2019

It's templated, unfortunately. This also has the side effect that you can't reboot the VM yourself.

See #1851 (it has been there since day one: e8a60b9)

@kfox1111

hmm... is it gotl? maybe the raw templates could be copied from the iso, templated out to the final version, then injected back to the final location?

@afbjorklund
Collaborator

afbjorklund commented Aug 14, 2019

Here is the code for the dynamic runtime configuration: 5afa5a2

It will detect a non-rootfs partition, and avoid DOCKER_RAMDISK

OOPS: we cannot use this code, since it needs to run over ssh

Anyway, same thing as the Go code - but in shell instead :-)

@afbjorklund
Collaborator

Easiest is using df (from GNU coreutils), and filter out the header (as preferred):

$ minikube ssh "df --output=fstype / | sed 1d"
rootfs

So run that from go, instead of gopsutil, and adjust the go template accordingly.

@kfox1111

Is there a pr for the tmpfs init?

@afbjorklund
Collaborator

> Is there a pr for the tmpfs init?

There will be, eventually.

Basically same as above, just tweaked it a bit. We could make it dynamic, but I think that is overkill.
That is, honor the: grep -qw noembed /proc/cmdline (we already have "noembed" - but ignore it)

Used /sbin/switch_root.
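If the dynamic variant were wanted after all, honoring the bootcode could look roughly like this. It is only a sketch: has_bootcode is a made-up helper name, and it relies on grep supporting -w (GNU grep and BusyBox grep do).

```shell
# Succeed if bootcode $1 appears as a whole word on the kernel command line.
# $2: file to read (defaults to /proc/cmdline).
has_bootcode() {
    grep -qw -- "$1" "${2:-/proc/cmdline}"
}
```

An init script could then guard the tmpfs copy with `if has_bootcode noembed; then ...; fi` and fall through to the plain `exec /sbin/init` otherwise.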

@kfox1111

I just verified my csi driver is working with the fixes in place. So excited to get a release with this in place so everyone can csi. :)

@afbjorklund
Collaborator

These two, merged together.

buildroot: https://github.com/buildroot/buildroot/blob/master/fs/cpio/init

# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

tinycore: https://github.com/tinycorelinux/Core-scripts/blob/master/init

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init

Probably /sysroot, not /mnt.
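A rough sketch of what the two scripts merged together might look like, using /sysroot as suggested. This is not the init that was eventually committed; the helper names (copy_root, switch_to_tmpfs), the ordering of the devtmpfs mount, and the guard that only runs the boot path when the script is PID 1 and named init are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch: copy the initial rootfs into a fresh tmpfs, then switch_root into
# it so that "/" is no longer of type rootfs.
NEWROOT=/sysroot   # assumed mount point for the new root

# Copy the tree at $1 into $2, skipping entries named $3 (the new root itself).
copy_root() {
    tar -C "$1" --exclude="$3" -cf - . | tar -C "$2" -xf -
}

switch_to_tmpfs() {
    mkdir -p "$NEWROOT" || return 1
    mount -t tmpfs -o size=90% tmpfs "$NEWROOT" || return 1
    copy_root / "$NEWROOT" "${NEWROOT#/}" || return 1
    # Recreate the excluded (empty) mount point directory in the new root.
    mkdir -p "$NEWROOT$NEWROOT"
    # devtmpfs does not get automounted for initramfs
    mount -t devtmpfs devtmpfs "$NEWROOT/dev" || return 1
    exec 0<"$NEWROOT/dev/console" 1>"$NEWROOT/dev/console" 2>"$NEWROOT/dev/console"
    exec /sbin/switch_root "$NEWROOT" /sbin/init "$@"
}

# Only act when running as the real /init (PID 1).
if [ "$$" -eq 1 ] && [ "${0##*/}" = init ]; then
    switch_to_tmpfs "$@"
    # Fallback: stay on rootfs, as the stock Buildroot init does.
    /bin/mount -t devtmpfs devtmpfs /dev
    exec 0</dev/console 1>/dev/console 2>/dev/console
    exec /sbin/init "$@"
fi
```

As in the Tiny Core version, both copies of the root tree are in RAM until switch_root deletes the old one, so boot needs roughly twice the image size for a moment.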

@cyphar

cyphar commented Aug 22, 2019

@afbjorklund

> Easiest is using df (from GNU coreutils), and filter out the header (as preferred):
>
> $ minikube ssh "df --output=fstype / | sed 1d"
> rootfs

Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

@afbjorklund
Collaborator

> Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

That works too, thanks for the tip! Now looks like:

$ minikube ssh -- stat --file-system --format '%T' /
tmpfs

@afbjorklund
Collaborator

Unfortunately I forgot to check that it still worked for the old ISO (it didn't):

$ minikube ssh -- stat --file-system --format '%T' /
ramfs

@cyphar

cyphar commented Aug 25, 2019

Huh. It looks like "ramfs" is what stat calls "rootfs" (or rather, initramfs). Fundamentally both the df and stat solution are using the same syscall (statfs(2)) and checking what the filesystem magic number is. Arguably "ramfs" is the correct name, given the filesystem magic number is called RAMFS_MAGIC.
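Given that the two tools name the same filesystem differently, a check that has to work against both old and new ISOs could accept either spelling. A minimal sketch, with hypothetical function names, assuming that on these ISOs a "ramfs" at / always means the initial ramfs (a regular ramfs mounted as root would be reported the same way by stat):

```shell
# Succeed if the given type name denotes the initial ramfs root:
# df prints it as "rootfs", stat prints it as "ramfs".
is_initial_ramfs() {
    case "$1" in
        rootfs|ramfs) return 0 ;;
        *)            return 1 ;;
    esac
}

# Print which mode the container runtime should be configured for.
pivot_mode() {
    if is_initial_ramfs "$1"; then
        echo no-pivot
    else
        echo pivot
    fi
}
```

For example, `pivot_mode "$(stat --file-system --format '%T' /)"` could be run over `minikube ssh` and the result fed into the service template.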

@tstromberg
Contributor Author

This appears to be fixed at head. Please re-open if I am mistaken:

$ stat --file-system --format '%T' /
tmpfs


8 participants