Migrate away from rootfs / DOCKER_RAMDISK #3512

tstromberg · 2019-01-08T17:21:25Z

This will allow us to use pivot_root and mitigate security issues which involve escaping containers.

afbjorklund · 2019-02-22T17:45:55Z

The security issues should be addressed now, right ? With opencontainers/runc@28a697c

AkihiroSuda · 2019-02-22T17:52:06Z

Mitigated, but chroot is still not recommended.

cyphar · 2019-02-23T02:22:16Z

Yes, I absolutely recommend against using chroot. It's simply not secure, and continuing to use it is a bad idea -- the security you get from chroot is incredibly minimal compared to the security you get from pivot_root. I'm actually of half a mind to remove chroot support from runc entirely, that's how bad of an idea it is to use it.

afbjorklund · 2019-02-23T08:31:00Z

OK. Moving away from Buildroot (already from Boot2Docker) probably means some work, though... From what we have seen so far, it will also make the footprint (in terms of the various disk images) larger. So it depends on how important it is to make the local VM secure - we don't use no_pivot_root for remote.

i.e. the default user has access to sudo (through %wheel) and the default user has a known password...

cyphar · 2019-02-23T14:23:31Z

In theory the fix could be as simple as just having a tmpfs for Docker. I don't know how the images are stored (and if they're baked into images that might be difficult) but since initramfs is already in-memory there isn't a difference with tmpfs other than the fact you need to load it on-boot.

afbjorklund · 2019-02-23T15:10:06Z

Interesting! Today we have "everything" on the rootfs (besides the usual runtime suspects), and then have the docker/containers on a mounted disk (for persistence). Using overlay2/overlay storage drivers.

Something like this: (with the cgroup/bind-mounts/overlay/shm excluded)

Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
devtmpfs        906M     0  906M   0% /dev
sysfs              0     0     0    - /sys
proc               0     0     0    - /proc
tmpfs           996M     0  996M   0% /dev/shm
devpts             0     0     0    - /dev/pts
tmpfs           996M  102M  895M  11% /run
tmpfs           996M     0  996M   0% /sys/fs/cgroup
hugetlbfs          0     0     0    - /dev/hugepages
nfsd               0     0     0    - /proc/fs/nfsd
mqueue             0     0     0    - /dev/mqueue
fusectl            0     0     0    - /sys/fs/fuse/connections
debugfs            0     0     0    - /sys/kernel/debug
tmpfs           996M   28K  996M   1% /tmp
/dev/sda1        17G  1.5G   14G  10% /mnt/sda1

i.e. docker lives in /var/lib/docker and crio lives in /var/lib/containers

TARGET              SOURCE                         FSTYPE OPTIONS
/var/lib/docker     /dev/sda1[/var/lib/docker]     ext4   rw,relatime,data=ordered
/var/lib/containers /dev/sda1[/var/lib/containers] ext4   rw,relatime,data=ordered

What new file system would be required, for it to be able to use pivot_root ?

container create failed: container_linux.go:344: starting container process caused "process_linux.go:424: container init caused \"rootfs_linux.go:109: jailing process inside rootfs caused \\\"pivot_root invalid argument\\\"\""

cyphar · 2019-02-23T15:18:09Z

The issue is that / is the type rootfs. If you added a tmpfs mount for /var/lib/docker, or switched / to be a full root filesystem (which is the case where the image size would increase), the problem should go away. To quote the comment from the pivot_root source:

 * Also, the current root cannot be on the 'rootfs' (initial ramfs) filesystem.
 * See Documentation/filesystems/ramfs-rootfs-initramfs.txt for alternatives
 * in this situation.

In fact (if you read Documentation/filesystems/ramfs-rootfs-initramfs.txt), rootfs is precisely identical to tmpfs -- it's just a special case which cannot be unmounted. This is actually the reason you can't pivot_root with rootfs -- because you cannot move the mount by design (it would be like killing pid1).

afbjorklund · 2019-02-23T15:23:23Z

But if we try to keep all the images on tmpfs, we would have <2G available (RAM) instead of >20G ?
And it would die on reboot, and all the other fun stuff. So I am not sure that approach is doable.

cyphar · 2019-02-23T15:26:02Z

I'm not sure I understand -- rootfs is a tmpfs. It's all in-memory in either case and /var is on rootfs so there isn't a difference -- or am I misunderstanding something?

afbjorklund · 2019-02-23T15:28:50Z

Probably not, but it seems you are saying that we need to keep everything in memory or we need to keep everything on disk - the current mix with booting from memory and storing images on disk won't work ?

cyphar · 2019-02-23T15:31:58Z

Sorry, you don't need to keep everything in memory -- I misunderstood and thought you were doing that already. The only thing you need is for / to not be rootfs. This can be done by just mounting tmpfs on top of rootfs in very early boot, or any of the other ideas mentioned in Documentation/filesystems/ramfs-rootfs-initramfs.txt.
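
One way to see whether a guest is still running with / on the initial rootfs is to look at the mount table. The sketch below is only an illustration, not something minikube ships: root_fstype is a hypothetical helper that reads mount-table text in the /proc/mounts format on standard input and prints the type of the filesystem mounted at /, assuming that when several entries share / the last one listed is the visible one.

```shell
# Hypothetical helper (not part of minikube): print the filesystem type of
# the entry mounted at "/" in mount-table text read from stdin.
# Fields in /proc/mounts are: device mountpoint fstype options dump pass.
# Assumption: when several entries share "/", the last one listed is the
# one that is visible, so the last match wins.
root_fstype() {
    awk '$2 == "/" { fstype = $3 } END { if (fstype != "") print fstype }'
}
```

On a running guest this could be called as `root_fstype < /proc/mounts`; per the discussion above, a result of rootfs means pivot_root would still be refused.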

afbjorklund · 2019-02-23T16:27:38Z

@cyphar : I'm not sure if Buildroot has such an option, or if anyone in minikube has the skills to do it... So the most likely approach is switching to a more common distro such as Debian or CentOS (like minishift).

However, doing so makes minikube even more similar to using some other approach to provision a VM... And like I mentioned earlier, the current attempts to do it have so far also increased the footprint (by 2-3x) ?

If someone can make tmpfs work, then please post it here.

afbjorklund · 2019-02-23T16:38:15Z

Here is the Fedora/CentOS method, if that helps: https://fedoraproject.org/wiki/LiveOS_image

It uses device-mapper snapshots:

live-base: 0 20971520 linear 
live-osimg-min: 0 20971520 snapshot 8608/8608 48
live-rw: 0 20971520 snapshot 383648/67108864 1504

So that the root file system is "normal":

TARGET              SOURCE                         FSTYPE OPTIONS
/                   /dev/mapper/live-rw            ext4   rw,noatime,seclabel
/var/lib/containers /dev/sda1[/var/lib/containers] xfs    rw,relatime,seclabel,attr2,inode64,noquota

So there is no need to run with no_pivot_root, but the ISO images are bigger than with rootfs.

cyphar · 2019-02-24T14:22:08Z

@afbjorklund The way you would have to do this is to mount a tmpfs which you then fill with your rootfs (rather than doing that with /) and then do an MS_MOVE over / so that you can then use it. There is a helper program called switch_root which is installed on most systems that does this for you (it's also a library function in a few things).

The "easiest" fix would be to do a cp -R of the important things on / into a new tmpfs and then switch_root to it (switch_root recursively deletes everything on the old filesystem). But you have to do this before anything else is mounted -- so you'd need to adjust your init system to do this (if you're using systemd I think it has some way of specifying that you want it to do this and it'll do it for you).

I can take a look at what changes would need to be done with Buildroot, but that would be the first thing I'd check.

As I said above, you don't need to use a loopback filesystem or anything like that -- you just need to mount tmpfs over the rootfs before anything happens on the system and it will work, because tmpfs and rootfs are completely identical except that rootfs doesn't have a parent mount (which makes pivot_root deny you from switching).

afbjorklund · 2019-02-24T15:01:29Z

@cyphar : thanks for the help, we are currently using buildroot version 2018.05 (with systemd):

https://git.busybox.net/buildroot/tree/?h=2018.05.x
deploy/iso/minikube-iso/configs/minikube_defconfig

afbjorklund · 2019-04-24T20:31:36Z

Finally I noticed which "bootcode" Boot2Docker is using for Tiny Core Linux, to use a tmpfs instead:

# noembed: put / on a tmpfs instead of the kernel "rootfs" (ramdisk);

10.32. noembed - use a separate tmpfs

This is an advanced option that changes where in RAM Core is run
from. By default, Core uses the tmpfs setup by the kernel; with this
bootcode, Core will setup a new tmpfs file system, and use that
instead.

Using this bootcode temporarily doubles the RAM use, as both
copies are kept in RAM at once during boot. As an extra copy is
made, it also slows the boot time. It allows GNU df to detect the
free space in /, used by some proprietary software installers.

Code: https://github.com/tinycorelinux/Core-scripts/blob/3013492508569a36fbb05a8a00cd90f38619f414/init#L13:L19

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init
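
The copy step in that snippet can be tried without touching a real root filesystem. The sketch below wraps the same tar pipeline in a hypothetical copy_tree function (the name and its arguments are illustrative, not taken from Tiny Core) and is meant to be run on ordinary directories; mounting the tmpfs and calling switch_root are deliberately left out.

```shell
# Hypothetical helper: copy everything under SRC into DST through a tar
# pipe, skipping entries named SKIP (in the original snippet this is the
# mount point of the new root, so the copy does not recurse into itself).
# Notes:
#  - GNU tar matches an --exclude pattern at any depth by default, so a
#    file called SKIP deeper in the tree is skipped as well.
#  - In plain sh the pipeline status is that of the extracting tar only,
#    so a failure of the archiving tar can go unnoticed.
copy_tree() {
    src=$1
    dst=$2
    skip=$3
    tar -C "$src" --exclude="$skip" -cf - . | tar -C "$dst" -xf -
}
```

For example, `copy_tree /some/source /some/target mnt` would copy the source tree while leaving out its mnt directory (both paths here are placeholders).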

afbjorklund · 2019-04-24T20:49:48Z

If someone could help out with converting this from init to systemd, that would be appreciated.

massimiliano-mantione · 2019-04-26T07:07:26Z

To check if I understand this properly: putting the equivalent of the code snippet above in a systemd unit file in deploy/iso/minikube-iso/board/coreos/minikube/rootfs-overlay/etc/systemd which needs to be executed "before anything else" will fix this?

And it should also fix #4143 (as a side effect)?

cyphar · 2019-04-26T12:56:26Z

@massimiliano-mantione I believe the correct way of doing this under systemd is with some initrd configuration magic (at least that's my understanding of this page). There is also already a systemctl switch-root command which you could use without needing to write the mount code yourself.

Every distribution I know of already does this, so we just need to copy how they do it. In particular, on my system there is a /usr/lib/systemd/system/initrd-switch-root.service (which is included as part of a systemd install) which does this:

#  SPDX-License-Identifier: LGPL-2.1+
#
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Switch Root
DefaultDependencies=no
ConditionPathExists=/etc/initrd-release
OnFailure=emergency.target
OnFailureJobMode=replace-irreversibly
AllowIsolate=yes

[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl --no-block switch-root /sysroot

So we just need to use initrd-switch-root.service (there's also initrd-switch-root.target but I'm not sure how to configure the bootup target transitions).

All of this work is done by dracut on systemd-based distributions, so maybe we should look at using that for building our initramfs?

afbjorklund · 2019-04-26T13:08:58Z

Bonus points are awarded for integrating this option into the Buildroot distribution, similar to how the "noembed" boot code (above) works for Tiny Core Linux...

afbjorklund · 2019-04-27T16:39:46Z

Apparently there is a magic "sysroot.mount" unit that does it:
https://www.freedesktop.org/software/systemd/man/bootup.html

                                               :
                                               v
                                         basic.target
                                               |
                        ______________________/|
                       /                       |
                       |            initrd-root-device.target
                       |                       |
                       |                       v
                       |                  sysroot.mount
                       |                       |
                       |                       v
                       |             initrd-root-fs.target
                       |                       |
                       |                       v
                       v            initrd-parse-etc.service
                (custom initrd                 |
                 services...)                  v
                       |            (sysroot-usr.mount and
                       |             various mounts marked
                       |               with fstab option
                       |              x-initrd.mount...)
                       |                       |
                       |                       v
                       |                initrd-fs.target
                       \______________________ |
                                              \|
                                               v
                                          initrd.target
                                               |
                                               v
                                     initrd-cleanup.service
                                          isolates to
                                    initrd-switch-root.target
                                               |
                                               v
                        ______________________/|
                       /                       v
                       |        initrd-udevadm-cleanup-db.service
                       v                       |
                (custom initrd                 |
                 services...)                  |
                       \______________________ |
                                              \|
                                               v
                                   initrd-switch-root.target
                                               |
                                               v
                                   initrd-switch-root.service
                                               |
                                               v
                                     Transition to Host OS

No idea how you implement that, though. It should copy from / to /sysroot.
It seems that this file is "normally" being generated by dracut at boot time:

https://github.com/dracutdevs/dracut/blob/bca1967c90967d5453d8b215ff28552776e4fcb3/modules.d/98dracut-systemd/rootfs-generator.sh

afbjorklund · 2019-08-14T15:28:20Z

@kfox1111 : never mind, found the original source of the /dev/console code.

#!/bin/sh
# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

Thank you for the suggestion, this will work out just fine.

Before:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
rootfs             0     0     0    - /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         425          48          16        1469        1382
Swap:             0           0           0

After:

$ df -h /
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           1.8G  567M  1.2G  33% /
$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1942         516          69         584        1357         830
Swap:             0           0           0

Now need to change the rest of the configuration etc, but this should be doable.

afbjorklund · 2019-08-14T15:30:02Z

And it worked fine with /sbin/switch_root, no need to build util-linux switch_root.

kfox1111 · 2019-08-14T15:59:05Z

If you can think of any reason why shared mounts break in minikube when they are first used, I'd really appreciate it. I'm struggling a bit trying to figure it out in #4072. I really thought it was this issue but seems unrelated. Thanks.

afbjorklund · 2019-08-14T17:32:20Z

No real ideas, sorry. Sounds unrelated?

kfox1111 · 2019-08-14T17:53:52Z

I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

kfox1111 · 2019-08-14T18:23:34Z

I can confirm DOCKER_RAMDISK is there so that rootfs works. With the init pivot to tmpfs from above, it is no longer required and allows shared mounts to work. We should remove it as part of this fix.

afbjorklund · 2019-08-14T18:28:55Z

> I think I tracked it down, in part, to Environment=DOCKER_RAMDISK=yes being in the docker.service. Was this because of rootfs?

Yes, that is related to --no-pivot (same as no_pivot_root = true in podman)

NoPivotRoot: os.Getenv("DOCKER_RAMDISK") != ""

https://github.com/moby/moby/blob/master/libcontainerd/remote/client.go#L205
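
Read literally, that Go expression only tests whether the variable is non-empty, so any value (including "no" or "false") would turn pivot_root off. A shell rendering of the same test, with a function name made up for this illustration, might look like this:

```shell
# Mirrors the quoted Go check: os.Getenv("DOCKER_RAMDISK") != ""
# Prints "true" when pivot_root would be skipped, "false" otherwise.
# An unset variable and an empty one are treated alike, as in Go.
no_pivot_root() {
    if [ -n "${DOCKER_RAMDISK:-}" ]; then
        echo true
    else
        echo false
    fi
}
```

So removing the Environment= line from docker.service, rather than setting it to some "off" value, appears to be the way to get pivot_root back.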

afbjorklund · 2019-08-14T18:34:11Z

Here is the same setting in crio.conf:

# If true, the runtime will not use pivot_root, but instead use MS_MOVE.
no_pivot = true

containerd:

      no_pivot = true

buildah:

export BUILDAH_NOPIVOT=true

kfox1111 · 2019-08-14T18:36:52Z

At least in the docker case, it looks like minikube may be injecting the docker.service file?

It only seems to show up after minikube start gets to a certain point.

afbjorklund · 2019-08-14T18:37:47Z

> At least in the docker case, it looks like minikube may be injecting the docker.service file?

All of them, actually. The default is false. (i.e. use pivot_root)

https://github.com/kubernetes/minikube/blob/v1.3.1/pkg/provision/buildroot.go#L98_L99

kfox1111 · 2019-08-14T18:39:22Z

ok. so it's a fix to the iso and to the minikube program.

afbjorklund · 2019-08-14T18:41:02Z

Yeah, theoretically we could have minikube look at the mounted file system and adjust appropriately...

That might be appreciated by people who are using older or forked version of the ISO for some reason.

kfox1111 · 2019-08-14T18:42:41Z

Does minikube do any templating on the files or just copy them right in?

If it's a straight copy, maybe we put the files inside the iso. If they exist, then copy them from the disabled dir to the final destination. If not, inject them. This would allow users to more easily customize them too.

afbjorklund · 2019-08-14T18:46:44Z

It's templated, unfortunately. This also has the side effect that you can't reboot the VM yourself.

See ~~#1851~~ (it has been there since day one: e8a60b9)

kfox1111 · 2019-08-14T18:50:01Z

hmm... is it gotl? maybe the raw templates could be copied from the iso, templated out to the final version, then injected back to the final location?

afbjorklund · 2019-08-14T19:59:34Z

Here is the code for the dynamic runtime configuration: 5afa5a2

It will detect a non-rootfs partition, and avoid DOCKER_RAMDISK

OOPS: we cannot use this code, since it needs to run over ssh

Anyway, same thing as the Go code - but in shell instead :-)

afbjorklund · 2019-08-14T20:24:19Z

Easiest is using df (from GNU coreutils), and filter out the header (as preferred):

$ minikube ssh "df --output=fstype / | sed 1d"
rootfs

So run that from go, instead of gopsutil, and adjust the go template accordingly.
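
The header filtering can be checked in isolation before wiring it into the provisioner. The sketch below assumes the two-line output that df --output=fstype prints for a single path; fstype_from_df is a hypothetical name, and it also trims surrounding whitespace, which the command above does not do.

```shell
# Hypothetical helper: read `df --output=fstype PATH` output on stdin,
# drop the header line (as `sed 1d` does above) and strip leading and
# trailing whitespace from what remains.
fstype_from_df() {
    sed -e 1d -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}
```

Inside the guest it would be used as `df --output=fstype / | fstype_from_df`.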

kfox1111 · 2019-08-14T20:47:08Z

Is there a pr for the tmpfs init?

afbjorklund · 2019-08-14T20:56:05Z

> Is there a pr for the tmpfs init?

There will be, eventually.

Basically same as above, just tweaked it a bit. We could make it dynamic, but I think that is overkill.
That is, honor the: grep -qw noembed /proc/cmdline (we already have "noembed" - but ignore it)

Used /sbin/switch_root.
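
Honoring the bootcode would amount to a whole-word match on the kernel command line. As a rough sketch (has_bootcode is a hypothetical function, and it takes the command line as a string so it can be tried without booting anything):

```shell
# Hypothetical helper: succeed if WORD appears as a whole word in CMDLINE.
# Uses the same grep -qw match as the command quoted above.
has_bootcode() {
    cmdline=$1
    word=$2
    printf '%s\n' "$cmdline" | grep -qw -- "$word"
}
```

In an init script this might be called as `has_bootcode "$(cat /proc/cmdline)" noembed` to decide whether to copy / to a new tmpfs.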

kfox1111 · 2019-08-14T20:57:49Z

I just verified my csi driver is working with the fixes in place. So excited to get a release with this in place so everyone can csi. :)

afbjorklund · 2019-08-14T21:03:03Z

These two, merged together.

buildroot: https://github.com/buildroot/buildroot/blob/master/fs/cpio/init

# devtmpfs does not get automounted for initramfs
/bin/mount -t devtmpfs devtmpfs /dev
exec 0</dev/console
exec 1>/dev/console
exec 2>/dev/console
exec /sbin/init "$@"

tinycore: https://github.com/tinycorelinux/Core-scripts/blob/master/init

if mount -t tmpfs -o size=90% tmpfs /mnt; then
  if tar -C / --exclude=mnt -cf - . | tar -C /mnt/ -xf - ; then
    mkdir /mnt/mnt
    exec /sbin/switch_root mnt /sbin/init
  fi
fi
exec /sbin/init

Probably /sysroot, not /mnt.

cyphar · 2019-08-22T07:54:04Z

@afbjorklund

> Easiest is using df (from GNU coreutils), and filter out the header (as preferred):
> $ minikube ssh "df --output=fstype / | sed 1d"
> rootfs

Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

afbjorklund · 2019-08-22T16:15:57Z

> Why not use statfs(2) directly (or stat --file-system --format '%T' /)?

That works too, thanks for the tip! Now looks like:

$ minikube ssh -- stat --file-system --format '%T' /
tmpfs

afbjorklund · 2019-08-24T14:05:36Z

Unfortunately I forgot to check that it still worked for the old ISO (it didn't):

$ minikube ssh -- stat --file-system --format '%T' /
ramfs

cyphar · 2019-08-25T14:44:53Z

Huh. It looks like "ramfs" is what stat calls "rootfs" (or rather, initramfs). Fundamentally both the df and stat solution are using the same syscall (statfs(2)) and checking what the filesystem magic number is. Arguably "ramfs" is the correct name, given the filesystem magic number is called RAMFS_MAGIC.
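
Given that the same initial root filesystem shows up as rootfs through df and as ramfs through stat, a check that has to work with both the old and the new ISO probably needs to accept either spelling. A minimal sketch, with a hypothetical function name and assuming the type string was obtained for / with one of the two commands above:

```shell
# Hypothetical helper: succeed when the reported type of "/" means the
# guest is still on the initial root filesystem, where pivot_root fails
# and DOCKER_RAMDISK is still needed.
#   rootfs - name printed by df --output=fstype in this thread
#   ramfs  - name printed by stat --file-system --format '%T' in this thread
root_is_initial_ramfs() {
    case "$1" in
        rootfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}
```

Any other answer, such as tmpfs or ext4, would leave pivot_root enabled.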

tstromberg · 2019-09-03T17:53:06Z

This appears to be fixed at head. Please re-open if I am mistaken:

$ stat --file-system --format '%T' /
tmpfs

tstromberg added the kind/security (security issues) and area/guest-vm (General configuration issues with the minikube guest VM) labels Jan 8, 2019

tstromberg changed the title ~~Migrate away from DOCKER_RAMDISK~~ Migrate away from rootfs / DOCKER_RAMDISK Jan 29, 2019

afbjorklund added the kind/design (Categorizes issue or PR as related to design) label Feb 22, 2019

tstromberg added the r/2019q2 (Issue was last reviewed 2019q2) label Apr 4, 2019

afbjorklund mentioned this issue Apr 24, 2019

Support buildkit #4143

Closed

afbjorklund self-assigned this Aug 18, 2019

afbjorklund added this to the v1.4.0 Candidate milestone Aug 18, 2019

afbjorklund mentioned this issue Aug 19, 2019

Move root filesystem from rootfs to tmpfs #5133

Merged

tstromberg closed this as completed Sep 3, 2019
