Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests/int: some refactoring, fix a flake #2881

Merged
merged 7 commits into from
Apr 6, 2021

Conversation

kolyshkin
Copy link
Contributor

Fixes #2805

See all commits for descriptions. Most important ones are:

tests/int/cpt: fix lazy-pages flakiness

"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

(00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
(00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
...
(00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is

  1. The test runs runc checkpoint in lazy-pages mode in background.
  2. The test runs criu lazy-pages in background.
  3. The test runs runc restore.

Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.

🗒️ for slightly more detailed description of the above, see #2805 (comment))

The remedy here is to change the name of systemd unit to which the
container is restored.

tests/int: really randomize cgroup/unit names

Commit 41670e2 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but

  • the randomization was done only if set_cgroups_path is called,
    which is not done by every test;

  • the randomization was per bats instance, not per test.

Fix both issues by refactoring init_cgroups_path/set_cgroups_path
(moving variable part to set_cgroups_path), and calling the latter
from runc_spec, so it is now applicable to every container.

@kolyshkin
Copy link
Contributor Author

@adrianreber PTAL (last commit only)

Helper function init_cgroup_paths sets two sets of cgroup path variables
for cgroup v1 case (below XXX is cgroup controller name, e.g. MEMORY):

1. CGROUP_XXX_BASE_PATH -- path to XXX controller mount point
   (e.g. CGROUP_MEMORY_BASE_PATH=/sys/fs/cgroup/memory);

2. CGROUP_XXX -- path to the particular container XXX controller cgroup
   (e.g. CGROUP_MEMORY=/sys/fs/cgroup/memory/runc-cgroups-integration-test/test-cgroup).

The second set of variables is mostly used by check_cgroup_value(),
with only two exceptions:
 - CGROUP_CPU in @test "update rt period and runtime";
 - few CGROUP_XXX in @test "runc delete --force in cgroupv1 with
   subcgroups".

Remove these variables, as their values are not used much
and are easy to get (as can be seen in modified test cases).

While at it, mark some variables as local.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

Need to figure out why rootless fails after the second commit (force randomized cgroup paths). For tomorrow

@adrianreber
Copy link
Contributor

@adrianreber PTAL (last commit only)

Uh... I like your analysis and the description of the solution. Sounds plausible to me. Not sure I can say anything about the actual code changes. You remove more lines than you add. That is usually a good sign 😄

Commit 41670e2 removed BUSYBOX_BUNDLE env var, but c3ffd2e
was developed before 41670e2 was merged.

Everything still works because now BUSYBOX_BUNDLE has no value.
Nevertheless, let's remove it to avoid confusion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 41670e2 added some randomization to cgroup paths
and (if systemd cgroup driver is used) systemd unit names,
but the randomization was per bats instance, not per test.

Fix this by refactoring init_cgroups_path/set_cgroups_path
(moving variable/random part to set_cgroups_path).

NOTE though that the randomization is only performed for those tests
that explicitly call set_cgroups_path.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Contributor Author

@adrianreber sorry, this is not yet ready, still working on it and I got the patch I wanted you to look at removed. Basically I just change the cgroupsPath, so the container is restored into a different cgroup. I'll let you know once it's ready.

In check_pipes, make sure we
 - close all fds we opened in setup_pipes;
 - check that runc stderr is empty (and fail if it's not).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Remove printing criu args as now they are *always swrk 3.
2. Remove duplicated "feature check says" debug.

Before:

> DEBU[0000] Using CRIU with following args: [swrk 3]
> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: type:FEATURE_CHECK success:true features:<mem_track:false lazy_pages:true >
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

After:

> DEBU[0000] Using CRIU in FEATURE_CHECK mode
> DEBU[0000] Feature check says: mem_track:false lazy_pages:true

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
"checkpoint --lazy-pages and restore" test sometimes fails on restore
in our CI on Fedora 33 when systemd cgroup driver is used:

> (00.076104) Error (compel/src/lib/infect.c:1513): Task 48521 is in unexpected state: f7f
> (00.076122) Error (compel/src/lib/infect.c:1520): Task stopped with 15: Terminated
> ...
> (00.078246) Error (criu/cr-restore.c:2483): Restoring FAILED.

I think what happens is

1. The test runs runc checkpoint in lazy-pages mode in background.
2. The test runs criu lazy-pages in background.
3. The test runs runc restore.

Now, all three are working in together: criu restore restores, criu
lazy-pages listens for page faults on a uffd and fetch missing pages
from runc checkpoint, who serves those pages.

At some point criu lazy-pages decides to fetch the rest of the pages,
and once it's done it exits, and runc checkpoint, as there are no more
pages to serve, exits too.

At the end of runc checkpoint the container is removed (see "defer
destroy(container)" in checkpoint.go. This involves a call to
cgroupManager.Destroy, which, in case systemd manager is used,
calls stopUnit, which makes systemd to not just remove the unit,
but also send SIGTERM to its processes, if there are any.

As the container is being restored into the same systemd unit,
sometimes this results in sending SIGTERM to a process which
criu restores, and thus restoring fails.

The remedy here is to change the name of systemd unit to which the
container is restored.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin kolyshkin marked this pull request as ready for review April 1, 2021 20:56
@kolyshkin
Copy link
Contributor Author

kolyshkin commented Apr 1, 2021

This presumably fixes a flake, and all the changes (except couple of removed debug prints in the checkpoint code) are in tests/integration, so adding rc94 milestone.

@kolyshkin kolyshkin added this to the 1.0.0-rc94 milestone Apr 1, 2021
@kolyshkin
Copy link
Contributor Author

@cyphar @AkihiroSuda PTAL

@kolyshkin kolyshkin requested a review from cyphar April 5, 2021 23:30
@AkihiroSuda AkihiroSuda merged commit c453f1a into opencontainers:master Apr 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

not ok 15 checkpoint --lazy-pages and restore (Fedora 33, runc restore fails)
4 participants