nixos: fix some remaining consequences of network-online.target dep fix #282795

Open · wants to merge 33 commits into master
Conversation

@lf- (Member) commented Jan 22, 2024

Fixes #282640

Consequence of: #258680

You can see the failed tests here: https://hydra.nixos.org/eval/1803728#tabs-now-fail

Some of these seem to just be flakiness that might be exposed by the change, some are baffling (like firefox), and some are weird timeouts that make no sense but might be related.

Description of changes

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

This seems to be caused by not successfully waiting for the container to
actually set up the network, so this is an unsurprising failure.
This flakiness might have been exposed by
NixOS#258680 but it definitely was buggy
before, since rss2email finishes its job and exits, which means the unit
would be considered stopped if it completed. It's not (and shouldn't be)
a oneshot service, so we need to use a different method to detect its
completion state.
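One hedged way to detect completion of a simple (non-oneshot) service in a NixOS test is to poll for the job's side effect rather than the unit's state, since a service that exits reads as "inactive" to `wait_for_unit`. A sketch only; the marker path below is hypothetical, not the actual fix in this PR:

```nix
{
  testScript = ''
    machine.wait_for_unit("multi-user.target")
    # Hypothetical: wait for evidence that the job ran, instead of
    # asking systemd about a unit that may have already exited.
    machine.wait_until_succeeds("test -e /var/lib/rss2email/completed-marker")
  '';
}
```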
Seems to be that the hosts are not reachable by each other if
network-online is not pulled in.
This one was waiting for systemd-networkd-wait-online.service, which is
appropriately not pulled in by the default target.
I honestly have no idea why this was broken. It makes no sense at all,
and I wonder if we have an unrelated race condition this papers over.

(finished: waiting for unit sshd, in 10.23 seconds)
server: waiting for unit iodined
(finished: waiting for unit iodined, in 0.03 seconds)
client: waiting for unit iodine-testClient
client: waiting for the VM to finish booting
client: Guest shell says: b'Spawning backdoor root shell...\n'
client: connected to guest root shell
client: (connecting took 0.00 seconds)
(finished: waiting for the VM to finish booting, in 0.00 seconds)
(finished: waiting for unit iodine-testClient, in 0.04 seconds)
client: must succeed: check_ssh -H 10.53.53.1
client: output: CRITICAL - Socket timeout after 10 seconds
Our running hypothesis for why this was broken is that the service
transitioned active -> failed/inactive, wait_for_unit caught it in that
state, and then it said "inactive, no pending jobs".

The fix to this is to use RestartMode=direct, which will keep the unit
active while the restart is in progress.
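In NixOS module terms, the fix described above looks roughly like this (a sketch; `myservice` is a placeholder name, and I believe `RestartMode=` requires systemd 254 or newer):

```nix
{
  systemd.services.myservice.serviceConfig = {
    Restart = "on-failure";
    RestartSec = "1s";
    # Keeps the unit active while the restart is in progress, instead of
    # passing through failed/inactive where wait_for_unit can catch it.
    RestartMode = "direct";
  };
}
```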
@lf- lf- changed the title nixos/borgbackup: fix under network-online dep fix nixos: fix some remaining consequences of network-online.target dep fix Jan 22, 2024
@lf- lf- requested a review from mweinelt as a code owner January 22, 2024 07:14
I have genuinely no idea why this was necessary, and I suspect it could
be masking a race condition.
https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75
It appears that somehow, systemctl -q is-active has different results if
run too early. This makes no sense at all.

We put in a sleep since it fixes the problem, but the problem makes no
sense.
@lf- lf- requested a review from a team January 22, 2024 20:22
@ElvishJerricco (Contributor) left a comment
For the most part, this all looks fine to me, though I don't fully understand all of it; that's mainly due to my lack of domain knowledge.

@@ -28,7 +28,7 @@ import ./make-test-python.nix ({ pkgs, lib, ... }: {
};
group.root.exists = true;
kernel-param."kernel.ostype".value = "Linux";
service.goss = {
service."systemd-journald" = {
Contributor:
why the change from goss to systemd-journald?

lf- (Member Author):
The shape of the failure is goss reporting that some service is not running. journald should be started very very early and is definitely enabled, so this was trying to make the whole thing more deterministic, in case systemd didn't acknowledge goss as running or something. Nevertheless it still would fail without the sleep.

This makes absolutely no sense: the monitor here is simply invoking systemctl, and I have no idea why that fails, or why it fixes itself later in boot. This change and the delay should be enough to stop it happening, but I have no idea why it was happening in the first place. https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75

AFAICT this is not my regression; at best it exposes a test that was already flaky for bizarre reasons.

after = [ "network.target" ];

# Seems to save bad network state, breaking it if it is started too
# early relative to the network.
Contributor:

Was this just always flaky? Or is this breakage new?

lf- (Member Author):

I think this breakage might be new, but I absolutely don't understand why it is happening, given that it is resolved by depending on network-online.target.

with subtest("returns health status"):
result = json.loads(machine.succeed("curl -sS http://localhost:8080/healthz"))
print(result)
Contributor:
Guessing this print was for development purposes and can be removed? Not that it matters

lf- (Member Author):

Yeah, it was left in so the test is debuggable if it fails in the future, since it doesn't print out the failures.

@@ -23,6 +23,8 @@ import ./make-test-python.nix ({ pkgs, lib, ...} :
DynamicUser = true;
Restart = "on-failure";
RestartSec = "1s";
# restart service without going through failed/inactive state that confuses wait_for_unit
RestartMode = "direct";
Contributor:
I'm curious why it seems to be expected that this service will need to restart, but I suppose it was already like that.

lf- (Member Author):

It needs to restart because the rtsp stream broker in the middle doesn't have its sockets up immediately, so it will have a few false starts on connecting to the stream broker.

We could probably try harder, but shrug. This works.

@lf- (Member Author) commented Jan 23, 2024

Other breakage in that hydra eval:
  • bees: https://cache.nixos.org/log/zfbx1d4flzkw3qk30mv6hb1wyhhvgk2f-vm-test-run-bees.drv - believed to be at best a race condition, but doesn't touch the network, so ???
  • fcitx: https://cache.nixos.org/log/aakhfl6nr893c5k7882nj2zc5niw7dak-vm-test-run-fcitx5.drv - I have no idea at all honestly. Can't much understand why this would be our regression.
  • firefox: https://cache.nixos.org/log/hflfz6m3nzx7zc33w6mmm4fnwkl8qzhf-vm-test-run-firefox.drv - EMFILE system-wide. how?! Was told by K900 this is an existing issue.
  • floorp: https://cache.nixos.org/log/5h7c9s6b7y80hlbgihsw08ppw75mi2xz-vm-test-run-floorp.drv - maybe the same issue as firefox, but seems like a dodgy test design. idk.
  • hadoop: https://cache.nixos.org/log/vidh0xjw31n05ih0jp9075c2l8v9gq6r-vm-test-run-hadoop-hbase.drv - client disconnect? doesn't feel like it was our fault, but idk.
  • kafka: https://cache.nixos.org/log/l9ifr8d10vsfp4zlvfq26h5r571gvl3i-vm-test-run-kafka_2_8.drv - this is actually plausibly our fault, and probably fixed by making it depend on network-online.target somewhere, perhaps in the test, perhaps in the service. Leaning toward the service, because it was started by systemd and failed there. Cannot repro 😭
  • keycloak: https://cache.nixos.org/log/1rrljqp27nir794i1p728jz9hjwn8vdh-vm-test-run-keycloak.drv - 401s?!?! Can't really see this being our fault, but idk.
  • nixos-rebuild: https://cache.nixos.org/log/q8wpwws6fchnlp6byqdj7v982zpsz2kf-vm-test-run-nixos-rebuild-target-host.drv - works on my machine 😭
  • opentabletdriver: https://cache.nixos.org/log/v41kqqcbir81pnwgqfc9fmxq0705x409-vm-test-run-opentabletdriver.drv - suspect a race condition; may have been exposed by us but not caused by us.
  • starship: https://cache.nixos.org/log/pmjzgn5xyrw4q13l27n2k6il6f7n0lin-vm-test-run-starship.drv - at best it is broken because of network interfaces not being up, but this is a shell prompt?!
  • warzone2100: https://cache.nixos.org/log/rgkbmrmm6xlf5j75dni5fw6jhh3lyffv-vm-test-run-warzone2100.drv - ????? maybe a network dependency?? doubt it? but idk?

The x11 based ones failing might be indicative of needing tighter dependencies, but the rest of these just look like weird flakiness.

@jpotier (Contributor) commented Jan 23, 2024

I've been seeing this for a few days:

Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]: [14B blob data]
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        … while calling the 'head' builtin
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:          at /nix/store/rqdf3gfjq8zh488msnv62h03kkzr308q-source/lib/attrsets.nix:960:11:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           959|         || pred here (elemAt values 1) (head values) then
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           960|           head values
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:              |           ^
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           961|         else
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        … while evaluating the attribute 'value'
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:          at /nix/store/rqdf3gfjq8zh488msnv62h03kkzr308q-source/lib/modules.nix:809:9:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           808|     in warnDeprecation opt //
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           809|       { value = builtins.addErrorContext "while evaluating the option `${showOption loc}':" value;
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:              |         ^
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           810|         inherit (res.defsFinal') highestPrio;
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        (stack trace truncated; use '--show-trace' to show the full trace)
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        error:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        Failed assertions:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        - mopidy.service is ordered after 'network-online.target' but doesn't depend on it
Jan 23 08:00:33 kktdr systemd[1]: nixos-upgrade.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 08:00:33 kktdr systemd[1]: nixos-upgrade.service: Failed with result 'exit-code'.
Jan 23 08:00:33 kktdr systemd[1]: Failed to start NixOS Upgrade.

I'm assuming mopidy also needs a patch?

@lf- (Member Author) commented Jan 23, 2024

yup

after = [ "network-online.target" "sound.target" ];

i will add it when i get to my computer. unsure why this was missed.
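The assertion in the log above fires because the unit is ordered `after` network-online.target without also pulling it in. The fix presumably pairs the two, roughly (a sketch, not the actual patch):

```nix
{
  systemd.services.mopidy = {
    # Ordering (after=) alone trips the NixOS assertion; the unit must
    # also pull the target in (wants=) for the ordering to take effect.
    wants = [ "network-online.target" ];
    after = [ "network-online.target" "sound.target" ];
  };
}
```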

@ElvishJerricco (Contributor) commented Jan 25, 2024

I grepped through nixpkgs and found all these services that are at least suspicious still since #258680.

  • tarsnap-restore-${name}
  • errbot
  • alertmanager-irc-replay
  • sachet
  • kea-ctrl-agent
  • bitwarden-directory-connector-cli.timer
  • lifecycled-queue-cleaner.timer

I also found some other things which aren't related to #258680, but which seemed worth noting while I was grepping past them. All of these pull in network-online.target but aren't ordered after it, as they almost certainly should be.

  • dnscrypt-proxy2
  • pptpd
  • xl2tpd
  • zerotierone
  • amazon-init
  • tests/jool.nix
  • beam.section.md - documents pulling in network-online.target but not ordering after it.
  • samba.target

And these ones are wantedBy = ["network-online.target"], when they should probably be consumers of network-online instead of part of it.

  • lifecycled (especially weird because it says wantedBy = ["network-online.target"]; but the unit shipped with the package says After=network-online.target and WantedBy=multi-user.target).
  • chisel-server
  • snowflake-proxy - This one's actually probably correct, but I'm unsure enough that it'd be nice for someone who knows it better to confirm.
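The difference between being part of network-online.target and consuming it can be sketched like this (a generic example, not the exact patch):

```nix
{
  # A unit that is *part of* the target (usually only correct for
  # things that actually bring the network up):
  #   wantedBy = [ "network-online.target" ];
  #
  # A unit that *consumes* an online network:
  systemd.services.example = {
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```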

lf- added 13 commits January 24, 2024 23:36
lifecycled is a thing that monitors for your server being about to be
killed by The Cloud. It definitely depends on actually having the
network configured, and I don't know why it was depending on
network-online.target; that seems like a bug.
This is a server offering network access to others. Maybe it needs a
network to bootstrap itself, idk. But it's not needed to bring the
machine online.
If we believe the author of this module that network-online.target is
required to begin with (possibly dubious), we need to be ordered after
it.
Although we don't really know if network-online.target is actually
necessary in the first place, it seems to be authorial intent to be
after it.
We suspect that this is the authorial intent. Not sure why it requires
after=multi-user.target, seems like suspicious code, but leaving it
as-is.
This is literally netcat listening on a port on [::], this definitely
does not depend on network-online.
If you require network-online.target (which we really only put in here
conservatively, it's often not necessary), you want to be ordered after
it. We can change After=network.target to After=network-online.target,
because the latter is transitively After the former.
Reading the commit message on 50df42f,
depending on network-online.target is definitely intentional and
potentially fixes real issues.

I *think* sticking it on the target should work fine, because the
services are brought in by the target, and if the target itself is
After=, that dependency relation should hold.
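Concretely, putting the dependency on the target itself might look like this sketch (assuming, as the commit message above does, that ordering on the target propagates usefully to its member services):

```nix
{
  systemd.targets.samba = {
    # The member services are brought in by this target, so ordering the
    # target itself after network-online.target should hold for them too.
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```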
@lf- (Member Author) commented Jan 25, 2024

I grepped through nixpkgs and found all these services that are at least suspicious still since #258680.

* `tarsnap-restore-${name}`

Saw this in my previous pass but thought it ok. On second thought I think it should also be After=; I thought that Requires= put in an ordering dependency, but I don't think it does.
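For reference, Requires= indeed carries no ordering of its own; both directives are needed. A sketch with a placeholder unit name:

```nix
{
  systemd.services."tarsnap-restore-example" = {
    # Requires= only establishes a dependency; without After= the two
    # units may still be started concurrently.
    requires = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```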

* `errbot`

oops.

* `alertmanager-irc-replay`

oops.

* `sachet`
* `kea-ctrl-agent`

* `bitwarden-directory-connector-cli.timer`

* `lifecycled-queue-cleaner.timer`

I also found some other things which aren't related to #258680, but seemed worth noting while I was grepping passed them. All of these pull in network-online.target but aren't ordered after it like they almost certainly should be.

* `dnscrypt-proxy2`

Hm. It's required for DNS judging by its wantedby of nss-lookup.target but it also probably needs the network. Should it be ordered after the network is online?

* `pptpd`

Server. May not require it at all, but if it does require it, it should surely be after=. Last actually touched in 2016 besides tree-wide refactors I think. Same author as the next one :D

* `xl2tpd`

Suspect this author didn't know how systemd deps worked and nobody noticed it in reviews in 2016. Another VPN server. Honestly, maybe we should remove both of these entirely? idk. They seem to be either unmaintained or working perfectly?

* `zerotierone`

Fancy tailscale-like VPN thingy? I think? I can sorta understand their after=network.target wants=network-online.target, I suppose. Maybe their entire setup is wrong, idk. Not touching it.

* `amazon-init`

I can believe this requiring network. Probably authorial intent that is after the network.

* `tests/jool.nix`

Finally, one sufficiently obvious case where we can actually delete the dependency altogether, because it is obviously a misuse, for a netcat!

* `beam.section.md` - _documents_ pulling in `network-online.target` but not ordering after it.

Changed. Dubious whether it should even pull it in, but it is user-authored software that we can't assume anything about; maybe it hits an API on startup.

* `samba.target`

Found someone wrote a helpful commit message, yay: 50df42f. Appears the intent here is that network actually should be online with the correct IPs.

And these ones are wantedBy = ["network-online.target"], when they should probably be consumers of network-online instead of part of it.

* `lifecycled` (especially weird because it says `wantedBy = ["network-online.target"];` but the unit shipped with the package says `After=network-online.target` and `WantedBy=multi-user.target`).

* `chisel-server`

briefly reviewed these, I think I agree and I'm changing them.

* `snowflake-proxy` - This one's actually probably correct, but I'm unsure enough that it'd be nice for someone who knows it better to confirm.

@yayayayaka

Use the same dependency structure as networkd upstream.
Fixes NixOS#283717

It brings me *zero bloody pleasure* to say that we apparently
didn't have any tests for networkmanager of all things!
NixOS#71151

Successfully merging this pull request may close these issues.

Borgbackup timer doesn't depend on network-online.target
4 participants