nixos: fix some remaining consequences of network-online.target dep fix #282795

Open · wants to merge 33 commits into master
Conversation

@lf- (Member) commented Jan 22, 2024

Fixes #282640

Consequence of: #258680

You can see the failed tests here: https://hydra.nixos.org/eval/1803728#tabs-now-fail

Some of these seem to just be flakiness that might be exposed by the change, some are baffling (like firefox), and some are weird timeouts that make no sense but might be related.

Description of changes

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

This seems to be caused by not successfully waiting for the container to
actually set up the network, so this is an unsurprising failure.
This flakiness might have been exposed by
NixOS#258680 but it definitely was buggy
before, since rss2email finishes its job and exits, which means the unit
would be considered stopped if it completed. It's not (and shouldn't be)
a oneshot service, so we need to use a different method to detect its
completion state.
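One hedged way to detect completion of a simple (non-oneshot) service in a NixOS test is to poll for the job's side effect rather than the unit's state, since a service that exits reads as "inactive" to `wait_for_unit`. A sketch only; the marker path below is hypothetical, not the actual fix in this PR:

```nix
{
  testScript = ''
    machine.wait_for_unit("multi-user.target")
    # Hypothetical: wait for evidence that the job ran, instead of
    # asking systemd about a unit that may have already exited.
    machine.wait_until_succeeds("test -e /var/lib/rss2email/completed-marker")
  '';
}
```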
Seems to be that the hosts are not reachable by each other if
network-online is not pulled in.
This one was waiting for systemd-networkd-wait-online.service, which is
appropriately not pulled in by the default target.
I honestly have no idea why this was broken. It makes no sense at all,
and I wonder if we have an unrelated race condition this papers over.

(finished: waiting for unit sshd, in 10.23 seconds)
server: waiting for unit iodined
(finished: waiting for unit iodined, in 0.03 seconds)
client: waiting for unit iodine-testClient
client: waiting for the VM to finish booting
client: Guest shell says: b'Spawning backdoor root shell...\n'
client: connected to guest root shell
client: (connecting took 0.00 seconds)
(finished: waiting for the VM to finish booting, in 0.00 seconds)
(finished: waiting for unit iodine-testClient, in 0.04 seconds)
client: must succeed: check_ssh -H 10.53.53.1
client: output: CRITICAL - Socket timeout after 10 seconds
Our running hypothesis for why this was broken is that the service
transitioned active -> failed/inactive, wait_for_unit caught it in that
state, and then it said "inactive, no pending jobs".

The fix to this is to use RestartMode=direct, which will keep the unit
active while the restart is in progress.
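In NixOS module terms, the fix described above looks roughly like this (a sketch; `myservice` is a placeholder name, and I believe `RestartMode=` requires systemd 254 or newer):

```nix
{
  systemd.services.myservice.serviceConfig = {
    Restart = "on-failure";
    RestartSec = "1s";
    # Keeps the unit active while the restart is in progress, instead of
    # passing through failed/inactive where wait_for_unit can catch it.
    RestartMode = "direct";
  };
}
```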
@lf- lf- changed the title nixos/borgbackup: fix under network-online dep fix nixos: fix some remaining consequences of network-online.target dep fix Jan 22, 2024
@lf- lf- requested a review from mweinelt as a code owner January 22, 2024 07:14
I have genuinely no idea why this was necessary, and I suspect it could
be masking a race condition.
https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75
It appears that somehow, systemctl -q is-active has different results if
run too early. This makes no sense at all.

We put in a sleep since it fixes the problem, but the problem makes no
sense.
@lf- lf- requested a review from a team January 22, 2024 20:22
@ElvishJerricco (Contributor) left a comment
For the most part, this all looks fine to me, though I don't fully understand all of it; that's mainly due to my lack of domain knowledge.

@@ -28,7 +28,7 @@ import ./make-test-python.nix ({ pkgs, lib, ... }: {
};
group.root.exists = true;
kernel-param."kernel.ostype".value = "Linux";
service.goss = {
service."systemd-journald" = {
Contributor:
why the change from goss to systemd-journald?

lf- (Member Author):
The shape of the failure is goss reporting that some service is not running. journald should be started very very early and is definitely enabled, so this was trying to make the whole thing more deterministic, in case systemd didn't acknowledge goss as running or something. Nevertheless it still would fail without the sleep.

This makes absolutely no sense: the monitor here is simply invoking systemctl, and I have no idea why that fails, or why it fixes itself later in boot. This change and the delay should be enough to stop it happening, but I have no idea why it was happening in the first place. https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75

AFAICT this is not my regression; at best it exposes a test that was already flaky for bizarre reasons.

after = [ "network.target" ];

# Seems to save bad network state, breaking it if it is started too
# early relative to the network.
Contributor:

Was this just always flaky? Or is this breakage new?

lf- (Member Author):

I think this breakage might be new, but I absolutely don't understand why it is happening, given that it is resolved by depending on network-online.target.

with subtest("returns health status"):
result = json.loads(machine.succeed("curl -sS http://localhost:8080/healthz"))
print(result)
Contributor:
Guessing this print was for development purposes and can be removed? Not that it matters

lf- (Member Author):

Yeah, it was left in so the test is debuggable if it fails in the future, since it doesn't print out the failures.

@@ -23,6 +23,8 @@ import ./make-test-python.nix ({ pkgs, lib, ...} :
DynamicUser = true;
Restart = "on-failure";
RestartSec = "1s";
# restart service without going through failed/inactive state that confuses wait_for_unit
RestartMode = "direct";
Contributor:
I'm curious why it seems to be expected that this service will need to restart, but I suppose it was already like that.

lf- (Member Author):

It needs to restart because the rtsp stream broker in the middle doesn't have its sockets up immediately, so it will have a few false starts on connecting to the stream broker.

We could probably try harder, but shrug. This works.

@lf- (Member Author) commented Jan 23, 2024

Other breakage in that hydra eval:
  • bees: https://cache.nixos.org/log/zfbx1d4flzkw3qk30mv6hb1wyhhvgk2f-vm-test-run-bees.drv - believed to be at best a race condition, but doesn't touch the network, so ???
  • fcitx: https://cache.nixos.org/log/aakhfl6nr893c5k7882nj2zc5niw7dak-vm-test-run-fcitx5.drv - I have no idea at all honestly. Can't much understand why this would be our regression.
  • firefox: https://cache.nixos.org/log/hflfz6m3nzx7zc33w6mmm4fnwkl8qzhf-vm-test-run-firefox.drv - EMFILE system-wide. how?! Was told by K900 this is an existing issue.
  • floorp: https://cache.nixos.org/log/5h7c9s6b7y80hlbgihsw08ppw75mi2xz-vm-test-run-floorp.drv - maybe the same issue as firefox, but seems like a dodgy test design. idk.
  • hadoop: https://cache.nixos.org/log/vidh0xjw31n05ih0jp9075c2l8v9gq6r-vm-test-run-hadoop-hbase.drv - client disconnect? doesn't feel like it was our fault, but idk.
  • kafka: https://cache.nixos.org/log/l9ifr8d10vsfp4zlvfq26h5r571gvl3i-vm-test-run-kafka_2_8.drv - this is actually plausibly our fault, and probably fixed by making it depend on network-online.target somewhere, perhaps in the test, perhaps in the service. Leaning toward the service, because it was started by systemd and failed there. Cannot repro 😭
  • keycloak: https://cache.nixos.org/log/1rrljqp27nir794i1p728jz9hjwn8vdh-vm-test-run-keycloak.drv - 401s?!?! Can't really see this being our fault, but idk.
  • nixos-rebuild: https://cache.nixos.org/log/q8wpwws6fchnlp6byqdj7v982zpsz2kf-vm-test-run-nixos-rebuild-target-host.drv - works on my machine 😭
  • opentabletdriver: https://cache.nixos.org/log/v41kqqcbir81pnwgqfc9fmxq0705x409-vm-test-run-opentabletdriver.drv - suspect a race condition; may have been exposed by us but not caused by us.
  • starship: https://cache.nixos.org/log/pmjzgn5xyrw4q13l27n2k6il6f7n0lin-vm-test-run-starship.drv - at best it is broken because of network interfaces not being up, but this is a shell prompt?!
  • warzone2100: https://cache.nixos.org/log/rgkbmrmm6xlf5j75dni5fw6jhh3lyffv-vm-test-run-warzone2100.drv - ????? maybe a network dependency?? doubt it? but idk?

The x11 based ones failing might be indicative of needing tighter dependencies, but the rest of these just look like weird flakiness.

@jpotier (Contributor) commented Jan 23, 2024

I've been seeing this for a few days:

Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]: [14B blob data]
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        … while calling the 'head' builtin
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:          at /nix/store/rqdf3gfjq8zh488msnv62h03kkzr308q-source/lib/attrsets.nix:960:11:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           959|         || pred here (elemAt values 1) (head values) then
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           960|           head values
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:              |           ^
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           961|         else
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        … while evaluating the attribute 'value'
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:          at /nix/store/rqdf3gfjq8zh488msnv62h03kkzr308q-source/lib/modules.nix:809:9:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           808|     in warnDeprecation opt //
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           809|       { value = builtins.addErrorContext "while evaluating the option `${showOption loc}':" value;
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:              |         ^
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:           810|         inherit (res.defsFinal') highestPrio;
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        (stack trace truncated; use '--show-trace' to show the full trace)
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        error:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        Failed assertions:
Jan 23 08:00:33 kktdr nixos-upgrade-start[473727]:        - mopidy.service is ordered after 'network-online.target' but doesn't depend on it
Jan 23 08:00:33 kktdr systemd[1]: nixos-upgrade.service: Main process exited, code=exited, status=1/FAILURE
Jan 23 08:00:33 kktdr systemd[1]: nixos-upgrade.service: Failed with result 'exit-code'.
Jan 23 08:00:33 kktdr systemd[1]: Failed to start NixOS Upgrade.

I'm assuming mopidy also needs a patch?

@lf- (Member Author) commented Jan 23, 2024

yup

after = [ "network-online.target" "sound.target" ];

i will add it when i get to my computer. unsure why this was missed.
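The assertion in the log above fires because the unit is ordered `after` network-online.target without also pulling it in. The fix presumably pairs the two, roughly (a sketch, not the actual patch):

```nix
{
  systemd.services.mopidy = {
    # Ordering (after=) alone trips the NixOS assertion; the unit must
    # also pull the target in (wants=) for the ordering to take effect.
    wants = [ "network-online.target" ];
    after = [ "network-online.target" "sound.target" ];
  };
}
```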

@ElvishJerricco (Contributor) commented Jan 25, 2024

I grepped through nixpkgs and found all these services that are at least suspicious still since #258680.

  • tarsnap-restore-${name}
  • errbot
  • alertmanager-irc-replay
  • sachet
  • kea-ctrl-agent
  • bitwarden-directory-connector-cli.timer
  • lifecycled-queue-cleaner.timer

I also found some other things which aren't related to #258680, but which seemed worth noting while I was grepping past them. All of these pull in network-online.target but aren't ordered after it, as they almost certainly should be.

  • dnscrypt-proxy2
  • pptpd
  • xl2tpd
  • zerotierone
  • amazon-init
  • tests/jool.nix
  • beam.section.md - documents pulling in network-online.target but not ordering after it.
  • samba.target

And these ones are wantedBy = ["network-online.target"], when they should probably be consumers of network-online instead of part of it.

  • lifecycled (especially weird because it says wantedBy = ["network-online.target"]; but the unit shipped with the package says After=network-online.target and WantedBy=multi-user.target).
  • chisel-server
  • snowflake-proxy - This one's actually probably correct, but I'm unsure enough that it'd be nice for someone who knows it better to confirm.
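The difference between being part of network-online.target and consuming it can be sketched like this (a generic example, not the exact patch):

```nix
{
  # A unit that is *part of* the target (usually only correct for
  # things that actually bring the network up):
  #   wantedBy = [ "network-online.target" ];
  #
  # A unit that *consumes* an online network:
  systemd.services.example = {
    wantedBy = [ "multi-user.target" ];
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```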

lf- added 13 commits January 24, 2024 23:36
lifecycled is a thing that monitors for your server being about to be
killed by The Cloud. It definitely depends on actually having the
network configured, and I don't know why it was depending on
network-online.target; that seems like a bug.
This is a server offering network access to others. Maybe it needs a
network to bootstrap itself, idk. But it's not needed to bring the
machine online.
If we believe the author of this module that network-online.target is
required to begin with (possibly dubious), we need to be ordered after
it.
Although we don't really know if network-online.target is actually
necessary in the first place, it seems to be authorial intent to be
after it.
We suspect that this is the authorial intent. Not sure why it requires
after=multi-user.target, seems like suspicious code, but leaving it
as-is.
This is literally netcat listening on a port on [::], this definitely
does not depend on network-online.
If you require network-online.target (which we really only put in here
conservatively, it's often not necessary), you want to be ordered after
it. We can change After=network.target to After=network-online.target,
because the latter is transitively After the former.
Reading the commit message on 50df42f,
depending on network-online.target is definitely intentional and
potentially fixes real issues.

I *think* sticking it on the target should work fine, because the
services are brought in by the target, and if the target itself is
After=, that dependency relation should hold.
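Concretely, putting the dependency on the target itself might look like this sketch (assuming, as the commit message above does, that ordering on the target propagates usefully to its member services):

```nix
{
  systemd.targets.samba = {
    # The member services are brought in by this target, so ordering the
    # target itself after network-online.target should hold for them too.
    wants = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```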
@lf- (Member Author) commented Jan 25, 2024

I grepped through nixpkgs and found all these services that are at least suspicious still since #258680.

* `tarsnap-restore-${name}`

Saw this in my previous pass but thought it ok. On second thought I think it should also be After=; I thought that Requires= put in an ordering dependency, but I don't think it does.
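For reference, Requires= indeed carries no ordering of its own; both directives are needed. A sketch with a placeholder unit name:

```nix
{
  systemd.services."tarsnap-restore-example" = {
    # Requires= only establishes a dependency; without After= the two
    # units may still be started concurrently.
    requires = [ "network-online.target" ];
    after = [ "network-online.target" ];
  };
}
```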

* `errbot`

oops.

* `alertmanager-irc-replay`

oops.

* `sachet`
* `kea-ctrl-agent`

* `bitwarden-directory-connector-cli.timer`

* `lifecycled-queue-cleaner.timer`

I also found some other things which aren't related to #258680, but seemed worth noting while I was grepping passed them. All of these pull in network-online.target but aren't ordered after it like they almost certainly should be.

* `dnscrypt-proxy2`

Hm. It's required for DNS judging by its wantedby of nss-lookup.target but it also probably needs the network. Should it be ordered after the network is online?

* `pptpd`

Server. May not require it at all, but if it does require it, it should surely be after=. Last actually touched in 2016 besides tree-wide refactors I think. Same author as the next one :D

* `xl2tpd`

Suspect this author didn't know how systemd deps worked and nobody noticed it in reviews in 2016. Another VPN server. Honestly, maybe we should remove both of these entirely? idk. They seem to be either unmaintained or working perfectly?

* `zerotierone`

Fancy tailscale-like VPN thingy? I think? I can sorta understand their after=network.target wants=network-online.target, I suppose. Maybe their entire setup is wrong, idk. Not touching it.

* `amazon-init`

I can believe this requiring network. Probably authorial intent that is after the network.

* `tests/jool.nix`

Finally, one sufficiently obvious case where we can actually delete the dependency altogether, because it is obviously a misuse, for a netcat!

* `beam.section.md` - _documents_ pulling in `network-online.target` but not ordering after it.

Changed. Dubious whether it should even pull it in, but it is user-authored software that we can't assume anything about; maybe it hits an API on startup.

* `samba.target`

Found someone wrote a helpful commit message, yay: 50df42f. Appears the intent here is that network actually should be online with the correct IPs.

And these ones are wantedBy = ["network-online.target"], when they should probably be consumers of network-online instead of part of it.

* `lifecycled` (especially weird because it says `wantedBy = ["network-online.target"];` but the unit shipped with the package says `After=network-online.target` and `WantedBy=multi-user.target`).

* `chisel-server`

briefly reviewed these, I think I agree and I'm changing them.

* `snowflake-proxy` - This one's actually probably correct, but I'm unsure enough that it'd be nice for someone who knows it better to confirm.

@yayayayaka

Use the same dependency structure as networkd upstream.
Fixes NixOS#283717

It brings me *zero bloody pleasure* to say that we apparently
didn't have any tests for networkmanager of all things!
NixOS#71151

Successfully merging this pull request may close these issues.

Borgbackup timer doesn't depend on network-online.target
4 participants