nixos: fix some remaining consequences of network-online.target dep fix #282795
Fixes NixOS#282640
Consequence of: NixOS#258680
This seems to be caused by not successfully waiting for the container to actually set up the network, so the failure is unsurprising.
This flakiness might have been exposed by NixOS#258680, but it was definitely buggy before: rss2email finishes its job and exits, which means the unit would be considered stopped once it completed. It's not (and shouldn't be) a oneshot service, so we need a different method to detect its completion state.
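Since the unit can't be a oneshot, one hypothetical alternative for the test driver is to poll for something the service is known to produce rather than for unit state. A generic sketch (the helper and the polled command are made up for illustration, not from this PR):

```python
import time

def wait_for_condition(check, timeout=30.0, interval=0.5):
    """Poll `check` until it returns truthy or `timeout` seconds elapse.

    In a NixOS test, `check` could wrap something like
    machine.succeed("test -s /some/output/file") on an artifact
    the service writes when it has finished its work.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

This sidesteps unit-state races entirely: it doesn't matter whether systemd considers the service active, restarting, or exited, only whether the work observably happened.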
Seems to be that the hosts are not reachable by each other if network-online is not pulled in.
This one was waiting for systemd-networkd-wait-online.service, which is appropriately not pulled in by the default target.
I honestly have no idea why this was broken. It makes no sense at all, and I wonder if we have an unrelated race condition this papers over.

    (finished: waiting for unit sshd, in 10.23 seconds)
    server: waiting for unit iodined
    (finished: waiting for unit iodined, in 0.03 seconds)
    client: waiting for unit iodine-testClient
    client: waiting for the VM to finish booting
    client: Guest shell says: b'Spawning backdoor root shell...\n'
    client: connected to guest root shell
    client: (connecting took 0.00 seconds)
    (finished: waiting for the VM to finish booting, in 0.00 seconds)
    (finished: waiting for unit iodine-testClient, in 0.04 seconds)
    client: must succeed: check_ssh -H 10.53.53.1
    client: output: CRITICAL - Socket timeout after 10 seconds
Our running hypothesis for why this was broken is that the service transitioned active -> failed/inactive, wait_for_unit caught it in that state, and then it said "inactive, no pending jobs". The fix to this is to use RestartMode=direct, which will keep the unit active while the restart is in progress.
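Sketched in plain systemd unit syntax, the same settings look roughly like this (unit name hypothetical; `RestartMode=direct` requires a reasonably recent systemd):

```ini
# drop-in for a hypothetical example.service
[Service]
Restart=on-failure
RestartSec=1s
# Restart directly from the running state, skipping the
# failed/inactive intermediate state that wait_for_unit can observe.
RestartMode=direct
```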
I have genuinely no idea why this was necessary, and I suspect it could be masking a race condition.
https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75 It appears that somehow, `systemctl -q is-active` has different results if run too early. This makes no sense at all. We put in a sleep, since that fixes the problem, but the problem itself makes no sense.
For the most part, this all looks fine to me. I don't fully understand all of it, but that's mainly because of my lack of domain knowledge.
    @@ -28,7 +28,7 @@ import ./make-test-python.nix ({ pkgs, lib, ... }: {
       };
       group.root.exists = true;
       kernel-param."kernel.ostype".value = "Linux";
    -  service.goss = {
    +  service."systemd-journald" = {
why the change from goss to systemd-journald?
The shape of the failure is goss reporting that some service is not running. journald should be started very very early and is definitely enabled, so this was trying to make the whole thing more deterministic, in case systemd didn't acknowledge goss as running or something. Nevertheless it still would fail without the sleep.
This makes absolutely no sense, because the monitor here is simply invoking systemctl; I have no idea why that doesn't work, or why it fixes itself later in boot. This change and the delay should be enough to make this not happen anymore, but I have no idea why it was happening in the first place. https://github.com/goss-org/goss/blob/4e36e8fb52d999e418be7d72f14bad1bfbd65737/system/service_systemd.go#L75
AFAICT this is not my regression; at best it just exposes a test that was already flaky for bizarre reasons.
    after = [ "network.target" ];
    # Seems to save bad network state, breaking it if it is started too
    # early relative to the network.
Was this just always flaky? Or is this breakage new?
I think this breakage might be new, but I absolutely don't understand why it is happening, given that it is resolved by depending on network-online.target.
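The fix amounts to pulling in network-online.target and ordering after it; a minimal module-level sketch (the service name is a placeholder, not the actual unit touched here):

```nix
# Hypothetical service, for illustration only.
systemd.services.example = {
  wants = [ "network-online.target" ];
  after = [ "network-online.target" ];
};
```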
    with subtest("returns health status"):
        result = json.loads(machine.succeed("curl -sS http://localhost:8080/healthz"))
        print(result)
Guessing this `print` was for development purposes and can be removed? Not that it matters.
Yeah, it was left in so the test is debuggable if it fails in the future, since it doesn't print out the failures.
    @@ -23,6 +23,8 @@ import ./make-test-python.nix ({ pkgs, lib, ...} :
        DynamicUser = true;
        Restart = "on-failure";
        RestartSec = "1s";
    +   # restart service without going through failed/inactive state that confuses wait_for_unit
    +   RestartMode = "direct";
I'm curious why it seems to be expected that this service will need to restart, but I suppose it was already like that.
It needs to restart because the RTSP stream broker in the middle doesn't have its sockets up immediately, so the service has a few false starts connecting to the broker.
We could probably try harder, but shrug. This works.
Other breakage in that hydra eval: the X11-based tests failing might be indicative of needing tighter dependencies, but the rest of these just look like weird flakiness.
I've been seeing this for a few days:
I'm assuming mopidy also needs a patch?
yup
i will add it when i get to my computer. unsure why this was missed.
I grepped through nixpkgs and found all these services that are at least suspicious still since #258680.
I also found some other things which aren't related to #258680, but seemed worth noting while I was grepping past them. All of these pull in
And these ones are
…ector-cli under network-online dep fix
lifecycled is a thing that monitors for your server being about to be killed by The Cloud. It definitely depends on actually having the network configured, and I don't know why it was depending on network-online.target; that seems like a bug.
This is a server offering network access to others. Maybe it needs a network to bootstrap itself, idk. But it's not needed to bring the machine online.
If we believe the author of this module that network-online.target is required to begin with (possibly dubious), we need to be ordered after it.
Although we don't really know if network-online.target is actually necessary in the first place, it seems to be authorial intent to be after it.
We suspect that this is the authorial intent. Not sure why it requires after=multi-user.target, seems like suspicious code, but leaving it as-is.
This is literally netcat listening on a port on [::], this definitely does not depend on network-online.
If you require network-online.target (which we really only put in here conservatively, it's often not necessary), you want to be ordered after it. We can change After=network.target to After=network-online.target, because the latter is transitively After the former.
Reading the commit message on 50df42f, depending on network-online.target is definitely intentional and potentially fixes real issues. I *think* sticking it on the target should work fine, because the services are brought in by the target, and if the target itself is After=, that dependency relation should hold.
Saw this in my previous pass but thought it ok. On second thought I think it should also be After=, I thought that Requires= puts an ordering dependency but I don't think it does.
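That matches systemd's semantics: Requires= (like Wants=) only creates a dependency, not an ordering, so both directives are needed together. A generic sketch:

```ini
[Unit]
# Requires= alone still lets both units start in parallel;
# After= is what actually delays startup until the target is reached.
Requires=network-online.target
After=network-online.target
```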
oops.
oops.
Hm. It's required for DNS judging by its WantedBy=nss-lookup.target, but it also probably needs the network. Should it be ordered after the network is online?
Server. May not require it at all, but if it does require it, it should surely be after=. Last actually touched in 2016 besides tree-wide refactors I think. Same author as the next one :D
Suspect this author didn't know how systemd deps worked and nobody noticed it in reviews in 2016. Another vpn server. Honestly maybe we should remove these both entirely? idk. seem either unmaintained or work perfectly?
Fancy tailscale-like VPN thingy? I think? I can sorta understand their after=network.target wants=network-online.target, I suppose. Maybe their entire setup is wrong, idk. Not touching it.
I can believe this requiring network. Probably authorial intent that is after the network.
Finally one sufficiently obvious where we actually can delete the dependency altogether because it is obviously a misuse, for a netcat!
Changed. Dubious whether it should even pull it in but it is user authored software that we can't assume anything about, maybe it hits an api on startup.
Found someone wrote a helpful commit message, yay: 50df42f. Appears the intent here is that network actually should be online with the correct IPs.
briefly reviewed these, I think I agree and I'm changing them.
Use the same dependency structure as networkd upstream.
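For reference, upstream's wait-online unit is ordered before network-online.target and installed into it; roughly like this (abridged from memory, not verbatim — check the unit systemd actually ships):

```ini
# systemd-networkd-wait-online.service (abridged sketch)
[Unit]
Requires=systemd-networkd.service
After=systemd-networkd.service
Before=network-online.target

[Install]
WantedBy=network-online.target
```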
Fixes NixOS#283717

It brings me *zero bloody pleasure* to say that we apparently didn't have any tests for NetworkManager of all things! NixOS#71151
You can see the failed tests here: https://hydra.nixos.org/eval/1803728#tabs-now-fail
Some of these seem to just be flakiness that might be exposed by the change, some are baffling (like firefox), and some are weird timeouts that make no sense but might be related.