-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Conditionally remove networkd online dependency on Ubuntu #5772
base: main
Are you sure you want to change the base?
Conversation
Take 2... I added a commit that instead of using a drop-in will call the systemd-networkd-wait-online binary using the exact same args that the service uses based on 'systemctl cat'. This removes the need for a costly daemon-reload in the cases we need to wait on network. I haven't touched integration tests since the first commit, so they will have known failures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @TheRealFalcon for working this. quick first pass review as your are actively coding/developing tests for this functionality
@@ -411,6 +474,9 @@ def main_init(name, args): | |||
mode = sources.DSMODE_LOCAL if args.local else sources.DSMODE_NETWORK | |||
|
|||
if mode == sources.DSMODE_NETWORK: | |||
if os.path.exists(init.paths.get_runpath(".wait-on-network")): | |||
LOG.debug("Will wait for network connectivity before continuing") | |||
init.distro.wait_for_network() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we now util.del_file("init.paths.get_runpath(".wait-on-network")) now that we've waited and succeeded on the wait_for_network() call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason to? /run
gets cleaned on boot, and if a cloud-init collect-logs
will contain the file as another piece of evidence. I don't think it would hurt any further invocations of cloud-init on the current boot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason would be later calls on the command line to cloud-init init
per use-cases like -#5726 not needing to actually shell out to run that wait again.
cloudinit/cmd/main.py
Outdated
_should_wait_via_cloud_config(config) | ||
for config in [ | ||
datasource.get_userdata_raw(), | ||
datasource.get_vendordata_raw(), | ||
datasource.get_vendordata2_raw(), | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a generator expression without having to calculate the truthy return values from userdata vedordata and vendordarta2? Then we can early exit on the first case of userdata/vendordata containing write_files or un-decodable file content without having to perform the extra calculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, good question. A generator expression won't accomplish that though.
I could do something like:
def _get_configs(datasource):
yield datasource.get_userdata_raw()
yield datasource.get_vendordata_raw()
yield datasource.get_vendordata2_raw()
def _should_wait_on_network(datasource):
...
return any(
_should_wait_via_cloud_config(config)
for config in _get_configs(datasource)
)
cloudinit/cmd/main.py
Outdated
if _should_wait_on_network(init.datasource): | ||
LOG.debug( | ||
"Network connectivity determined necessary for " | ||
"cloud-init's network stage" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Future me" in triage wants to know the reason we chose to wait on network. Can we log what condition triggered our network wait: no datasource vs non-#cloud-config user-data vs write_files in config. I also want this in logs to make it conspicuous why we are waiting in the event that we shouldn't be in some cases.
cloudinit/distros/debian.py
Outdated
@@ -220,6 +221,10 @@ def set_keymap(self, layout: str, model: str, variant: str, options: str): | |||
# be needed | |||
self.manage_service("restart", "console-setup") | |||
|
|||
def wait_for_network(self): | |||
"""Ensure that cloud-init's network service has network connectivity""" | |||
netplan.wait_for_network() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like this approach will work for broader environments than just netplan and just Ubuntu and/or Debian. I like that you are keeping it specific for the time being to environments which have netplan.
Don't we also want to confirm netplan.available()
before calling netplan.wait_for_network()
to avoid calling this on debian environments which don't have netplan.io installed?
This makes sense if the debian image contains netplan, but what happens on Debian images when they don't have have netplan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I was assuming it was fine for debian because of the netplan usage there too.
I think instead I should move this method into ubuntu.py. If netplan isn't available, we have nothing left to wait on. On Ubuntu, we can assume we have netplan available for supported images. If somebody goes off the beaten path in their image there's really no "right" thing to do other than log an error.
@@ -9,7 +9,9 @@ Wants=cloud-init-local.service | |||
Wants=sshd-keygen.service | |||
Wants=sshd.service | |||
After=cloud-init-local.service | |||
{% if variant not in ["debian", "ubuntu"] %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is "safe to include debian in this conditional here. There are plenty of images, even cloud image containing cloud-init, which also don't contain netplan.io in the image: lxc launch images:debian/bookworm/cloud
. In these cases, I'd not want default packaging to drop this ordering on Debian.
cloudinit/net/netplan.py
Outdated
LOG.debug("NetworkManager is enabled, skipping networkd wait") | ||
return | ||
wait_online_def: str = subp.subp( | ||
["systemctl", "cat", "systemd-networkd-wait-online.service"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Given that this function isn't really netplan-speciifc, it's more systemd-networkd specific, doesn't this helper function actually belong over in cloudinit/net/systemd.py?
- If we happen to run in an environment which doesn't have the systemd-networkd-wait-online.service the systemctl cat will exit 1, we should figure out how we want to behave in that environment. Info or WARNING log and return without waiting?
- If called on a system without netplan.io, do we want this function to log that condition and just return without waiting?
Traditionally, cloud-init-network.service (previously cloud-init.service) waited for network connectivity (via systemd service ordering) before running. This has caused cloud-init-network.service to block boot for a significant amount of time. For the vast majority of boots, this network connectivity isn't required. This commit removes the ordering After=systemd-networkd-wait-online.service, but checks the datasource and user data in the init-local timeframe to see if network connectivity will be necessary in the init network timeframe. If so, when the init network service starts, it will call systemd-networkd-wait-online manually in the same manner that the systemd-networkd-wait-online.service does to wait for network connectivity. This commit affects Ubuntu only due to the various number of service orderings and network renderers possible, along with the downstream synchronization needed. However, a new overrideable method in the Distro class should make this optimization trivial to implement for any other distro.
…call into the activators.
@blackboxsw , thanks for the review. I applied your comments and also made a change to move the wait into the activators. |
return True, "random_seed command found" | ||
if parsed_yaml.get("mounts"): | ||
return True, "mounts found" | ||
return False, "cloud-config does not contain network requiring elements" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good. thanks for the breakdown
return activators.select_activator(priority=priority) | ||
except activators.NoActivatorException: | ||
return None | ||
return activators.select_activator(priority=priority) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change I think you've also missed handling the exception at the call-site in cloudinit/cmd/devel/hotplug_hook.py... Although I think that call-site in hotplug hook is also broken anyway before this PR if cloud-init can't find an activator as activator return value would be None, and we'd try to reference None.bring_up_interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, lack of exception handling there was intentional.
Proposed Commit Message
Additional Context
Test Steps
Merge type