Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for expected number of drivers starting API #233

Merged
merged 1 commit into from
Jan 5, 2021

Conversation

derekhiggins
Copy link
Member

@derekhiggins derekhiggins commented Dec 9, 2020

Currently the API will start even if the conductor is still
starting up, which can lead to failures if you try to deploy
nodes as soon as the API is accessible.

Co-Author: Steven Hardy shardy@redhat.com

This is a rework of a old PR to deal with this #122
but instead of looking at the drivers in the conductors table we look at hardware_types in conductor_hardware_interfaces

A change in the way ironic registers drivers is proposed here (this would also deal with the issue)
https://review.opendev.org/c/openstack/ironic/+/764911

@metal3-io-bot metal3-io-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 9, 2020
@derekhiggins
Copy link
Member Author

/hold
need to look into what needs to change if MARIADB_TLS_ENABLED=true
cc @hardys

@metal3-io-bot metal3-io-bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 9, 2020
@@ -4,4 +4,14 @@ export IRONIC_DEPLOYMENT="API"

. /bin/configure-ironic.sh

# Wait for ironic to load all expected drivers
# the DB query returns the number of unique hardware_type in the conductor_hardware_interfaces table
CONF_DRIVERS=$(crudini --get /etc/ironic/ironic.conf DEFAULT enabled_hardware_types | tr ',' '\n' | wc -l)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about how the plugins work in Ironic, but looking solely at the count makes me a little nervous.

Are there any default or builtin drivers that are always loaded, even if they are not listed in the config?

Would it make sense to compare the list of names, instead of just the count?

We've put some logic in the baremetal-operator to do a similar check, but it just waits for a non-zero count. Should that be made more sophisticated, too? That won't help the OpenShift installer, which the approach here does take care of.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about how the plugins work in Ironic, but looking solely at the count makes me a little nervous.

Are there any default or builtin drivers that are always loaded, even if they are not listed in the config?

I don't think so, there is a default value for enabled_hardware_types if not mentioned in the config

Would it make sense to compare the list of names, instead of just the count?

I guess it would be more definite, although I'm not sure it would be worth the extra complexity

We've put some logic in the baremetal-operator to do a similar check, but it just waits for a non-zero count. Should that be made more sophisticated, too? That won't help the OpenShift installer, which the approach here does take care of.

Checking for a non-zero count is not reliable as the drivers are loaded individually, resulting in
https://bugzilla.redhat.com/show_bug.cgi?id=1902653

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be made more sophisticated, too?

To do this the BMO would need to know what the full list of drivers is

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the BMO we could look for the driver we're going to need for the host we're reconciling, so we don't need the full list. @andfasano did that work, I think, so maybe he has ideas about how to extend it to handle this case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if there's a better place, but we could extract the driver name from the BareMetalHost.Spec.BMC.Address of the provisioned host and then check for it in the dependencies check here https://github.com/metal3-io/baremetal-operator/blob/d6eaf6774de775d83bdd47e04ea3d171ba7d4346/pkg/provisioner/ironic/dependencies.go#L66

Copy link
Member

@dtantsur dtantsur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix ironic so that we can rely on the API rather than checking ironic internals.

@dhellmann
Copy link
Member

Please fix ironic so that we can rely on the API rather than checking ironic internals.

What would that fix look like? Doesn't the API already tell us which drivers are loaded, but the problem is we can get a different answer depending on when we ask during startup?

Currently the API will start even if the conductor is still
starting up, which can lead to failures if you try to deploy
nodes as soon as the API is accessible.

Co-Author: Steven Hardy <shardy@redhat.com>
@derekhiggins
Copy link
Member Author

Please fix ironic so that we can rely on the API rather than checking ironic internals.

Will look at this again, in the meantime I've pushed up a newer PR that will use ssl for mysql if available.

@dtantsur
Copy link
Member

What would that fix look like? Doesn't the API already tell us which drivers are loaded, but the problem is we can get a different answer depending on when we ask during startup?

I think @derekhiggins has discovered that there is small window where the drivers are registered but not usable yet. We need to close this window in ironic, I don't think adding a workaround downstream is a good idea.

@derekhiggins
Copy link
Member Author

What would that fix look like? Doesn't the API already tell us which drivers are loaded, but the problem is we can get a different answer depending on when we ask during startup?

Correct, we get a different answer depending on what iteration we are in on this loop
https://github.com/openstack/ironic/blob/b8e4aba/ironic/conductor/base_manager.py#L375

I think @derekhiggins has discovered that there is small window where the drivers are registered but not usable yet. We need to close this window in ironic, I don't think adding a workaround downstream is a good idea.

No (although I did think this at one stage), the problem is that drivers are registered one at a time, once the BMO sees the first one it starts creating nodes but the one it needs might not be there yet.

Dealing with this in the BMO would mean also doing it in terraform-provider-ironic, and either keeping a list of expected drivers in both or mapping the install-config BMC URL to the expected driver name in ironic.

@andfasano
Copy link
Member

What would that fix look like? Doesn't the API already tell us which drivers are loaded, but the problem is we can get a different answer depending on when we ask during startup?

Correct, we get a different answer depending on what iteration we are in on this loop
https://github.com/openstack/ironic/blob/b8e4aba/ironic/conductor/base_manager.py#L375

I think @derekhiggins has discovered that there is small window where the drivers are registered but not usable yet. We need to close this window in ironic, I don't think adding a workaround downstream is a good idea.

No (although I did think this at one stage), the problem is that drivers are registered one at a time, once the BMO sees the first one it starts creating nodes but the one it needs might not be there yet.

Dealing with this in the BMO would mean also doing it in terraform-provider-ironic, and either keeping a list of expected drivers in both or mapping the install-config BMC URL to the expected driver name in ironic.

Well, at least in BMO we can use the host details in the reconcile loop to look for just the desired driver, as also @dhellmann suggested

@dtantsur
Copy link
Member

Dealing with this in the BMO would mean also doing it in terraform-provider-ironic, and either keeping a list of expected drivers in both or mapping the install-config BMC URL to the expected driver name in ironic.

I think it's an error to try registering a node if its driver is not loaded yet.

If we do want to fix it here, let's use ironic API, not direct mysql access.

@derekhiggins
Copy link
Member Author

Also, I've uploaded a WIP to ironic which deals with this by registering all interfaces at once
https://review.opendev.org/c/openstack/ironic/+/764911

@hardys
Copy link
Member

hardys commented Dec 17, 2020

I'm biased since this is based on a workaround I previously proposed, but unless we're confident we can fix this at the ironic API level IMO it's a reasonable temporary workaround.

As mentioned the issue is the checks in both BMO and terraform are insufficient as they only check for non-zero drivers and they're not loaded atomically.

Instead we either need a way to check that all drivers specified in the config are loaded before attempting deployment (which this does, albeit in a crude way), or some way to check the needed driver is available before attempting to register/deploy a host - I can see that's fairly simple in the BMO case, but I suspect it will be more difficult at the terraform level because of the way the provider checks and host provisioning are decoupled /cc @stbenjam

@stbenjam
Copy link
Member

This seems like a reasonable workaround to me, everyone on slower hardware or using nested virt hits this.

@dhellmann
Copy link
Member

Yeah, I think we definitely need a fix here until we get a fix into a version of Ironic that we can use. My only concern was with implementing it with a count vs. names, but I can go along with a count if we think that's sufficient.

@andfasano
Copy link
Member

At least it shouldn't be worse than the previous implementation (in BMO, where the provisioner could have been considered ready even thought the required driver wasn't loaded)

@derekhiggins
Copy link
Member Author

I see we now have a BMO patch to deal with this metal3-io/baremetal-operator#755
@andfasano will this patch need a similar patch for the installer?

I'll take the ironic patch out of WIP and we can decide in either using it or the BMO patch, in the mean time we can merge this if we want something thats ready now.

@andfasano
Copy link
Member

I see we now have a BMO patch to deal with this metal3-io/baremetal-operator#755
@andfasano will this patch need a similar patch for the installer?

That's my expectation, but I've not yet checked it

@stbenjam
Copy link
Member

stbenjam commented Dec 17, 2020

Right, that is what @hardys was getting at. We need a change to the ironic terraform provider to do this in the installer. I am not sure we can know when we are waiting for the API what drivers the user will need, it probably requires digging around in internal data structures. I do prefer the solution in this PR.

@dhellmann
Copy link
Member

/test-integration
/approve

@metal3-io-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekhiggins, dhellmann

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [derekhiggins,dhellmann]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam
Copy link
Member

/test-integration

@derekhiggins
Copy link
Member Author

/test-integration

ironic seems to have done its thing correctly
2020-12-17 20:20:14.052 1 INFO ironic.conductor.task_manager [req-ca55b2aa-e0c8-4b46-a182-037fae0eb739 - - - - -] Node d0aa9e23-dfd2-47a2-a1f8-652864eb138d moved to provision state "active" from state "deploying"; target provision state is "None"

@derekhiggins
Copy link
Member Author

/test-integration

@hardys
Copy link
Member

hardys commented Jan 5, 2021

/lgtm

@metal3-io-bot metal3-io-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 5, 2021
@derekhiggins
Copy link
Member Author

/unhold

@metal3-io-bot metal3-io-bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 5, 2021
@metal3-io-bot metal3-io-bot merged commit aa1ab28 into metal3-io:master Jan 5, 2021
elfosardo pushed a commit to elfosardo/ironic-image that referenced this pull request May 31, 2022
Bug 2023765: Compare IPs using the short form of IPv6 address
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants