Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OCPBUGS-25341]: perform operator apiService certificate validity checks directly #3217

Merged
merged 2 commits into from
Jun 20, 2024

Conversation

ankitathomas
Copy link
Contributor

Cert updates can occasionally fail silently, updating only the timestamps on the CSV without any changes to the underlying cert secret. This PR uses the cert expiry times directly to retry the refresh.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 3, 2024
@ankitathomas ankitathomas changed the title WIP: [OCPBUGS-25341]: perform operator apiService certificate validity checks directly [OCPBUGS-25341]: perform operator apiService certificate validity checks directly May 17, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 17, 2024
@ankitathomas ankitathomas force-pushed the cert-rotate branch 2 times, most recently from 5373cc0 to 1f59431 Compare May 21, 2024 13:39
Comment on lines -102 to -106
if !certs.Active(ca) {
logger.Warnf("CA cert not active")
errs = append(errs, fmt.Errorf("found the CA cert is not active"))
continue
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're no longer checking the certs (here and below)?

Copy link
Contributor Author

@ankitathomas ankitathomas May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've combined the cert checks with freshness immediately after the APIService checks for successful/failed CSVs. We don't necessarily need to repeat the checks here again.

@ankitathomas ankitathomas force-pushed the cert-rotate branch 2 times, most recently from 129df3a to 1fc69f8 Compare May 21, 2024 22:23
func (i *StrategyDeploymentInstaller) ShouldRotateCerts(s Strategy) (bool, error) {
strategy, ok := s.(*v1alpha1.StrategyDetailsDeployment)
if !ok {
return false, fmt.Errorf("attempted to install %s strategy with deployment installer", strategy.GetStrategyName())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this error a little more descriptive? To me it's just saying "we attempted an install" rather than "install failed because: ..." From what I can tell, looks like a failure due to a missing strategy? But I'm not super familiar with this code.

return false, fmt.Errorf("attempted to install %s strategy with deployment installer", strategy.GetStrategyName())
}

hasCerts := map[string]struct{}{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick and not required: we're using k8s.io/apimachinery/pkg/util/sets in v1 and inconsistently in v0. Might make things a little cleaner but it's not a big deal to me.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 23, 2024
Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 12, 2024
Copy link
Contributor

@tmshort tmshort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few questions right now.
However, and I realize this is pre-existing code, I don't like creating a certificate before checking it. The process should really be:

  1. Does the cert exist? No -> create the cert
  2. Is the cert need to be rotated? Yes -> rotate/create the cert

Right now, a cert is created every time the rotate check is to be done, and that is a BIG waste of CPU.

func HostnamesForService(serviceName, namespace string) []string {
return []string{
fmt.Sprintf("%s.%s", serviceName, namespace),
fmt.Sprintf("%s.%s.svc", serviceName, namespace),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not adding .cluster nor .cluster.local?

Copy link
Contributor Author

@ankitathomas ankitathomas Jun 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should resolve even without the cluster domain specified, though there shouldn't be any harm in adding the fqdns for the default cluster domains. We could possibly also lookup the actual domain similar to pkg/package-server/client/util.go, though that may not be necessary

} else if apierrors.IsNotFound(err) {
return true, nil
} else {
return false, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's some other error, what happens? This does not rotate the certificate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it means we enter exponential backoff (at least until we hit the 5m install plan timeout)?

Any other error is likely to be either temporary (in which case it's hopefully resolved next time around), or not (in which case we would likely fail in a similar way trying to rotate the cert).

The only scenario that is coming to mind where Get secret fails, but a subsequent Create or Update would succeed would be if we lack read permission but have write permission.

Comment on lines 2210 to 2215
// Re-sync if kube-apiserver was unavailable
if apierrors.IsServiceUnavailable(installErr) {
logger.WithError(installErr).Info("could not update install status")
syncError = installErr
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this moved lower? If there's an error, shouldn't we check it first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a similar thought. Maybe we shouldn't timeout the install plan if the reason for things taking so long is that KAS is unavailable.

Copy link
Member

@joelanford joelanford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick pass and I think this looks good overall, but to be honest I'm a bit lost in the depths of this code.

The one question that I had that came to mind: how do we force ourselves to check if certs need to be rotated? If nothing else in the system is changing, are we doing any of:

  • re-reconcile if the cert secret changes
  • re-reconcile at a time in the future base on the current secrets' expiration time.

It could totally be that we already handle that part correctly, but it seems like an important detail to understand about this bug.

return false, fmt.Errorf("failed to install %s strategy with deployment installer: unsupported deployment install strategy", strategy.GetStrategyName())
}

hasCerts := sets.NewString()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is deprecated. The sets package now supports/prefers the generic type.

Suggested change
hasCerts := sets.NewString()
hasCerts := sets.New[string]()

timeouts after API availability checks, add service FQDN to list of
hostnames.

Signed-off-by: Ankita Thomas <ankithom@redhat.com>
@ankitathomas
Copy link
Contributor Author

* re-reconcile if the cert secret changes

We already do this for all olm managed secrets

* re-reconcile at a time in the future base on the current secrets' expiration time.

With each reconcile, we check all certs for anything that expires in a day or less and rotate all of those, including the ones that are already expired. The problem was that we were checking the cert freshness timestamps on the CSV to make those checks, and those were sometimes being incorrectly updated when the cert rotate hadn't really succeeded.

@joelanford
Copy link
Member

With each reconcile, we check all certs for anything that expires in a day or less and rotate all of those, including the ones that are already expired.

But I'm wondering if there's a scenario where:

  1. I install an operator that needs a cert
  2. OLM creates the secrets, sets expiration to now + N
  3. Everything reconciles, steady state is achieved, no further changes are made to the CSV.
  4. N time passes, and the certs expire. Still no changes, so still no cert renewal?

Is there something that forces a re-reconcile inside the time window where:

  • The certs are not yet expired, but
  • They are close enough to expiration that we'll rotate them.

Copy link
Contributor

@tmshort tmshort left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Although we may want to look at the order of operations when a cert is to be rotated. As, this could lead to CPU spikes.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 18, 2024
@tmshort tmshort added this pull request to the merge queue Jun 18, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jun 18, 2024
@kevinrizza kevinrizza added this pull request to the merge queue Jun 20, 2024
Merged via the queue into operator-framework:master with commit 908da0c Jun 20, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants