Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI: make upgrades more robust and skip 1m precautionary sleep #1904

Merged

Conversation

consideRatio
Copy link
Member

@consideRatio consideRatio commented Nov 10, 2020

@manics I had 34/34 success install jobs but had 1/2 failures on upgrade jobs when I removed the 1 minute delay and I think I understand at least some parts of the issues. Here is a failure example of an upgrade job.

Background

There was a 1m wait introduced in #1896, and this PR aims to address this.

Issues considered

Autohttps pod's readiness

The autohttps pod doesn't have a readiness probe, it was removed as it is required to be Ready to receive responses for the ACME challenge, but from the perspective of users, this isn't sufficient as a criteria.

await_autohttps_tls_cert_acquisition during upgrades

We currently await with await_autohttps_tls_cert_acquisition which inspects the logs for the text Issued certificate, but is this something that will show up if there isn't a need for issuing a certificate because there already exists one? I don't think so.

Autohttps pod's secret-sync container is too slow for upgrades

When we first install and then upgrade the Helm chart, the autohttp's pod is supposed to stash away its certificate but it may not have time to do so before we upgrade to the next version. Due to this, we may end up re-acquiring a certificate when we shouldn't.

Rolling updates

When we make rolling updates of the autohttps pod, and the previous pod hasn't saved the cert into a secret, it won't get loaded by the new pod as it should. Then, it may end up trying to acquire a new cert, but traffic will potentially be redirected to the old pod in a ready but terminating state.

Fixes implemented

  • Let's make await_autohttps_tls_cert_acquisition spot when the autohttps pod loads a certificate, no matter if it acquires it fresh from the ACME server or from a cache on disk from a mounted k8s Secret created by secret-sync.
    We previously inspected the logs of pebble rather than autohttps. If autohttps debug logs are enabled, we can make use of the statement Added certificate to spot when it successfully adds a certificate.
  • Let's make await_autohttps_tls_cert_save await the secret-sync's saving of the cert for now
  • Let's make make secret-sync work smarter and not rely on a interval to update

Outcome

10/10 upgrade script attempts worked! I'll cleanup my debugging commits and merge this.

image

@consideRatio consideRatio changed the title Attempt to avoid 1m sleep in CI CI: Attempt to avoid 1m sleep Nov 10, 2020
@consideRatio consideRatio changed the title CI: Attempt to avoid 1m sleep CI: make upgrades more robust and skip 1m precautionary sleep Nov 10, 2020
@consideRatio consideRatio merged commit f529f88 into jupyterhub:master Nov 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant