CI: make upgrades more robust and skip 1m precautionary sleep #1904

consideRatio · 2020-11-10T03:24:09Z

@manics I had 34/34 success install jobs but had 1/2 failures on upgrade jobs when I removed the 1 minute delay and I think I understand at least some parts of the issues. Here is a failure example of an upgrade job.

Background

There was a 1m wait introduced in #1896, and this PR aims to address this.

Issues considered

Autohttps pod's readiness

The autohttps pod doesn't have a readiness probe, it was removed as it is required to be Ready to receive responses for the ACME challenge, but from the perspective of users, this isn't sufficient as a criteria.

await_autohttps_tls_cert_acquisition during upgrades

We currently await with await_autohttps_tls_cert_acquisition which inspects the logs for the text Issued certificate, but is this something that will show up if there isn't a need for issuing a certificate because there already exists one? I don't think so.

Autohttps pod's secret-sync container is too slow for upgrades

When we first install and then upgrade the Helm chart, the autohttp's pod is supposed to stash away its certificate but it may not have time to do so before we upgrade to the next version. Due to this, we may end up re-acquiring a certificate when we shouldn't.

Rolling updates

When we make rolling updates of the autohttps pod, and the previous pod hasn't saved the cert into a secret, it won't get loaded by the new pod as it should. Then, it may end up trying to acquire a new cert, but traffic will potentially be redirected to the old pod in a ready but terminating state.

Fixes implemented

Let's make await_autohttps_tls_cert_acquisition spot when the autohttps pod loads a certificate, no matter if it acquires it fresh from the ACME server or from a cache on disk from a mounted k8s Secret created by secret-sync.
We previously inspected the logs of pebble rather than autohttps. If autohttps debug logs are enabled, we can make use of the statement Added certificate to spot when it successfully adds a certificate.
Let's make await_autohttps_tls_cert_save await the secret-sync's saving of the cert for now
Let's make make secret-sync work smarter and not rely on a interval to update

Outcome

10/10 upgrade script attempts worked! I'll cleanup my debugging commits and merge this.

…tests

jupyterhub/zero-to-jupyterhub-k8s#1904

consideRatio requested a review from manics November 10, 2020 03:47

consideRatio added 4 commits November 10, 2020 06:22

CI: remove 1m wait

0602262

CI: add colors explicitly for GitHub workflows

5aa2be7

CI: always provide info on failures

df4d7db

CI: add wait for autohttps acquired certs to be saved during upgrade …

685d521

…tests

consideRatio force-pushed the pr/attempt-to-avoid-1m-sleep branch from b5a313a to 685d521 Compare November 10, 2020 05:23

consideRatio added the maintenance label Nov 10, 2020

consideRatio changed the title ~~Attempt to avoid 1m sleep in CI~~ CI: Attempt to avoid 1m sleep Nov 10, 2020

consideRatio changed the title ~~CI: Attempt to avoid 1m sleep~~ CI: make upgrades more robust and skip 1m precautionary sleep Nov 10, 2020

consideRatio merged commit f529f88 into jupyterhub:master Nov 10, 2020

consideRatio pushed a commit to jupyterhub/helm-chart that referenced this pull request Nov 10, 2020

[jupyterhub] Automatic update for commit 0.10.2-n018.h115402d8

79b9924

jupyterhub/zero-to-jupyterhub-k8s#1904

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI: make upgrades more robust and skip 1m precautionary sleep #1904

CI: make upgrades more robust and skip 1m precautionary sleep #1904

consideRatio commented Nov 10, 2020 •

edited

Loading

CI: make upgrades more robust and skip 1m precautionary sleep #1904

CI: make upgrades more robust and skip 1m precautionary sleep #1904

Conversation

consideRatio commented Nov 10, 2020 • edited Loading

Background

Issues considered

Autohttps pod's readiness

await_autohttps_tls_cert_acquisition during upgrades

Autohttps pod's secret-sync container is too slow for upgrades

Rolling updates

Fixes implemented

Outcome

consideRatio commented Nov 10, 2020 •

edited

Loading