CI: make upgrades more robust and skip 1m precautionary sleep #1904
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
@manics I had 34/34 success
install jobs
but had1/2
failures onupgrade
jobs when I removed the 1 minute delay and I think I understand at least some parts of the issues. Here is a failure example of anupgrade
job.Background
There was a 1m wait introduced in #1896, and this PR aims to address this.
Issues considered
Autohttps pod's readiness
The autohttps pod doesn't have a readiness probe, it was removed as it is required to be Ready to receive responses for the ACME challenge, but from the perspective of users, this isn't sufficient as a criteria.
await_autohttps_tls_cert_acquisition during upgrades
We currently await with
await_autohttps_tls_cert_acquisition
which inspects the logs for the textIssued certificate
, but is this something that will show up if there isn't a need for issuing a certificate because there already exists one? I don't think so.Autohttps pod's secret-sync container is too slow for upgrades
When we first install and then upgrade the Helm chart, the autohttp's pod is supposed to stash away its certificate but it may not have time to do so before we upgrade to the next version. Due to this, we may end up re-acquiring a certificate when we shouldn't.
Rolling updates
When we make rolling updates of the autohttps pod, and the previous pod hasn't saved the cert into a secret, it won't get loaded by the new pod as it should. Then, it may end up trying to acquire a new cert, but traffic will potentially be redirected to the old pod in a ready but terminating state.
Fixes implemented
await_autohttps_tls_cert_acquisition
spot when the autohttps pod loads a certificate, no matter if it acquires it fresh from the ACME server or from a cache on disk from a mounted k8s Secret created by secret-sync.We previously inspected the logs of pebble rather than autohttps. If autohttps debug logs are enabled, we can make use of the statement
Added certificate
to spot when it successfully adds a certificate.await_autohttps_tls_cert_save
await the secret-sync's saving of the cert for nowOutcome
10/10 upgrade script attempts worked! I'll cleanup my debugging commits and merge this.