This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

cert-rotator: Add retry to cluster upgrade #1513

Closed
wants to merge 3 commits into from


surajssd
Member

A lot of the failures I have seen are in the upgrade path, so I am adding a retry here.

...
Ensuring controlplane component 'pod-checkpointer' is up to date... W0622 13:45:02.357960   17470 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0622 13:45:02.369443   17470 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0622 13:45:02.385962   17470 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
Done.
Ensuring controlplane component 'kube-apiserver' is up to date... Failed!
    cert_rotate_disruptive_test.go:73: Rotating Certificates failed: running controlplane upgrade: upgrading controlplane component "kube-apiserver": updating controlplane component: an error occurred while finding last successful release. original upgrade error: Get "https://ci1624368234-pp.test.lokomotive-k8s.net:6443/apis/apps/v1/namespaces/kube-system/daemonsets/kube-apiserver": dial tcp 3.67.242.144:6443: connect: connection refused: Kubernetes cluster unreachable: Get "https://ci1624368234-pp.test.lokomotive-k8s.net:6443/version?timeout=32s": dial tcp 3.66.10.41:6443: connect: connection refused
--- FAIL: TestCertificateRotate (159.54s)
=== RUN   TestControlplaneComponentsDaemonSe
...

@surajssd surajssd marked this pull request as draft June 23, 2021 09:05
@surajssd surajssd force-pushed the surajssd/fix-aws-cert-rotate branch from eacd2f9 to 75e0bfb Compare June 23, 2021 09:29
@surajssd
Member Author

The test was failing with the following error:

...
E0624 07:30:30.807781   19551 memcache.go:196] couldn't get resource list for tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
E0624 07:30:30.816596   19551 memcache.go:101] couldn't get resource list for tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
E0624 07:30:30.850619   19551 memcache.go:196] couldn't get resource list for tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
E0624 07:30:30.859862   19551 memcache.go:101] couldn't get resource list for tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
W0624 07:30:30.896272   19551 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0624 07:30:30.908244   19551 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
W0624 07:30:30.922323   19551 warnings.go:70] policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
Done.
Ensuring controlplane component 'kube-apiserver' is up to date... Failed!
    cert_rotate_disruptive_test.go:73: Rotating Certificates failed: running controlplane upgrade: upgrading controlplane component "kube-apiserver": updating controlplane component: another operation (install/upgrade/rollback) is in progress
--- FAIL: TestCertificateRotate (1972.98s)
...

When I triggered the test again and got hold of the CI cluster, I found the usual culprit: the kube-apiserver chart stuck in the pending-upgrade state:

$ helm history kube-apiserver
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION
1               Mon Jun 28 07:09:23 2021        deployed        kube-apiserver-0.1.4    v1.21.1         Install complete
2               Mon Jun 28 07:23:04 2021        pending-upgrade kube-apiserver-0.1.4    v1.21.1         Preparing upgrade

All I did was roll back to the last deployed revision, and the test went ahead just fine:

$ helm rollback kube-apiserver 1
Rollback was a success! Happy Helming!
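
The manual recovery above amounts to finding the most recent revision whose status is deployed and rolling back to it. Below is a minimal Go sketch of that selection step. The revision struct and the lastDeployed helper are hypothetical names that only mirror the REVISION and STATUS columns of the helm history output; they are not the Helm SDK's types.

```go
package main

import "fmt"

// revision is a hypothetical, simplified view of one row of `helm history`.
type revision struct {
	Number int
	Status string
}

// lastDeployed returns the highest revision number whose status is
// "deployed". ok is false when the history contains no such revision.
func lastDeployed(history []revision) (n int, ok bool) {
	for _, r := range history {
		if r.Status == "deployed" && (!ok || r.Number > n) {
			n, ok = r.Number, true
		}
	}
	return n, ok
}

func main() {
	// The two rows from the history listing above.
	history := []revision{
		{Number: 1, Status: "deployed"},
		{Number: 2, Status: "pending-upgrade"},
	}
	if n, ok := lastDeployed(history); ok {
		fmt.Printf("helm rollback kube-apiserver %d\n", n)
	}
}
```

With the history shown above, this picks revision 1, which matches the rollback that was run by hand.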

@surajssd
Member Author

Here is how I plan to fix the problem of the apiserver release getting stuck in a pending state.

At this code block, we can check whether the upgrade failed because the last release was stuck in a pending state. If so, we roll back to the last successful release and try the upgrade again. We can allow something like 10 tries before we give up.

if _, err := update.Run(component, helmChart, values); err != nil {
	fmt.Println("Failed!")
	return fmt.Errorf("updating controlplane component: %w", err)
}

For the above to work, I am counting on #1515 to be merged!

@surajssd surajssd force-pushed the surajssd/fix-aws-cert-rotate branch 2 times, most recently from 95df917 to 65b673d Compare June 30, 2021 08:09
Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
Signed-off-by: Suraj Deshmukh <suraj@kinvolk.io>
@surajssd surajssd force-pushed the surajssd/fix-aws-cert-rotate branch from 65b673d to 9726d9c Compare June 30, 2021 09:10
@surajssd
Member Author

Closing this PR, will test this code in a separate pipeline. Will leave the branch for testing.

@surajssd surajssd closed this Jun 30, 2021