Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Add a certificate rotation command #1435

Merged
merged 18 commits into from
May 20, 2021
Merged

Conversation

meyskens
Copy link
Contributor

@meyskens meyskens commented Mar 17, 2021

Add a certificate rotation command

This builds on top of #1201.
It checks off the todo items and moves it into a separate command as well as add support for certificate rotation of etcd on all platforms.
Currently it will not rotate the private keys as that will cause some issues with the trust chain. It also doesn't properly set CA lengths which will allow for better rotation of the leaf certificates.

TODO

  • Add documentation on how to use it and how to check your certificates
  • Add documentation on disaster recovery of all expired certificates(other PR?)
  • Think about tests

How to use

  1. Install a cluster and set certs_validity_period_hours = 1
  2. Test if everything works
  3. Set certs_validity_period_hours = <something bigger>
  4. Run lokoctl cluster certificate rotate
  5. Check the ca inside the cluster and kubectl for validity
  6. Let it run for an hour, if everything still works it means all certs are rotated

Testing done

  • Tested on
    • bare metal VMs
    • AWS
    • Equinix Metal
    • Tinkerbell

@meyskens meyskens changed the title Meyskens/continue cert rotation Add a certificate rotation command Mar 17, 2021
@meyskens meyskens force-pushed the meyskens/continue-cert-rotation branch 2 times, most recently from c567c95 to 13c8b03 Compare March 17, 2021 15:30
@meyskens
Copy link
Contributor Author

Added extra commits, will squash them later

Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add test for all that. Considering that all rotation functionality is exposed using cluster package, maybe we can import it for e2e tests and run the rotation from there? We don't need to test if rotation works, but we can test at least that it does not break the cluster :)

cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
docs/cli/lokoctl_cluster.md Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
cli/cmd/cluster/certificate-rotator.go Outdated Show resolved Hide resolved
@meyskens meyskens force-pushed the meyskens/continue-cert-rotation branch 5 times, most recently from 69664c2 to dcc4781 Compare March 23, 2021 12:25
@meyskens
Copy link
Contributor Author

I addressed all comments apart from a few that have to do with commits, will do those once I have e2e added to finalise the PR.
Further reviews welcome!

@meyskens meyskens force-pushed the meyskens/continue-cert-rotation branch 8 times, most recently from 598b2a4 to e66add5 Compare March 29, 2021 08:59
@iaguis iaguis linked an issue Apr 14, 2021 that may be closed by this pull request
Copy link
Member

@rata rata left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@meyskens @invidian nice work, thanks!

Sorry, I didn't manage to review it all, it is quite a long PR. I haven't finished the first commit yet (although it seems the rest is simpler). For the future, I think the first commit can be split a lot and be way simpler to review.

Something that is not yet clear to me (maybe it is because I haven't finished reviewing) but won't components using service accounts need to be restarted too? What happens to the running compoenents if we rotate the cluster CA? Do they fail immediately? Also, not sure what happens with components using service accounts in other rotation cases (like not CAs, etc.).

Thanks again for the changes, will continue to review later. Leaving this rounds of comments now, though :)

pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/platform/packet/packet.go Outdated Show resolved Hide resolved
@meyskens meyskens force-pushed the meyskens/continue-cert-rotation branch from e66add5 to ede9672 Compare April 19, 2021 08:44
@meyskens
Copy link
Contributor Author

Something that is not yet clear to me (maybe it is because I haven't finished reviewing) but won't components using service accounts need to be restarted too? What happens to the running components if we rotate the cluster CA? Do they fail immediately? Also, not sure what happens with components using service accounts in other rotation cases (like not CAs, etc.)

Since service account tokens are volatile I do not think this is an issue in most cases as they apps that use them should be designed to pick up changes. To be sure i did an experiment with cert-manager and rotated the CA (validity 1hour to 1000hours) then watched it's behavior. It had some watchers close due to the api server going down but as it is designed to do it picks them up again and after the 1 hour passed the controller still works. The Kubernetes controller runtime has picked up the new CA certificate.

@meyskens meyskens force-pushed the meyskens/continue-cert-rotation branch 2 times, most recently from bfd8587 to cca7dba Compare April 21, 2021 11:39
pkg/platform/aks/aks.go Outdated Show resolved Hide resolved
test/components/util/util.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
pkg/k8sutil/rollout.go Outdated Show resolved Hide resolved
test/components/util/util.go Show resolved Hide resolved
@iaguis iaguis force-pushed the meyskens/continue-cert-rotation branch from f293aed to 33705f5 Compare May 19, 2021 16:29
This allows running terraform apply with -parallelism=1.
@iaguis iaguis force-pushed the meyskens/continue-cert-rotation branch from 33705f5 to a19ca4f Compare May 19, 2021 16:42
invidian and others added 9 commits May 20, 2021 00:21
So we can specify what Deployments and DaemonSets each platform uses.
This package adds functions to roll out Deployments and DaemonSets, with
proper waiting.

This code is mostly taken from kubectl code so we mimic its behavior.
When any of certificates used by etcd is changed, trigger
copy-controller-secrets null_resource to copy etcd certificates on all
controller nodes and then restart etcd to ensure it picks up new
certificates, as it can only automatically pick up new client
certificates and private key, but not the CA certificate.
This exposes the certs_validity_period_hours option to the bare-metal
module in lokocfg

Signed-off-by: Maartje Eyskens <maartje@kinvolk.io>
@iaguis iaguis force-pushed the meyskens/continue-cert-rotation branch from a19ca4f to 053ca05 Compare May 19, 2021 22:22
@invidian
Copy link
Member

This CI fail on Packet seems concerning:

cert_rotate_disruptive_test.go:73: Rotating Certificates failed: running controlplane upgrade: upgrading controlplane component "calico-host-protection": getting chart values from Terraform: failed to get controlplane component values.yaml from Terraform: json: cannot unmarshal array into Go value of type string

https://yard.lokomotive-k8s.net/builds/2473474

@iaguis
Copy link
Contributor

iaguis commented May 20, 2021

It seems we do need the join in the end?

Maartje Eyskens and others added 6 commits May 20, 2021 15:22
They were not set as output in Terraform for Packet this caused the
control plane check of lokoctl and certificate rotation to error.

Co-authored-by: Iago López Galeiras <iago@kinvolk.io>
Signed-off-by: Maartje Eyskens <maartje@kinvolk.io>
So it can be reused when rotating certificates.
This commit adds automated certificate rotation using the cluster
certificate rotate command.

'lokoctl cluster certificate rotate' taints all certificate resources
and updates terraform to re-issue them, it waits for all service account
token secrets to be updated with the new CA certificate, then triggers a
restart on all system deployments to make sure they pick up new CA
certificate. This is the core of the certificate rotation logic.

Co-authored-by: Maartje Eyskens <maartje@kinvolk.io>
Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
Signed-off-by: Maartje Eyskens <maartje@kinvolk.io>
This adds an e2e test that rotates the certificates then checks:
* if all kubelets are up
* the kubeconfig got changed
* the API serves a longer valid cert

Signed-off-by: Maartje Eyskens <maartje@kinvolk.io>
@iaguis iaguis force-pushed the meyskens/continue-cert-rotation branch from 053ca05 to 228f504 Compare May 20, 2021 13:23
Copy link
Member

@invidian invidian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if doing more reviews make sense at this point. If CI passes, I guess it's fine to merge.

@iaguis iaguis merged commit f2e4416 into master May 20, 2021
@iaguis iaguis deleted the meyskens/continue-cert-rotation branch May 20, 2021 14:54
@iaguis iaguis mentioned this pull request Jun 23, 2021
1 task
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement subcommand for certificate rotation and document it's usage
5 participants