Nested SPIRE architecture: NestedA workload gets an error invoking NestedB workload in one case #5317

Open
penghuazhou opened this issue Jul 20, 2024 · 6 comments
Assignees
Labels
priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress

Comments

@penghuazhou

penghuazhou commented Jul 20, 2024

How to reproduce:
1. Scale up a new Root Server pod; it will generate a new CA.
2. Scale up a new NestedB Server pod; it will generate a new intermediate CA.
3. Scale up a NestedB Agent; its workload SVIDs should be signed by the new CA.
4. The NestedA workload gets an error when invoking the NestedB workload.

Background knowledge:
1. At ttl/2, a new certificate is prepared for the intermediate and root CAs. The new intermediate or root certificate is only activated at ttl/6.
2. When preparing an intermediate certificate, the server ensures that the root certificate has been synchronized to the nested server before the intermediate certificate is successfully prepared.
3. The SPIRE agent synchronizes the trust bundle every 5 seconds.
4. The SPIRE agent notifies workloads of trust bundle changes at an interval between 5 seconds and 8 minutes.

  • Version: 1.9.6
  • Platform: linux-amd64
  • Subsystem: spire-agent, spire-server, DataStore mysql, NodeAttestor k8s_psat, UpstreamAuthority spire, Notifier k8sbundle


@MarcosDY MarcosDY added the triage/in-progress Issue triage is in progress label Jul 23, 2024
@MarcosDY
Collaborator

The force rotation feature may be able to help you update the current bundle intermediates inside each nested SPIRE.
It is still under development; you can track its status in the force rotation project.
Original issue: #1934

@penghuazhou
Author

penghuazhou commented Jul 29, 2024

> Force rotation feature may be able to help you to update the current bundle intermediates inside each nested SPIRE, this is still under development, you can track the status in force rotation project Original issue: #1934

@MarcosDY The force rotation feature updates the current bundle intermediates inside each nested SPIRE, but that also takes several seconds. During that window, the CA key generated by the scaled-up root server may already have issued a new nested server intermediate certificate, and that intermediate certificate may already have issued a workload's SVID. If this workload communicates with workloads whose bundle has not yet been synchronized, TLS errors will occur.
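
The failure mode described above can be reproduced in miniature with plain `crypto/x509`, without SPIRE. This is a simplified sketch, not the actual topology: the SVID is signed directly by the new root (no intermediate), and the names `issue`, `trusts`, and `demo` are made up for the example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// issue creates a certificate for name. With a nil parent the certificate is
// a self-signed CA; otherwise it is a leaf signed by the parent.
func issue(name string, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: name},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
	}
	if parent == nil {
		tmpl.IsCA = true
		tmpl.BasicConstraintsValid = true
		tmpl.KeyUsage = x509.KeyUsageCertSign
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// trusts reports whether leaf chains to any root in the given bundle.
func trusts(leaf *x509.Certificate, bundle ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, root := range bundle {
		pool.AddCert(root)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

// demo verifies an SVID signed by the new root against a stale bundle
// (old root only) and against a synchronized bundle (both roots).
func demo() (bool, bool, error) {
	oldRoot, _, err := issue("old-root", nil, nil)
	if err != nil {
		return false, false, err
	}
	newRoot, newKey, err := issue("new-root", nil, nil)
	if err != nil {
		return false, false, err
	}
	svid, _, err := issue("nestedB-workload", newRoot, newKey)
	if err != nil {
		return false, false, err
	}
	return trusts(svid, oldRoot), trusts(svid, oldRoot, newRoot), nil
}

func main() {
	stale, synced, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("verified with stale bundle: ", stale)
	fmt.Println("verified with synced bundle:", synced)
}
```

A peer holding only the old root rejects the SVID, while a peer whose bundle already contains the new root accepts it. In this sketch that gap is what the TLS errors come down to.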

@penghuazhou
Author

penghuazhou commented Jul 29, 2024

I think there are two possible solutions to this problem. Which one does the community plan to adopt? I can submit a PR.
1. When scaling up spire-server, the new spire-server pod copies the CA from an old pod. After that, the new spire-server rotates its CA independently.
2. Let the spire-server instances share a CA key. Only the spire-server that holds the lock can rotate the CA.

@sorindumitru
Contributor

I think the force rotation API by itself doesn't help, since it looks like you can only tell an existing server instance to prepare or rotate a CA. It would be good to have something (even within the force rotation APIs) that allows preparing a CA for use by a specific server instance at a later time. So you can:

  1. Prepare a CA for server instances, N+1 and N+2
  2. Wait for some amount of time for them to be propagated to all workloads
  3. Start instances N+1 and N+2 and have them use the prepared CA.

Alternatively, maybe this could be a CLI command that starts the new instance, runs it up to the point where the CA is prepared and activated, and then exits. That way you can run that new command as step 1 in the previous sequence.

@evan2645
Member

Thanks for reporting this, @penghuazhou, and thanks to @sorindumitru for jumping in.

We discussed this issue during SPIRE contributor sync today, and the consensus is that liveness and readiness checks in SPIRE should be solving this problem (but don't currently). When a new SPIRE Server boots at the root, readiness check should fail for ~some amount of time to allow the new root to propagate. After it's propagated, the readiness check can succeed and signing can begin.

I think there's a couple gotchas that need to be figured out as part of this work:

  • Probably don't want to do this if it's the first SPIRE Server being turned on ... under what conditions do we want the behavior, and when do we want to shortcut?
  • For how long should the readiness check fail? I feel there's no "good" answer because root servers don't have a full picture of bundle propagation and thus the decision will be open-loop

I'll move this issue to the backlog as unscoped ... once we have answers to the above two questions, I think we'll better understand the scope and be ready to accept the change. Thank you for volunteering to work on this @penghuazhou! I will go ahead and assign it to you as well.

@evan2645 evan2645 added priority/backlog Issue is approved and in the backlog unscoped The issue needs more design or understanding in order for the work to progress and removed triage/in-progress Issue triage is in progress labels Jul 30, 2024
@penghuazhou
Author

Thanks, I'm glad to be able to participate.


4 participants