
dryrun before update as reconciliation option #358

Closed
wants to merge 2 commits

Conversation


@eguzki eguzki commented Dec 5, 2023

What

Running a dry-run update for every update can put too much load on the API server. Exposing "dry-run update first" as an option that can be passed to the base reconciler may mitigate the issue, at least for some resources. This PR applies "dry-run update first" only to the reconciliation of the AuthConfig, because it is the only resource that has server-side defaults (kubebuilder defaults) and a mutating webhook involved.
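As a rough illustration only (hypothetical call site; the exact ReconcileResource signature and mutator name may differ from the code in this PR), a controller would opt in per resource like this:

// Hypothetical call site: only the AuthConfig reconciliation passes the
// option; every other resource keeps the plain compare-then-update path.
err := r.ReconcileResource(ctx, &authorinoapi.AuthConfig{}, desiredAuthConfig,
	alwaysUpdateAuthConfig, reconcilers.DryRunFirst)
if err != nil {
	return ctrl.Result{}, err
}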

@eguzki eguzki requested a review from a team as a code owner December 5, 2023 14:37

codecov bot commented Dec 5, 2023

Codecov Report

Attention: Patch coverage is 90.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 65.40%. Comparing base (0e08054) to head (8e8d31c).
Report is 275 commits behind head on main.

Files with missing lines Patch % Lines
pkg/reconcilers/base_reconciler.go 89.65% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #358      +/-   ##
==========================================
- Coverage   65.41%   65.40%   -0.02%     
==========================================
  Files          35       35              
  Lines        3843     3853      +10     
==========================================
+ Hits         2514     2520       +6     
- Misses       1132     1134       +2     
- Partials      197      199       +2     
Flag Coverage Δ
integration 70.27% <100.00%> (-0.41%) ⬇️
unit 60.01% <89.65%> (+0.46%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
api/v1beta1 (u) ∅ <ø> (∅)
pkg/common (u) 76.92% <ø> (ø)
pkg/istio (u) 37.11% <ø> (ø)
pkg/log (u) 31.81% <ø> (ø)
pkg/reconcilers (u) 36.96% <89.65%> (+3.74%) ⬆️
pkg/rlptools (u) 56.46% <ø> (ø)
controllers (i) 70.27% <100.00%> (-0.41%) ⬇️
Files with missing lines Coverage Δ
controllers/authpolicy_authconfig.go 66.48% <100.00%> (+0.17%) ⬆️
pkg/reconcilers/base_reconciler.go 47.36% <89.65%> (+9.50%) ⬆️

... and 4 files with indirect coverage changes

// by dry-running the updates before
DryRunFirst *bool

GetOptions []client.GetOption
Contributor Author

Not being used right now, but I think that having them in place makes it easy for someone to make use of them in the future.

@alexsnaps alexsnaps added this to the v0.6.0 milestone Dec 5, 2023
@alexsnaps alexsnaps added the target/next and kind/enhancement (New feature or request) labels Dec 5, 2023
Contributor

@guicassolato guicassolato left a comment

I think this change may open the door to more of #354 in the future. It aims to protect performance and, at the same time, be conservative about which reconciliation processes are affected by #356. I'm afraid it may backfire on both goals.

For every kind of resource the operator reconciles where a mutation of the state could slip under the radar of the function that compares existing vs. desired, mistakenly missing a no-op case and performing an unnecessary update, I would expect a dry-run first. If the operator watches those kinds of resources and, directly or indirectly, loops back to reconciling them again, then we are exposed to another endless loop at any time.

Unfortunately, it's hard to control all possible factors that could cause the state of the resource to mutate after it is created or updated. Defaults enforced via the OpenAPI schema are one; a mutating webhook is another.

Mutating webhooks can be almost completely transparent. One could set them without any acknowledgement by the owner of the resource or by the provider of the CRD the resource is an instance of.

I personally feel safer with the dry-run ensuring that the state (and therefore the generation/resourceVersion) will not change, than I'm concerned with the extra request to the API server at this point.

@eguzki
Contributor Author

eguzki commented Dec 5, 2023

I disagree.

  • The dry-run update is not a hard requirement. It is only needed when doing a "lazy" comparison like reflect.DeepEqual(existing, desired); a semantic comparison does not need the extra dry-run update (see the sketch below).
  • It just feels wrong to add extra traffic load on the API server when it is not necessary. We did not detect any issue with other resources. If something changes in the future and we detect endless loops, we know how to handle them: 1) semantic comparison, or 2) dry-run update.
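To illustrate what I mean by semantic comparison, here is a minimal sketch using a Deployment as a stand-in resource (illustrative only, not Kuadrant code): only fields the controller owns are compared, so server-side defaults on other fields never trigger an update and no dry-run is needed.

package reconcilers

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/utils/ptr"
)

// DeploymentMutator is a "semantic comparison" mutator sketch: it inspects only
// the fields this controller owns and copies them from desired into existing.
// Fields defaulted by the API server or added by a mutating webhook are left
// alone, so a no-op reconciliation never turns into an unnecessary update.
func DeploymentMutator(desired, existing *appsv1.Deployment) bool {
	update := false
	if ptr.Deref(existing.Spec.Replicas, 1) != ptr.Deref(desired.Spec.Replicas, 1) {
		existing.Spec.Replicas = desired.Spec.Replicas
		update = true
	}
	if existing.Spec.Template.Spec.ServiceAccountName != desired.Spec.Template.Spec.ServiceAccountName {
		existing.Spec.Template.Spec.ServiceAccountName = desired.Spec.Template.Spec.ServiceAccountName
		update = true
	}
	return update
}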

Take into account that the mutator receives the existing object read from the cluster, and that object carries all the mutations that have happened to it. So the information is there.

Comment on lines +112 to +130
// DryRunFirst sets the "dry run first" option to "true", ensuring all
// validation, mutations, etc. are run first, without persisting the change to storage.
var DryRunFirst = dryRunFirst{}

type dryRunFirst struct{}

// ApplyToReconcile applies this configuration to the given reconcile options.
func (dryRunFirst) ApplyToReconcile(opts *ReconcileOptions) {
	opts.DryRunFirst = ptr.To(true)
}

var _ ReconcileOption = &dryRunFirst{}

// ReconcileOption adds some customization for the reconciliation process.
type ReconcileOption interface {
	// ApplyToReconcile applies this configuration to the given reconcile options.
	ApplyToReconcile(*ReconcileOptions)
}

Collaborator

Just a matter of preference, no action needed, but I usually find it easier to read when types and interfaces are defined before they are used.

// validation, mutations, etc. are run first, without persisting the change to storage.
var DryRunFirst = dryRunFirst{}

type dryRunFirst struct{}
Collaborator

Maybe we could define dryRunFirst as a bool to make it explicit? We could still attach methods to it. Or are we planning to expand its props?

Contributor Author

It does not need to be a bool. When you pass the ReconcileOption implementation called DryRunFirst, you are adding the following configuration option:

opts.DryRunFirst = ptr.To(true)

The bool field already exists in the ReconcileOptions struct.
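For context, a minimal sketch (assumed, not copied from the PR's diff) of how the base reconciler could fold the variadic options into that struct before deciding whether to dry-run the update first:

package reconcilers

// newReconcileOptions is a hypothetical helper showing how the variadic
// ReconcileOption arguments could be folded into a ReconcileOptions value.
func newReconcileOptions(options ...ReconcileOption) *ReconcileOptions {
	opts := &ReconcileOptions{}
	for _, o := range options {
		o.ApplyToReconcile(opts)
	}
	return opts
}

// Later in the reconcile path (sketch; r.Client() is an assumed accessor):
//
//	if opts.DryRunFirst != nil && *opts.DryRunFirst {
//		if err := r.Client().Update(ctx, desired, client.DryRunAll); err != nil {
//			return err
//		}
//	}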

@eguzki eguzki force-pushed the dry-run-first-reconciliation-option branch from 2c55208 to 8e8d31c December 12, 2023 13:09
@eguzki
Contributor Author

eguzki commented Dec 12, 2023

After the meeting call, I am even more convinced that the dry-run update is not the way to go at all, except maybe for detecting mutating webhooks. Besides the extra traffic load and the alternative approach of semantic comparison, there is the issue of comparing things the controller does not control. Let me give a simple example to illustrate:

desired := D
existing := E
desiredMutated := mutate(desired) // the dry-run update: server defaults / webhooks mutate desired
if desiredMutated != existing {
	existing = desiredMutated
}

This is basically what our controller does with the dry-run update. And it defeats the purpose of the controller, which is to have existing = D; that is what all this is about.
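For contrast, without the dry-run the controller compares against its own desired state and heals the resource (same illustrative pseudocode):

desired := D
existing := E
if desired != existing {
	existing = desired // the controller enforces D
}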

So I do not think this is even good for the AuthConfig use case. The AuthConfig use case should be fixed by implementing the defaults.

@eguzki
Contributor Author

eguzki commented Dec 12, 2023

Another way to solve the endless loop without the dry-run update is to add an annotation on the reconciled resource, i.e. the authconfig object, with the generation (the version that should only change on an update to the spec) of the owner resource, i.e. the authpolicy.

The reconciliation would not update the resource when the generation does not change, thus breaking the endless loop.

For example, the generation of the authpolicy is 1. The controller updates the authconfig and adds the annotation with generation 1. When there is a reconciliation event, if the annotated generation is equal to the authpolicy's generation, the authconfig is not updated even though the spec does not match. This would prevent any healing from the controller when an external agent, a mutating webhook, or somebody doing a manual update changes the resource.
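A minimal sketch of the annotation idea, assuming a hypothetical annotation key (kuadrant.io/owner-generation is illustrative, not an existing Kuadrant annotation):

package reconcilers

import (
	"strconv"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownerGenerationAnnotation is a hypothetical annotation key, for illustration only.
const ownerGenerationAnnotation = "kuadrant.io/owner-generation"

// skipUpdate returns true when the reconciled object (e.g. the authconfig) is
// already annotated with the current generation of its owner (e.g. the
// authpolicy), so the controller can break the endless loop by not updating it.
// The trade-off: external changes to the reconciled object are never healed.
func skipUpdate(owner, reconciled metav1.Object) bool {
	annotated, ok := reconciled.GetAnnotations()[ownerGenerationAnnotation]
	return ok && annotated == strconv.FormatInt(owner.GetGeneration(), 10)
}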

@guicassolato
Contributor

I propose putting a pin in this one for now, while we research different ways to improve the current implementation and tackle the issue of the overhead caused by too many dry-runs post #356.

Right now, with the ubiquitous dry-running of all updates before every blind comparison of existing vs. desired state, Kuadrant Operator's implemented behaviour is to immediately give up the fight against any additional authorised component that disagrees with the operator about the desired state of the watched resources, letting the state determined by those other components always win.

This is not much of a problem for defaults enforced at the level of the OAS, because these values can still be set by the operator and will prevail naturally over the defaults.

As for mutating webhooks that modify the state of the resources after they are requested by our operator, if the cluster administrator granted those services enough permissions to act on the same resources managed by our operator, the operator will concede rather than keep fighting back.

An example of the above can be easily demonstrated by setting a mutating webhook that changes the value of a field originally set by the operator. A few steps to reproduce it, using Istio's EnvoyFilter CR as an example:

  1. Modify the EnvoyFilter reconciliation to inject an artificial time.Sleep call, so it simulates the operator always being behind in the race against a mutating webhook to update the resource, and therefore makes the example less subject to chance.
  2. Deploy the Kuadrant Operator with the modification above
  3. Create a mutating webhook that patches the value of spec.configPatches.patch.value.connect_timeout of all EnvoyFilter CRs to 2s (conflicting with the 1s value hard-coded by the operator.) Example of this step leveraging OPA Gatekeeper:
    kubectl apply -f https://raw.githubusercontent.com/open-policy-agent/gatekeeper/v3.14.0/deploy/gatekeeper.yaml
    kubectl apply -f - <<EOF
    apiVersion: mutations.gatekeeper.sh/v1
    kind: Assign
    metadata:
      name: envoyfilter-connection-timeout
    spec:
      applyTo:
      - groups: ["networking.istio.io"]
        kinds: ["EnvoyFilter"]
        versions: ["v1alpha3"]
      match:
        source: All
        scope: Namespaced
        kinds:
        - apiGroups: ["networking.istio.io"]
          kinds: ["EnvoyFilter"]
      location: "spec.configPatches[applyTo: *].patch.value.connect_timeout"
      parameters:
        assign:
          value: "2s"
    EOF
  4. Deploy an instance of Kuadrant and a sample API
  5. Create a simple RateLimitPolicy targeting the HTTPRoute or Gateway to the API

With the changes proposed in the PR, the above will activate an endless loop of reconciliation fighting between Kuadrant Operator and the OPA Gatekeeper mutating webhook (executed by the Gatekeeper controller), where the former wants to enforce a timeout of 1s for the connection between the gateway and Limitador, and the latter wants to change it to 2s. Roughly every few seconds (depending on the sleep call set at step 1), you should see Kuadrant Operator trying to reconcile the EnvoyFilter CR, followed by the Gatekeeper controller patching it back.

Without the changes proposed in this PR:

  1. Kuadrant Operator will create the EnvoyFilter CR with timeout set to 1s
  2. The webhook will patch it to 2s
  3. Kuadrant Operator will detect the change by the webhook and trigger another cycle of reconciliation of the gateway (focused on the EnvoyFilter object this time)
  4. Kuadrant Operator will compute the desired state of the EnvoyFilter CR with timeout set to 2s (by force of the dry-run update first)
  5. Kuadrant Operator will decide that the current state of the EnvoyFilter CR is already the desired one and will finish the reconciliation. The final value of the connection timeout will be 2s, as desired by the authorised webhook

A possible way to improve this flow in the future could be trying to make the operator aware of the conflict and somehow flag it so it can be solved from the outside. This of course opens up other issues, such as what state to enforce in the meantime. Should all policies be invalidated?

Another possible enhancement involves relying more on caching in the operator, so it does not have to perform the dry-run every time, but instead keeps track of changes previously honoured and avoids asking the API server.
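A rough sketch of what such a cache could look like (hypothetical helper, not part of this PR): keyed by object and resourceVersion, so the dry-run is skipped whenever the admitted state for that version was already seen.

package reconcilers

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// dryRunCache remembers, per object, the resourceVersion for which a dry-run
// was already performed and honoured, so repeated reconciliations can skip the
// extra API server round trip.
type dryRunCache struct {
	mu   sync.RWMutex
	seen map[types.NamespacedName]string
}

func newDryRunCache() *dryRunCache {
	return &dryRunCache{seen: map[types.NamespacedName]string{}}
}

func (c *dryRunCache) NeedsDryRun(key types.NamespacedName, resourceVersion string) bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.seen[key] != resourceVersion
}

func (c *dryRunCache) Record(key types.NamespacedName, resourceVersion string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.seen[key] = resourceVersion
}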

cc @chirino

@guicassolato
Contributor

Another way to solve the endless loop without the dry-run update is to add an annotation on the reconciled resource, i.e. the authconfig object, with the generation (the version that should only change on an update to the spec) of the owner resource, i.e. the authpolicy.

@eguzki, the state of the reconciled resource is computed out of multiple original resources, rather than only the policy, such as the Gateway and the HTTPRoute. Should we store the generations of all these origins of information in the annotations of the reconciled resource as well?

@guicassolato
Contributor

This is basically what our controller does with the dry-run update. And it defeats the purpose of the controller, which is to have existing = D; that is what all this is about.

IMO, existing should not be D. D is what our controller wants. Some other component (also authorised by the cluster admin to exist) wants F = D + something.

F should be the final state; otherwise the cluster admin can prohibit the other component from setting + something with RBAC.

@eguzki
Contributor Author

eguzki commented Dec 12, 2023

Another way to solve the endless loop without the dry-run update is to add an annotation on the reconciled resource, i.e. the authconfig object, with the generation (the version that should only change on an update to the spec) of the owner resource, i.e. the authpolicy.

@eguzki, the state of the reconciled resource is computed out of multiple original resources, rather than only the policy, such as the Gateway and the HTTPRoute. Should we store the generations of all these origins of information in the annotations of the reconciled resource as well?

Right, a simple annotation does not fix the issue for all use cases, only for the simple ones.

@eguzki
Contributor Author

eguzki commented Dec 12, 2023

This is basically what our controller does with the dry-run update. And it defeats the purpose of the controller, which is to have existing = D; that is what all this is about.

IMO, existing should not be D. D is what our controller wants. Some other component (also authorised by the cluster admin to exist) wants F = D + something.

F should be the final state; otherwise the cluster admin can prohibit the other component from setting + something with RBAC.

This is an approach that can be valid for some use cases. I am struggling to see such a use case, but I reckon it can be valid. The operator gives up and always delegates ownership to external agents. For these particular cases, they are not desired resources, but proposed objects instead.

However, there are certainly more common use cases where the controller needs the spec to be the desired one. And the dry-run update breaks this, opening the door to deployment and functionality issues. So, with the ubiquitous dry-running of all updates, we are applying the same approach to everything. This PR specifically calls for having a reconciliation option for that. If you want the authconfig to be F = D + something, ok. But that is not true for the other resources.

@guicassolato
Contributor

If you want the authconfig to be F = D + something, ok. But that is not true for the other resources.

The problem is you cannot tell, right? That something came from somewhere beyond the reach of the operator, yet authorised by the cluster admin. How do we fix that?

@eguzki
Contributor Author

eguzki commented Dec 12, 2023

One common use case is "create only" mutators, where objects are created and never updated afterwards. The mutators are plain "return false" methods. For these cases, the API server is being hit for no reason.
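A minimal sketch of such a create-only mutator, assuming the usual mutator shape of comparing existing vs. desired and returning whether an update is needed (illustrative signature):

package reconcilers

import (
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CreateOnlyMutator never requests an update: the object is created once and
// any later drift is intentionally ignored, so running a dry-run update before
// the comparison would hit the API server for no reason.
func CreateOnlyMutator(desired, existing client.Object) (bool, error) {
	return false, nil
}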

@eguzki
Contributor Author

eguzki commented Dec 12, 2023

If you want the authconfig to be F = D + something, ok. But that is not true for the other resources.

The problem is you cannot tell, right? That something came from somewhere beyond the reach of the operator, yet authorised by the cluster admin. How do we fix that?

That's like two agents authorised by the cluster admin racing to update the same fields. It's a conflict: one of them must give up or be removed. The cluster admin does not have the context to know whether mutating some resource may or may not break something. Regarding authpolicies, what if some external controller (a kind of virus) is updating some OPA policy defined in the authpolicy on the authconfig object? That is a very big security issue. Shouldn't the Kuadrant controller try to enforce the desired OPA policy?

Moreover, if the authpolicy controller allows external updates of the authconfig, why watch for updates on the authconfig at all? Removing the watcher, the endless loop issue is gone. The purpose of watching the authconfig resource is to ensure the value of some fields (or the entire spec).

@guicassolato
Contributor

two agents authorized by the cluster admin[…] what if some external controller (kind of a virus) is updating some opa policy defined in the authpolicy on the authconfig object?

You have to keep in mind that both agents were authorised, as you said it yourself. That in itself invalidates the example of the malicious software, which by definition is not authorised. Apart from the fact that it could just as easily target the source of truth directly, the one kind of resource that you have to be more permissive about.

What you are asking for is that the operator tries to protect the cluster admin against themselves, by allowing itself to go into an infinite loop, completely wastefully, on the grounds that the "cluster admin does not have the context to know." Well, if the cluster admin does not know, they should not authorise two agents acting on the same resource. This is a very simple RBAC rule – "on namespace X, only Kuadrant Operator can touch EnvoyFilters, WasmPlugins, AuthorizationPolicies; across the cluster, only Kuadrant Operator can touch AuthConfigs, Limitadors." This is up to the cluster admin to decide. The Kuadrant Operator cannot be more powerful than cluster admins!

This will only create rigidness. Take the example of the hard-coded connection timeout. If Limitador fails to respond within 1s, first of all, Kuadrant sucks, but worse, it is unfixable. Somebody needs to change that line of code, cut a new release, build a new image, and, in the meantime, the system is unusable. There's no workaround.

@guicassolato
Contributor

One common use case is "create only" mutators, where objects are created and never updated afterwards. The mutators are plain "return false" methods. For these cases, the API server is being hit for no reason.

This is a very good point, with which I can agree. If the Operator will not perform an update anyway (arguably something to reconsider), then neither should it execute the dry-run in those cases.

@eguzki
Contributor Author

eguzki commented Dec 13, 2023

I think this discussion is healthy and lets us all learn and share concerns and vision. If somebody disagrees, shout out and I will stop writing walls of text. That being said, my next five cents:

The Kuadrant Operator cannot be more powerful than cluster admins!

Of course not. Nobody is asking for that, just as nobody is asking the cluster admin to have knowledge of the Kuadrant operator's scope.

This is a very simple RBAC rule – "on namespace X, only Kuadrant Operator can touch EnvoyFilters, WasmPlugins, AuthorizationPolicies; across the cluster, only Kuadrant Operator can touch AuthConfigs, Limitadors."

This does not cover all use cases.

What you are asking for is that the operator tries to protect the cluster admin against themselves, by allowing itself going into an infinite loop, completely wastefully

First, the dry-run update does not entirely protect from infinite loops. An additional controller updating the same fields leads to an infinite loop, regardless of whether dry-run updates are run or not. A mutating webhook doing dynamic updates is also a source of endless loops, again regardless of dry-run updates; think about a mutating webhook adding a timestamp to a field. The dry-run update is a very weak approach that can leak issues now and in the future. IMO, the dry-run update can only be valid for defaults, and only if we, as developers, are lazy and do not want to implement the defaults, which are known whether the CRD is owned by us or not. And yes, they need to be maintained.

This will only create rigidness

On the contrary, this PR calls for freedom. The title literally says "dryrun before update as reconciliation option". Each controller can decide what to do. Rigidness is when all controllers are obliged to run dry-run updates no matter what. Furthermore, being rigid or not is a call each controller makes to manage the trade-off between fully controlled resources and create-only resources.

Taking your example, if the connection timeout should be updated externally because it depends on infra and other factors, the call should be to have the operator not reconcile that field (semantic comparison jumping in here); it should not be done by some mutating webhook. The limitador-operator does exactly this for the replicas field: https://github.com/Kuadrant/limitador-operator/blob/main/controllers/limitador_controller.go#L170-L172. If the replicas are set in the CR, the operator is rigid: the deployment must have that value, and if somebody changes it, the operator will revert the change. The operator is opinionated about the replicas when the field is set in the Limitador CR. However, if replicas is not set in the Limitador CR, the operator is not opinionated about the number of replicas in the deployment. Anyone (with permissions) can go and change the deployment replicas, and the operator will not revert the change. That external controller could be a HorizontalPodAutoscaler controller replacing the replication controller.

The whole point is that being rigid or not is a call each controller makes, but the controller should be able to make that call. With the dry-run update the controller can no longer ensure a value in the resource. A sketch of this pattern follows below.
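A condensed sketch of that pattern (illustrative, modelled on the linked limitador-operator code, not copied from it):

package reconcilers

import (
	appsv1 "k8s.io/api/apps/v1"
)

// reconcileReplicas is opinionated only when the owner CR sets a value:
// if desiredReplicas is nil, whatever is in the Deployment (e.g. set by a
// HorizontalPodAutoscaler or a human) is left untouched.
func reconcileReplicas(existing *appsv1.Deployment, desiredReplicas *int32) bool {
	if desiredReplicas == nil {
		return false // field not owned: never revert external changes
	}
	if existing.Spec.Replicas != nil && *existing.Spec.Replicas == *desiredReplicas {
		return false // already the desired value
	}
	existing.Spec.Replicas = desiredReplicas
	return true // enforce the value set in the CR
}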

I am going to change the envoy filter controller to allow connection timeout to be externally updated ;)

@guicassolato
Contributor

To be clear, I meant rigidness for one who uses Kuadrant, not for us (the developers of it.) I think that's undebatable.

@eguzki
Contributor Author

eguzki commented Dec 13, 2023

To be clear, I meant rigidness for one who uses Kuadrant, not for us (the developers of it.) I think that's undebatable.

To be rigid or not is the controller's call. Try to update the deployment associated with the Gateway API Gateway resource (automated deployment). Istio made a call: they do not accept updates on the deployment (at least on the command field), because you could break the service otherwise. On the other hand, they are not rigid about annotations: you can add annotations on the Gateway resource and those will be propagated to the deployment, the service, and so on.

Do you, as a user, find some controller's behaviour rigid? Request a change to their code or behaviour. Do not deploy a mutating webhook or a controller to update fields you do not own, even if you have the permissions to do so.

@chirino

chirino commented Dec 13, 2023

The API server and mutating webhooks have more authority over the final state of a resource than controllers, since they have ultimate control of what resources look like when they are admitted into the system. From a controller's point of view there is no difference between the API server and a mutating webhook.

A controller can NEVER win against them. So if a controller does detect that the API server has changed its update (we can check the result of the update call to see if the API server mutated it), we should accept that as the final desired state, since we can't win against the API server.

I'm 99% sure (can someone double check me) that the update call results can't contain an update from a different controller racing against our controller's update. We don't need to try to mitigate the reconcile loop that can occur from multiple controllers stepping on each other's changes. It is only the data race with the apiserver that concerns me.

@eguzki
Contributor Author

eguzki commented Dec 13, 2023

For updates, the API server requires a baseline from which to create a delta, so it requires the resourceVersion you want to update from. If the current resourceVersion in the etcd database is not the one in the update request, the API server rejects the update.

For patch updates (e.g. server-side apply), there is no need for this: the request is itself the delta. Race conditions are handled by the API server instead of being delegated to the client as with regular updates.
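For illustration, this is the usual optimistic-concurrency pattern with controller-runtime and client-go (a generic sketch, not code from this PR): the update carries the resourceVersion of the object that was read, and a conflict error means the client lost the race and must re-read before retrying.

package reconcilers

import (
	"context"

	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updateWithConflictRetry re-reads the object and re-applies the mutation
// whenever the API server rejects the update because the resourceVersion
// (the optimistic-concurrency baseline) is stale.
func updateWithConflictRetry(ctx context.Context, c client.Client, obj client.Object, mutate func()) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Read the latest version so the update carries a fresh resourceVersion.
		if err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj); err != nil {
			return err
		}
		mutate()
		return c.Update(ctx, obj)
	})
}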

Does this answer your question?

@chirino

chirino commented Dec 13, 2023

@eguzki that means that the controller looking at the result of the update call will NEVER see a change due to a different controller doing an update of the same resource right?

@eguzki
Contributor Author

eguzki commented Dec 13, 2023

@eguzki that means that the controller looking at the result of the update call will NEVER see a change due to a different controller doing an update of the same resource right?

The controller will not see an update coming from another controller in the result of its own update, but it will know that it lost the race and needs to re-read the resource before applying its changes to the latest version. That's regarding racing with another controller.

A mutating webhook is somewhat different. As you said, the mutating webhook is kind of an associate of the API server; they work side by side. Provided you win the race against another controller (or you are the only one in the cluster), so the resourceVersion provided is the latest one, any changes done by the API server (defaults) or by a mutating webhook will be seen in the object after the update. That is basically what the dry-run update does.

@alexsnaps alexsnaps modified the milestones: v0.6.0, v0.7.0 Dec 13, 2023
@chirino

chirino commented Dec 13, 2023

RBAC might be the solution to avoiding controller/controller data races. We should not try to mitigate this type of data race since there is no clear way to solve this problem.

But the scenario of the API server / mutating webhook modifying your update is something we could cope with. Like you said, it's similar to what the dry-run was doing. Perhaps we don't need to dry-run at all. We should just analyse the result of the update to see if it was changed from what we sent. If it was, we should cache the update result in some way so that we know it is the final desired state after defaults/webhooks are applied.
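A rough sketch of that idea (hypothetical helper): controller-runtime's Update fills the passed object with the state admitted by the API server, so comparing the spec before and after the call reveals whether defaults or webhooks changed what we sent.

package reconcilers

import (
	"context"

	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// specOf extracts only the spec of an object, so metadata that always changes
// on update (resourceVersion, managedFields, ...) is ignored in the comparison.
func specOf(obj client.Object) (map[string]interface{}, error) {
	u, err := runtime.DefaultUnstructuredConverter.ToUnstructured(obj)
	if err != nil {
		return nil, err
	}
	spec, _, err := unstructured.NestedMap(u, "spec")
	return spec, err
}

// updateAndDetectServerMutation sends the update and reports whether the API
// server (defaults) or a mutating webhook changed the spec we sent. The caller
// could then cache the admitted state as the effective desired state.
func updateAndDetectServerMutation(ctx context.Context, c client.Client, desired client.Object) (bool, error) {
	sentSpec, err := specOf(desired)
	if err != nil {
		return false, err
	}
	if err := c.Update(ctx, desired); err != nil {
		return false, err
	}
	// After Update, desired holds the state admitted by the API server.
	admittedSpec, err := specOf(desired)
	if err != nil {
		return false, err
	}
	return !equality.Semantic.DeepEqual(sentSpec, admittedSpec), nil
}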

@Boomatang
Contributor

Reading over this thread, I get the feeling that no matter what approach we take, we can never fully solve the problem, as the cluster admin can always grant more permissions to something else. With that in mind, what metrics do we expose that indicate we have detected an infinite loop? And based on those metrics, do we give recommendations on what alerts should be created to help the admin monitor the state of the services on their cluster?

As the AuthConfig (#354) is currently the only one causing an infinite loop, that is the problem we should be trying to solve. If/when more resources start to cause loops, then we should address those problems. At that point we will have more examples of what causes the issue and can make a more informed decision on what the best mitigation steps would be.

@eguzki
Contributor Author

eguzki commented Oct 8, 2024

This no longer makes sense; the logic is being refactored into the state-of-the-world controller.

related #916

@eguzki eguzki closed this Oct 8, 2024
Labels
kind/enhancement New feature or request
Projects
Status: Done

6 participants