Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't delete non-existent resources in non-wildcard mode in delta xds #488

Merged
merged 28 commits into from
Sep 15, 2021

Conversation

s-matyukevich
Copy link
Contributor

@s-matyukevich s-matyukevich commented Aug 27, 2021

Before this PR we used to always add resource to the RemovedResources field of the DeltaDiscoveryResponse if it doesn't exist on the server. This causes problems in the following scenario:

  1. Control plane instructs envoy to create a cluster "foo" via CDS.
  2. Envoy sends DeltaDiscoveryRequest with type ClusterLoadAssignment and ResourceNamesSubscribe field set to []string{"foo"}
  3. At the moment control plane doesn't have any endpoints for the cluster "foo", so it sends DeltaDiscoveryResponse with RemovedResources name set to []string{"foo"}, which instructs envoy to unsubscribe from this resource.
  4. Envoy sends ACK and control plane creates a delta watch that doesn't watch for changes in "foo".
  5. After some time a few endpoints are registered in the control plane for the cluster "foo". Control plane updates corresponding cache (either simple or linear)
  6. Envoy doesn't receive any updates.

This PR fixes the issue by preventing resources from being deleted for non-wildcard requests. Instead, now we are going to wait for the client to unsubscribe from resources by itself. This works fine if the client always uses wildcard requests to pull initial resources and it also uses non-wildcard requests only to pull dependent resources (This is the case with envoy: it always uses wildcard requests to pull clusters and listeners and then non-wildcard requests to pull all other dependent resources)

s-matyukevich and others added 16 commits July 1, 2021 12:05
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Co-authored-by: Alec Holmes <alec.holmes@greymatter.io>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
…ent and doesn't exist on the server in delta xds

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@s-matyukevich
Copy link
Contributor Author

I spend some time trying to debug failing delta ADS integration tests. I think they are failing because of broken ADS ordering (in particular, cluster updates are send before listener updates, which is causing the following error unknown cluster 'cluster-v0-0') I thing the ordering was broken before this PR but this change makes it apparent. Looking more into that.

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@s-matyukevich s-matyukevich changed the title Respond with empty resource if it was explicitly requested by the client and doesn't exist on the server in delta xds Don't delete non-existent resources in non-wildcard mode in delta xds Aug 27, 2021
@s-matyukevich
Copy link
Contributor Author

I spend some time trying to debug failing delta ADS integration tests. I think they are failing because of broken ADS ordering (in particular, cluster updates are send before listener updates, which is causing the following error unknown cluster 'cluster-v0-0') I thing the ordering was broken before this PR but this change makes it apparent. Looking more into that.

Changing ADS ordering didn't help. No mater how we order resources we will always have some races. Here are 2 cases that I checked:

  1. Update clusters first and then update routes
    In this case after you updated clusters the current routes still reference old clusters, which doesn't exist anymore. We are getting 503 errors with unknown cluster 'cluster-v0-0' error message
  2. Update routes first and then update clusters.
    In this case after we update routes the reference clusters that doesn't exist yes. Some requests are still failing with 503 unknown cluster 'cluster-v1-0'

The only solution I can think of is to skip deleting non-existent routes (they are requested in non-wildcard mode). They still will be removed after envoy receives listener updates and will figure out that nobody is referencing old routes anymore.

@s-matyukevich
Copy link
Contributor Author

This PR should be ready for review, cc @snowp @alecholmez

@alecholmez
Copy link
Contributor

@s-matyukevich I'm not sure if you've seen this PR: #461 but maybe the ordering of the resources issue can be fixed by porting this over to delta

@alecholmez
Copy link
Contributor

@snowp your input might be valuable here, I remember us having a conversation on this bit of the logic but I can't remember the decisions as to why we decided to not keep the blank string entry

@s-matyukevich
Copy link
Contributor Author

@s-matyukevich I'm not sure if you've seen this PR: #461 but maybe the ordering of the resources issue can be fixed by porting this over to delta

Yeah, I saw this PR and implemented ordering. It didn't. help me to fix integration tests because I was also trying to implement this part of the spec (sent DeltaDiscoverResponse with empty Resource field for non-existent resources) In this case envoy indeed deleted the resource immediately, which was causing issue mo matter how we order resources.

After I updated the code and stopped sending anything for non-existent resources requested in non-wildcard mode, everything started working as expected. (We just wait for envoy to unsubscribed from resources that are no longer references by any other resource) This also works ok with any resource ordering, so I deleted the ordering code to keep this PR focused.

I am not sure whether and how we are supposed to implement the above mentioned part of the spec.

@snowp
Copy link
Contributor

snowp commented Aug 30, 2021

Haven't read the full thread but this doesn't seem right:

At the moment control plane doesn't have any endpoints for the cluster "foo", so it sends DeltaDiscoveryResponse with RemovedResources name set to []string{"foo"}, which instructs envoy to unsubscribe from this resource.

Envoy sends ACK and control plane creates a delta watch that doesn't watch for changes in "foo".

Whether or not a resource is subscribed to should not be up to the server at all, the client knows what resources it wants. removed_resources should be used to instruct the client that a resource that it is subscribed to and thinks exist no longer does, but once it's removed from the client the subscription should remain (unless the client includes it in a unsubscribe call).

@snowp
Copy link
Contributor

snowp commented Aug 30, 2021

@snowp your input might be valuable here, I remember us having a conversation on this bit of the logic but I can't remember the decisions as to why we decided to not keep the blank string entry

I don't recall what the semantics of empty string used to mean, but we should definitely be keeping track of resources that are currently being subscribed to but they don't have a current value.

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@s-matyukevich
Copy link
Contributor Author

Haven't read the full thread but this doesn't seem right:

At the moment control plane doesn't have any endpoints for the cluster "foo", so it sends DeltaDiscoveryResponse with RemovedResources name set to []string{"foo"}, which instructs envoy to unsubscribe from this resource.

Envoy sends ACK and control plane creates a delta watch that doesn't watch for changes in "foo".

Whether or not a resource is subscribed to should not be up to the server at all, the client knows what resources it wants. removed_resources should be used to instruct the client that a resource that it is subscribed to and thinks exist no longer does, but once it's removed from the client the subscription should remain (unless the client includes it in a unsubscribe call).

I updated the code accordingly to this comment. Now we keep the subscription active for the non-existent resources while still populating removed_resources field. I'll test this in our staging env tomorrow.

@s-matyukevich
Copy link
Contributor Author

@snowp @alecholmez I tested the latest version of this PR in our staging environment and verified that it indeed fixes the issue.

@alecholmez
Copy link
Contributor

Cool I think this satisfies everything we need. @snowp this look good to you? I have nothing else to add

Copy link
Contributor

@alecholmez alecholmez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comment nitpicks. Everything else looks good though

pkg/cache/v3/delta.go Outdated Show resolved Hide resolved
pkg/cache/v3/delta.go Outdated Show resolved Hide resolved
pkg/cache/v3/delta.go Outdated Show resolved Hide resolved
Co-authored-by: Alec Holmes <alec.holmes@greymatter.io>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@alecholmez
Copy link
Contributor

alecholmez commented Sep 3, 2021

Hmm it seems like envoy is seg faulting..

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@s-matyukevich
Copy link
Contributor Author

Hmm it seems like envoy is seg faulting..

#490 (comment)

I am pretty sure something is wrong with the latest envoy dev build.

@alecholmez
Copy link
Contributor

I've just notified the Envoy team about the segfault

@s-matyukevich
Copy link
Contributor Author

s-matyukevich commented Sep 7, 2021

I've just notified the Envoy team about the segfault

Thanks! This #493 also could be helpful (Basically, I did a binary search on recent envoy commits and figured out the one after which the build starts failing) I was also trying to use stack_decode.py to decode the stack trace, but I have some issues with building the envoy binary with debug symbols and using it in CI.

@alecholmez
Copy link
Contributor

Awesome thank you @s-matyukevich. Would you also mind alerting the security team as well? Here's a doc for how to do so

@s-matyukevich
Copy link
Contributor Author

Awesome thank you @s-matyukevich. Would you also mind alerting the security team as well? Here's a doc for how to do so

Sure, done.

@s-matyukevich
Copy link
Contributor Author

@alecholmez This now passes CI, anything else needs to be done before we can merge?

@alecholmez alecholmez self-requested a review September 14, 2021 18:37
Copy link
Contributor

@alecholmez alecholmez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last 2 things! I'll approve and merge once these are addressed

@@ -56,13 +56,19 @@ func createDeltaResponse(ctx context.Context, req *DeltaRequest, state stream.St
filtered = append(filtered, r)
}
nextVersionMap[name] = nextVersion
} else {
// we track non-existent resources for non-wildcard streams until the client explicitly unsubscribes from them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove this space?

@@ -56,13 +56,19 @@ func createDeltaResponse(ctx context.Context, req *DeltaRequest, state stream.St
filtered = append(filtered, r)
}
nextVersionMap[name] = nextVersion
} else {
// we track non-existent resources for non-wildcard streams until the client explicitly unsubscribes from them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// we track non-existent resources for non-wildcard streams until the client explicitly unsubscribes from them.
// We track non-existent resources for non-wildcard streams until the client explicitly unsubscribes from them.

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
@s-matyukevich
Copy link
Contributor Author

Last 2 things! I'll approve and merge once these are addressed

Sure, done.

@@ -39,6 +39,7 @@ ENVOY_PID=$!
function cleanup() {
kill ${ENVOY_PID} ${UPSTREAM_PID}
wait ${ENVOY_PID} ${UPSTREAM_PID} 2> /dev/null || true
cat ${ENVOY_LOG}
Copy link
Contributor

@alecholmez alecholmez Sep 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah can you remove this cat? We don't want to pollute this log here, i think this is just leftover from debugging

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I forgot about this, this is fixed now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you!

Signed-off-by: Sergey Matyukevich <s.matyukevich@gmail.com>
Copy link
Contributor

@alecholmez alecholmez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Thanks for all the cooperation @s-matyukevich 😄

@alecholmez alecholmez merged commit e483d0d into envoyproxy:main Sep 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants