Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delta ADS updates with simple cache not working for SDS #613

Closed
AmitKatyal-Sophos opened this issue Dec 9, 2022 · 0 comments
Closed

Delta ADS updates with simple cache not working for SDS #613

AmitKatyal-Sophos opened this issue Dec 9, 2022 · 0 comments

Comments

@AmitKatyal-Sophos
Copy link
Contributor

AmitKatyal-Sophos commented Dec 9, 2022

Hi all,

Problem

We started using Delta ADS for LDS, CDS, RDS and SDS resources. We have three instances of envoy proxy client connected to the management server. During testing, we found that some of the envoy proxy's were not receiving the SDS updates.

Root cause analysis

We investigated the issue by adding logs to the go-control-plane and figured out that control plane is managing the list of the subscribed resources for each client and before sending the updates it compares the hash of the resource set in the snapshot with the hash of the same resource in the subscription list. If the specified resource is not specified in the client's subscription list, then updates are not sent to the respective client. And we found that somehow the entry from the subscription list was getting removed and as a result the updates were not getting pushed to the envoy proxy.

Why resources were getting removed from the subscription list ?

processDelta handler responsible for handling the client requests and responses updates the "watch.state.resourceVersions" at two places,

[1] As part of the subscription request from the envoy "s.subscribe(req.GetResourceNamesSubscribe(), watch.state.GetResourceVersions())"

[2] After sending the response to the envoy proxy "watch.state.SetResourceVersions(resp.GetNextVersionMap())"

Here, the problem is in the update of the resourceversion map after sending the response. After sending the response, we are directly replacing the resourceversion map with the map specified in the response. This is problem for the non wildcard resource type.

Please find below the detailed use-case to understand why replacing the resourceversion map inside the response handling is problematic.

  1. Envoy requests for the resource (resource-1)

  2. In the request handler, stream state resourceversion map is updated to [resource-1: empty hash] & response-1 is created with newresouceversion map [resource-1: hash]

  3. Before the response is sent to the envoy, envoy sends the new request for the resource (resource-2). In the request handler, stream state resourceversion map is updated to [resource-1: empty hash, resource-2 empty hash] and response-2 is created with newresouceversion map [resource-1: hash, resource-2: hash].

  4. Response-1 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash]. After sending the response-1, the stream state resourceversion map will be updated to [resource-1: hash]

  5. Envoy sends another request for the resource-3. In the request handler, stream state resourceversion map is updated to [resource-1: empty hash, resource-3 empty hash] and response-3 is created with newresouceversion map [resource-1: hash, resource-3: hash].

  6. Response-2 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash, resource-2: hash]. After sending the response-2, the stream state resourceversion map will be [resource-1: hash, resource-2: hash]

  7. Response-3 is sent to the envoy proxy and stream state version map is updated with the newresouceversion map [resource-1: hash, resource-3: hash]. After sending the response-3, the stream state resourceversion map will be [resource-1: hash, resource-3: hash]

Final resourceversion map will be [resource-1: hash, resource-3: hash]. This is incorrect as we lost the subscription to the resource-2. Hence, any config updates to the resource-2 will not be sent to the envoy proxy.

Proposed solutions

Solution - 1. In the request handler, we are starting the go routine to write the response to the channel. If instead of the writing the response to the channel in a separate go routine, we send the response to the envoy proxy and update the resourceversion map synchronously, then we should not hit the above mentioned race condition.
Remove below go routine and directly send the response.

  1. Solution - 2. If we don't replace the subscription list in the response handler and instead only update the hashes of the responses sent to the envoy proxy, then we will not hit this issue. We also need to ensure that we add this handling only the non wildcard resources.

watch.state.SetResourceVersions(resp.GetNextVersionMap())

Replace the above code with the below changes. I've tested the solution - 2 and looks cleaner. Kindly review the above analysis and possible solutions and provide your comments. If we agree with the analysis and the fix, then I will raise the pull request for your review.

  for k, hash := range resp.GetNextVersionMap() {
    if currHash, found := watch.state.GetResourceVersions()[k]; found {
      if currHash != hash {
        watch.state.GetResourceVersions()[k] = hash
      }
    }
  }
} else {
  watch.state.SetResourceVersions(resp.GetNextVersionMap())
}


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant