-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overload: Reset H2 server stream only use codec level reset mechanism #18895
Conversation
stream finished. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/assign @alyssawilk Mind taking a look? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, this is a nice subtle corner case - thanks for fixing and testing!
@@ -537,7 +537,16 @@ void ConnectionImpl::ServerStreamImpl::resetStream(StreamResetReason reason) { | |||
buffer_memory_account_->clearDownstream(); | |||
} | |||
|
|||
StreamImpl::resetStream(reason); | |||
if (buffer_memory_account_ && buffer_memory_account_->sawEnvoyStreamComplete()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the underlying problem here, is that the filter manager believes the stream to be complete, and the codec does not? I think we don't want the buffer accounting to be the entity handling this disconnect because there could be other entities with this lifetime confusion.
either entities which can down-call a reset should be explicitly made aware of a stream being complete, or the filter manager should handle reset being called after it thinks that the stream is complete. Make sense?
cc @snowp for second opinion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Envoy level stream object (ConnectionManagerImpl::ActiveStream
is complete) and so the filter manager also thinks the stream is complete, but the codec doesn't as you said as it has yet to call onStreamClose
.
AFAIK, the only entities that'll try to reset the stream at this point is either the reset_high_memory_stream
action or the OnPendingFlushTimer
though in the latter case the codec will know the stream has ended.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also cases in envoy mobile where we've had a reset-after-fin (Snow has reviewed) so again I'd rather make this a bit more resistant
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
if (buffer_memory_account_ && buffer_memory_account_->envoyStreamComplete()) { | ||
// We should use a lower level reset mechanism to reset the stream. | ||
resetStreamWorker(reason); | ||
if (parent_.sendPendingFramesAndHandleError()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some more offline discussion I'd suggested using local_end_stream_ rather than adding yet another way to communicate a stream is done. @KBaichoo thinks that's sketchy because of interplay between local_end_stream_ and the deferred reset code.
Pulling in @yanavlasov and @antoniovicente to comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interplay between local_end_stream and runResetCallbacks(reason)
I think can be sketchy (sorry if I didn't convey that well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change feels brittle.
I think we need to dig deeper and make object lifetimes more clear. We need to eliminate all cases where an object remains registered after deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's my thought too. @KBaichoo if we're in a hurry to get this landed I'd be OK merging with a TODO to fix if you can help resolve the real lifetime issues right after, but if we're not in a rush I think we want to sort out the lifetime issues now in a less fragile way.
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pushback. I've dug deeper and have done the following:
- Delay H1 ActiveRequest dtor until after the Envoy level stream is destroyed -- didn't have the invariant that codec stream outlived the envoy stream in local reply cases.
- With the above, it's safe to in
doDeferredStreamDestroy
access the response encoder (codec stream) and remove the envoy level stream from callbacks, and notify the codec level stream it's done. - Removed changes to buffer account interface
/retest |
Retrying Azure Pipelines: |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
stream destroyed. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
during reset. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
envoy/http/codec.h
Outdated
/** | ||
* Fired when the Envoy level stream representation has been destroyed. | ||
*/ | ||
virtual void onEnvoyStreamComplete() PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not clear why we need this.
If the HCM is going to delete an Envoy stream, it generally has either sent end_stream or resets the stream. The only case I can think of where it doesn't, is where the connection is reset, where I don't know why the codec would be doing work. Can you clue me in on what the life cycle of the stream is where those calls are insufficient for the codec to realize it shouldn't be calling back to Envoy any more?
May be easier to hash this out in real time -feel free to snag time on my calendar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to maintain the invariant that the codec level stream outlives the Envoy level stream (and hence the Envoy level stream can safely unsubscribe from callbacks in doDeferredStreamDestroy in the HCM), this was added.
The above invariant wasn’t true in the case for H1 local reply, wherein we’d have a dangling pointer to the underlying codec at the Envoy level. See the diagram below starting at local reply encoded:
I tried to go down the route of leveraging local_end_stream_
instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).
https://github.com/envoyproxy/envoy/blob/main/source/common/http/codec_helper.h#L42-L43
I wonder if this is good enough for now. @mattklein123 WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @alyssawilk that I'm not thrilled about adding more tricky lifetime logic to logic that is already too tricky and hard to understand.
Just so I understand the actual problem we are fixing, what is the crashing call stack prior to this fix? I'm not sure I understand the original problem.
I tried to go down the route of leveraging local_end_stream_ instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).
Possibly related to my question above, but is the local reply case the case that is broken wrt to the overload manager? I'm trying to understand the sequence of events that leads to the lifetime issue. I'm wondering if there is a different fix that we could do that would allow us to make this work with local_end_stream_, etc.
/retest |
Retrying Azure Pipelines: |
Looks like the main branch needs to be merged. |
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/retest |
Retrying Azure Pipelines: |
… tied up with Watermark callbacks. Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
envoy/http/codec.h
Outdated
/** | ||
* Fired when the Envoy level stream representation has been destroyed. | ||
*/ | ||
virtual void onEnvoyStreamComplete() PURE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @alyssawilk that I'm not thrilled about adding more tricky lifetime logic to logic that is already too tricky and hard to understand.
Just so I understand the actual problem we are fixing, what is the crashing call stack prior to this fix? I'm not sure I understand the original problem.
I tried to go down the route of leveraging local_end_stream_ instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).
Possibly related to my question above, but is the local reply case the case that is broken wrt to the overload manager? I'm trying to understand the sequence of events that leads to the lifetime issue. I'm wondering if there is a different fix that we could do that would allow us to make this work with local_end_stream_, etc.
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/retest |
Retrying Azure Pipelines: |
I've now address the comments, implementing the fix on the top of the refactor in #19062. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fantastic! Thanks for all your patience and refactoring reworking it - it's really going to help not just this use case but other lifetime issues!
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Will defer to @mattklein123 for merge on Monday.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thank you.
* main: (77 commits) Fix verify_and_print_latest_release logic (envoyproxy#19111) http2: drain only once when reached max_requests_per_connection (envoyproxy#19078) Overload: Reset H2 server stream only use codec level reset mechanism (envoyproxy#18895) Update QUICHE from c2ddf95dc to 7f2d442e3 (envoyproxy#19095) tools: Fix dependency checker release dates bug (envoyproxy#19109) cve_scan: Use `envoy.dependency.cve_scan` (envoyproxy#19047) tcp: fix overenthusiastic bounds on the new pool (envoyproxy#19036) dep: update Proxy-Wasm C++ host (2021-11-18). (envoyproxy#19074) build(deps): bump frozendict from 2.0.7 to 2.1.0 in /tools/base (envoyproxy#19080) kafka: dependency upgrades (envoyproxy#18995) build(deps): bump charset-normalizer in /tools/dependency (envoyproxy#19105) build(deps): bump slack-sdk in /.github/actions/pr_notifier (envoyproxy#19093) dep: Remove dependency - six (envoyproxy#19085) Remove requested_server_name_ field from StreamInfo (envoyproxy#19102) broken link path fix for items http_filters/grpc_json_transcoder_filter (envoyproxy#19101) quic: turn off GRO (envoyproxy#19088) Listener: Add global conn limit opt out. (envoyproxy#18876) Specify type for matching Subject Alternative Name. (envoyproxy#18628) Fix a broken example in Lua filter docs (envoyproxy#19086) Fix a small typo (envoyproxy#19058) ... Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Signed-off-by: Kevin Baichoo kbaichoo@google.com
Commit Message: Overload: Reset H2 server stream use lower level reset mechanism
Additional Description: If we saw the Envoy level stream is complete, overload action
reset_high_memory_stream
should only use the codec level reset mechanism to reset the H2 stream.Risk Level: med
Testing: integration test
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
[Optional Runtime guard:]
Fixes #18551
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]