Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overload: Reset H2 server stream only use codec level reset mechanism #18895

Merged
merged 15 commits into from
Nov 29, 2021

Conversation

KBaichoo
Copy link
Contributor

@KBaichoo KBaichoo commented Nov 4, 2021

Signed-off-by: Kevin Baichoo kbaichoo@google.com

Commit Message: Overload: Reset H2 server stream use lower level reset mechanism
Additional Description: If we saw the Envoy level stream is complete, overload action reset_high_memory_stream should only use the codec level reset mechanism to reset the H2 stream.
Risk Level: med
Testing: integration test
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
[Optional Runtime guard:]
Fixes #18551
[Optional Fixes commit #PR or SHA]
[Optional Deprecated:]
[Optional API Considerations:]

stream finished.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

KBaichoo commented Nov 4, 2021

/assign @alyssawilk

Mind taking a look? Thanks!

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, this is a nice subtle corner case - thanks for fixing and testing!

envoy/buffer/buffer.h Outdated Show resolved Hide resolved
@@ -537,7 +537,16 @@ void ConnectionImpl::ServerStreamImpl::resetStream(StreamResetReason reason) {
buffer_memory_account_->clearDownstream();
}

StreamImpl::resetStream(reason);
if (buffer_memory_account_ && buffer_memory_account_->sawEnvoyStreamComplete()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the underlying problem here, is that the filter manager believes the stream to be complete, and the codec does not? I think we don't want the buffer accounting to be the entity handling this disconnect because there could be other entities with this lifetime confusion.

either entities which can down-call a reset should be explicitly made aware of a stream being complete, or the filter manager should handle reset being called after it thinks that the stream is complete. Make sense?
cc @snowp for second opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Envoy level stream object (ConnectionManagerImpl::ActiveStream is complete) and so the filter manager also thinks the stream is complete, but the codec doesn't as you said as it has yet to call onStreamClose.

AFAIK, the only entities that'll try to reset the stream at this point is either the reset_high_memory_stream action or the OnPendingFlushTimer though in the latter case the codec will know the stream has ended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's also cases in envoy mobile where we've had a reset-after-fin (Snow has reviewed) so again I'd rather make this a bit more resistant

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
if (buffer_memory_account_ && buffer_memory_account_->envoyStreamComplete()) {
// We should use a lower level reset mechanism to reset the stream.
resetStreamWorker(reason);
if (parent_.sendPendingFramesAndHandleError()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some more offline discussion I'd suggested using local_end_stream_ rather than adding yet another way to communicate a stream is done. @KBaichoo thinks that's sketchy because of interplay between local_end_stream_ and the deferred reset code.
Pulling in @yanavlasov and @antoniovicente to comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interplay between local_end_stream and runResetCallbacks(reason) I think can be sketchy (sorry if I didn't convey that well)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change feels brittle.

I think we need to dig deeper and make object lifetimes more clear. We need to eliminate all cases where an object remains registered after deletion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's my thought too. @KBaichoo if we're in a hurry to get this landed I'd be OK merging with a TODO to fix if you can help resolve the real lifetime issues right after, but if we're not in a rush I think we want to sort out the lifetime issues now in a less fragile way.
/wait

Copy link
Contributor Author

@KBaichoo KBaichoo Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pushback. I've dug deeper and have done the following:

  • Delay H1 ActiveRequest dtor until after the Envoy level stream is destroyed -- didn't have the invariant that codec stream outlived the envoy stream in local reply cases.
  • With the above, it's safe to in doDeferredStreamDestroy access the response encoder (codec stream) and remove the envoy level stream from callbacks, and notify the codec level stream it's done.
  • Removed changes to buffer account interface

@KBaichoo
Copy link
Contributor Author

KBaichoo commented Nov 4, 2021

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18895 (comment) was created by @KBaichoo.

see: more, trace.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
stream destroyed.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
during reset.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
/**
* Fired when the Envoy level stream representation has been destroyed.
*/
virtual void onEnvoyStreamComplete() PURE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not clear why we need this.
If the HCM is going to delete an Envoy stream, it generally has either sent end_stream or resets the stream. The only case I can think of where it doesn't, is where the connection is reset, where I don't know why the codec would be doing work. Can you clue me in on what the life cycle of the stream is where those calls are insufficient for the codec to realize it shouldn't be calling back to Envoy any more?

May be easier to hash this out in real time -feel free to snag time on my calendar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to maintain the invariant that the codec level stream outlives the Envoy level stream (and hence the Envoy level stream can safely unsubscribe from callbacks in doDeferredStreamDestroy in the HCM), this was added.

The above invariant wasn’t true in the case for H1 local reply, wherein we’d have a dangling pointer to the underlying codec at the Envoy level. See the diagram below starting at local reply encoded:

Dangling H1

I tried to go down the route of leveraging local_end_stream_ instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).

https://github.com/envoyproxy/envoy/blob/main/source/common/http/codec_helper.h#L42-L43

I wonder if this is good enough for now. @mattklein123 WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @alyssawilk that I'm not thrilled about adding more tricky lifetime logic to logic that is already too tricky and hard to understand.

Just so I understand the actual problem we are fixing, what is the crashing call stack prior to this fix? I'm not sure I understand the original problem.

I tried to go down the route of leveraging local_end_stream_ instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).

Possibly related to my question above, but is the local reply case the case that is broken wrt to the overload manager? I'm trying to understand the sequence of events that leads to the lifetime issue. I'm wondering if there is a different fix that we could do that would allow us to make this work with local_end_stream_, etc.

@KBaichoo
Copy link
Contributor Author

KBaichoo commented Nov 9, 2021

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18895 (comment) was created by @KBaichoo.

see: more, trace.

@rojkov
Copy link
Member

rojkov commented Nov 15, 2021

Looks like the main branch needs to be merged.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18895 (comment) was created by @KBaichoo.

see: more, trace.

… tied up with Watermark

callbacks.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
source/common/http/http2/codec_impl.cc Show resolved Hide resolved
envoy/http/codec.h Outdated Show resolved Hide resolved
source/common/http/conn_manager_impl.cc Show resolved Hide resolved
/**
* Fired when the Envoy level stream representation has been destroyed.
*/
virtual void onEnvoyStreamComplete() PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm with @alyssawilk that I'm not thrilled about adding more tricky lifetime logic to logic that is already too tricky and hard to understand.

Just so I understand the actual problem we are fixing, what is the crashing call stack prior to this fix? I'm not sure I understand the original problem.

I tried to go down the route of leveraging local_end_stream_ instead as suggested, e.g. if the HCM has sent the codec local_end_stream treat it as an implicit detach (don’t raise reset callbacks) but it seems to break an implicit invariant (breaking lots of integration tests).

Possibly related to my question above, but is the local reply case the case that is broken wrt to the overload manager? I'm trying to understand the sequence of events that leads to the lifetime issue. I'm wondering if there is a different fix that we could do that would allow us to make this work with local_end_stream_, etc.

Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
@KBaichoo
Copy link
Contributor Author

/retest

@repokitteh-read-only
Copy link

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #18895 (comment) was created by @KBaichoo.

see: more, trace.

@KBaichoo
Copy link
Contributor Author

I've now address the comments, implementing the fix on the top of the refactor in #19062.

Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! Thanks for all your patience and refactoring reworking it - it's really going to help not just this use case but other lifetime issues!

source/common/http/conn_manager_impl.cc Outdated Show resolved Hide resolved
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Signed-off-by: Kevin Baichoo <kbaichoo@google.com>
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Will defer to @mattklein123 for merge on Monday.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! Thank you.

@mattklein123 mattklein123 merged commit 3b45d6c into envoyproxy:main Nov 29, 2021
mpuncel added a commit to mpuncel/envoy that referenced this pull request Nov 30, 2021
* main: (77 commits)
  Fix verify_and_print_latest_release logic (envoyproxy#19111)
  http2: drain only once when reached max_requests_per_connection (envoyproxy#19078)
  Overload: Reset H2 server stream only use codec level reset mechanism (envoyproxy#18895)
  Update QUICHE from c2ddf95dc to 7f2d442e3 (envoyproxy#19095)
  tools: Fix dependency checker release dates bug (envoyproxy#19109)
  cve_scan: Use `envoy.dependency.cve_scan` (envoyproxy#19047)
  tcp: fix overenthusiastic bounds on the new pool (envoyproxy#19036)
  dep: update Proxy-Wasm C++ host (2021-11-18). (envoyproxy#19074)
  build(deps): bump frozendict from 2.0.7 to 2.1.0 in /tools/base (envoyproxy#19080)
  kafka: dependency upgrades (envoyproxy#18995)
  build(deps): bump charset-normalizer in /tools/dependency (envoyproxy#19105)
  build(deps): bump slack-sdk in /.github/actions/pr_notifier (envoyproxy#19093)
  dep: Remove dependency - six (envoyproxy#19085)
  Remove requested_server_name_ field from StreamInfo (envoyproxy#19102)
  broken link path fix for items http_filters/grpc_json_transcoder_filter (envoyproxy#19101)
  quic: turn off GRO (envoyproxy#19088)
  Listener: Add global conn limit opt out. (envoyproxy#18876)
  Specify type for matching Subject Alternative Name. (envoyproxy#18628)
  Fix a broken example in Lua filter docs (envoyproxy#19086)
  Fix a small typo (envoyproxy#19058)
  ...

Signed-off-by: Michael Puncel <mpuncel@squareup.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stream Callbacks: Cleanup observers
5 participants