Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(gossipsub): gracefully disable handler on stream errors #3625

Merged
merged 42 commits into from
Apr 14, 2023

Conversation

vnermolaev
Copy link
Contributor

@vnermolaev vnermolaev commented Mar 16, 2023

Description

Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake which is completely isolated from other protocols running on the same connection.

Instead of closing the connection, set KeepAlive::No.

Related: #3591.
Resolves: #3690.

Notes & open questions

Few items I'd like to raise.

  • Pieces of code such as
    return Poll::Ready(ConnectionHandlerEvent::Close(...));
    are replaced with self.keep_alive = KeepAlive::No and the error report. Unfortunately, I cannot meaningfully return from the poll function because it returns ConnectionHamdlerEvent and not an Option thereof.
  • Pieces of code such as
    loop {
        ...
        return Poll::Ready(ConnectionHandlerEvent::Close(...));
        ...
    }
    are replaced with self.keep_alive = KeepAlive::No and the error report. Again, it is impossible to return meaningfully; I just break; such an approach is also observed in the poll function.
    • sometimes errors are reported on the warn! level which is inconsistent with the proposed info! level

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@vnermolaev
Copy link
Contributor Author

vnermolaev commented Mar 16, 2023

@thomaseizinger, Will you please comment on the open questions in the first message?

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
@vnermolaev vnermolaev force-pushed the deprecate/gossipsub-close-event branch from bdf4a47 to 153b4cc Compare March 17, 2023 10:34
@vnermolaev vnermolaev force-pushed the deprecate/gossipsub-close-event branch from 153b4cc to 1264345 Compare March 17, 2023 10:49
@thomaseizinger
Copy link
Contributor

Meta-note: Please don't force push, it makes review unnecessarily difficult.

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, we are making progress in the right direction here.

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
@thomaseizinger thomaseizinger changed the title fix(gossipsub): remove ConnectionHandlerEvent::Close fix(gossipsub): disable handler instead of closing connection on stream errors Mar 21, 2023
@thomaseizinger thomaseizinger changed the title fix(gossipsub): disable handler instead of closing connection on stream errors fix(gossipsub): gracefully disable handler on stream errors Mar 21, 2023
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Some more thoughts.

protocols/gossipsub/src/error_priv.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pushing this forward!

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/error_priv.rs Outdated Show resolved Hide resolved
@vnermolaev
Copy link
Contributor Author

I did not expect to make harmonization changes. So in the first commits, I kept the surface as minimal as possible.

@thomaseizinger
Copy link
Contributor

I did not expect to make harmonization changes. So in the first commits, I kept the surface as minimal as possible.

I figured while we are at it, might as well tidy up the code a bit :)

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good from my end. I would like @mxinden to also have a look.

@vnermolaev vnermolaev marked this pull request as ready for review March 22, 2023 09:38
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall change looks good to me. Thanks for the work.

//CC @divagant-martian and @AgeManning to make sure this does not break any assumptions in lighthouse.

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
@mxinden
Copy link
Member

mxinden commented Mar 22, 2023

@p-shahi and @2color this pull request should fix the issue of libp2p-gossipsub closing the connection due to an outbound stream upgrade timing out. Note that it does not fix the root cause, namely the outbound stream upgrade timing out. It will only prevent the closing of the connection.

libp2p/universal-connectivity#3

Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Max!

protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Show resolved Hide resolved
swarm/CHANGELOG.md Outdated Show resolved Hide resolved
protocols/gossipsub/src/handler.rs Outdated Show resolved Hide resolved
@thomaseizinger
Copy link
Contributor

f999f3e splits the Handler into an EnabledHandler and a DisabledHandler making many invalid states impossible at compile time. In my eyes this makes it significantly easier to reason about the state of the handler and removes multiple edge cases. E.g. without this patch send_queue could have been growing unbounded in case the handler reached MAX_SUBSTREAM_CREATION but the gossipsub NetworkBehaviour continued pushing messages to the handler.

I know this is yet another large change to this already large pull request. To value everyone's time, @thomaseizinger would you mind taking a look first? Once this looks good to you we can ask for another larger round of reviews from other folks.

Yes, thank you for doing this. I think this makes a lot of sense. I am liking the direction this is going. Eventually, we might be able to add first-class support for "disabling" a handler to the swarm.

@mxinden
Copy link
Member

mxinden commented Apr 7, 2023

@thomaseizinger can you give this pull request another review?

thomaseizinger
thomaseizinger previously approved these changes Apr 10, 2023
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Should we extract the changes around void and ConnectionEvent into separate PRs?


if event.is_outbound() {
handler.outbound_substream_establishing = false;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

ConnectionEvent::FullyNegotiatedInbound(FullyNegotiatedInbound {
protocol,
..
}) => match protocol {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could flatten this match into the outer one.

@thomaseizinger
Copy link
Contributor

@vnermolaev I am sorry that we hijacked your PR like this 🙈

I hope you don't mind. We discovered several issues around handler errors with gossipsub and this PR helped us uncover those. Thanks for kicking it off and I hope it doesn't scare you back from contributing in the future :)


KeepAlive::Until(handler.last_io_activity + handler.idle_timeout)
}
Handler::Disabled(_) => KeepAlive::No,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this way it's possible we don't send the PeerKind::UnsupportedProtocol "on time" unless there is some kind of guarantee that poll will be called first (?). This event is important since we might want to get rid of (ban, or otherwise prevent connections to) this peer if it doesn't support what we need

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A handler is always completed drained of its events first before we even ask whether or not it should be shut down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See

loop {
match requested_substreams.poll_next_unpin(cx) {
Poll::Ready(Some(Ok(()))) => continue,
Poll::Ready(Some(Err(info))) => {
handler.on_connection_event(ConnectionEvent::DialUpgradeError(
DialUpgradeError {
info,
error: ConnectionHandlerUpgrErr::Timeout,
},
));
continue;
}
Poll::Ready(None) | Poll::Pending => {}
}
// Poll the [`ConnectionHandler`].
match handler.poll(cx) {
Poll::Pending => {}
Poll::Ready(ConnectionHandlerEvent::OutboundSubstreamRequest { protocol }) => {
let timeout = *protocol.timeout();
let (upgrade, user_data) = protocol.into_upgrade();
requested_substreams.push(SubstreamRequested::new(user_data, timeout, upgrade));
continue; // Poll handler until exhausted.
}
Poll::Ready(ConnectionHandlerEvent::Custom(event)) => {
return Poll::Ready(Ok(Event::Handler(event)));
}
Poll::Ready(ConnectionHandlerEvent::Close(err)) => {
return Poll::Ready(Err(ConnectionError::Handler(err)));
}
}
// In case the [`ConnectionHandler`] can not make any more progress, poll the negotiating outbound streams.
match negotiating_out.poll_next_unpin(cx) {
Poll::Pending | Poll::Ready(None) => {}
Poll::Ready(Some((info, Ok(protocol)))) => {
handler.on_connection_event(ConnectionEvent::FullyNegotiatedOutbound(
FullyNegotiatedOutbound { protocol, info },
));
continue;
}
Poll::Ready(Some((info, Err(error)))) => {
handler.on_connection_event(ConnectionEvent::DialUpgradeError(
DialUpgradeError { info, error },
));
continue;
}
}
// In case both the [`ConnectionHandler`] and the negotiating outbound streams can not
// make any more progress, poll the negotiating inbound streams.
match negotiating_in.poll_next_unpin(cx) {
Poll::Pending | Poll::Ready(None) => {}
Poll::Ready(Some((info, Ok(protocol)))) => {
handler.on_connection_event(ConnectionEvent::FullyNegotiatedInbound(
FullyNegotiatedInbound { protocol, info },
));
continue;
}
Poll::Ready(Some((info, Err(error)))) => {
handler.on_connection_event(ConnectionEvent::ListenUpgradeError(
ListenUpgradeError { info, error },
));
continue;
}
}
// Ask the handler whether it wants the connection (and the handler itself)
// to be kept alive, which determines the planned shutdown, if any.
let keep_alive = handler.connection_keep_alive();
match (&mut *shutdown, keep_alive) {
(Shutdown::Later(timer, deadline), KeepAlive::Until(t)) => {
if *deadline != t {
*deadline = t;
if let Some(dur) = deadline.checked_duration_since(Instant::now()) {
timer.reset(dur)
}
}
}
(_, KeepAlive::Until(t)) => {
if let Some(dur) = t.checked_duration_since(Instant::now()) {
*shutdown = Shutdown::Later(Delay::new(dur), t)
}
}
(_, KeepAlive::No) => *shutdown = Shutdown::Asap,
(_, KeepAlive::Yes) => *shutdown = Shutdown::None,
};
.

Copy link
Member

@mxinden mxinden Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to the above, we first "drain" (i.e. poll) the ConnectionHandler till it returns Poll::Pending:

// Poll the [`ConnectionHandler`].
match handler.poll(cx) {
Poll::Pending => {}
Poll::Ready(ConnectionHandlerEvent::OutboundSubstreamRequest { protocol }) => {
let timeout = *protocol.timeout();
let (upgrade, user_data) = protocol.into_upgrade();
requested_substreams.push(SubstreamRequested::new(user_data, timeout, upgrade));
continue; // Poll handler until exhausted.
}
Poll::Ready(ConnectionHandlerEvent::Custom(event)) => {
return Poll::Ready(Ok(Event::Handler(event)));
}
Poll::Ready(ConnectionHandlerEvent::Close(err)) => {
return Poll::Ready(Err(ConnectionError::Handler(err)));
}
}

Only then do we check the connection_keep_alive statsu:

// Ask the handler whether it wants the connection (and the handler itself)
// to be kept alive, which determines the planned shutdown, if any.
let keep_alive = handler.connection_keep_alive();
match (&mut *shutdown, keep_alive) {
(Shutdown::Later(timer, deadline), KeepAlive::Until(t)) => {
if *deadline != t {
*deadline = t;
if let Some(dur) = deadline.checked_duration_since(Instant::now()) {
timer.reset(dur)
}
}
}
(_, KeepAlive::Until(t)) => {
if let Some(dur) = t.checked_duration_since(Instant::now()) {
*shutdown = Shutdown::Later(Delay::new(dur), t)
}
}
(_, KeepAlive::No) => *shutdown = Shutdown::Asap,
(_, KeepAlive::Yes) => *shutdown = Shutdown::None,
};

@mxinden
Copy link
Member

mxinden commented Apr 11, 2023

@divagant-martian @AgeManning does either of you mind giving this another review?

@mxinden
Copy link
Member

mxinden commented Apr 12, 2023

@p-shahi with #3765 and #3767 merged into master and master merged into this pull request, this pull request again should fix all the issues that you are facing on https://github.com/libp2p/universal-connectivity/ with rust-libp2p. Mind updating your fork of rust-libp2p to the latest commit on this branch?

Copy link
Contributor

@divagant-martian divagant-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, management of handler state as an enum seems like a great code quality improvement on top, ty @mxinden

@mxinden mxinden dismissed their stale review April 14, 2023 07:05

addressed with latest commits

@mxinden
Copy link
Member

mxinden commented Apr 14, 2023

@Mergifyio refresh

@mergify
Copy link
Contributor

mergify bot commented Apr 14, 2023

refresh

✅ Pull request refreshed

@mergify mergify bot dismissed stale reviews from AgeManning and thomaseizinger April 14, 2023 15:00

Approvals have been dismissed because the PR was updated after the send-it label was applied.

@mergify mergify bot merged commit 96288b8 into libp2p:master Apr 14, 2023
tcoratger pushed a commit to tcoratger/rust-libp2p that referenced this pull request Apr 14, 2023
Previously, we closed the entire connection upon receiving too many upgrade errors. This is unnecessarily aggressive. For example, an upgrade error may be caused by the remote dropping a stream during the initial handshake which is completely isolated from other protocols running on the same connection.

Instead of closing the connection, set `KeepAlive::No`.

Related: libp2p#3591.
Resolves: libp2p#3690.

Pull-Request: libp2p#3625.
@dariusc93
Copy link
Member

@mxinden The changelog from this PR may have to get updated since libp2p-gossipsub 0.44.3 didnt include this change before 0.51.3 patch release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

First outbound stream on WebRTC connection server side times out consistently
7 participants