Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

Closed

Conversation

jxs
Copy link
Member

@jxs jxs commented Jan 5, 2023

Description

Addresses #2894

Notes

In the protocols, moved the TimedOut variants from nested UpgradeError's to Event variants themselves. Like identify dcutr and relay.
Per commit review is suggested.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

from ConnectionHandlerUpgrErr
@jxs jxs force-pushed the split-listen-upgrade-error branch from c64f931 to 70ddded Compare January 6, 2023 10:08
instead of ConnectionHandlerUpgrErr, on case of timeout expiration we
just log the error as we don't know which ConnectionHandler should
handle the error.
instead of ConnectionHandlerUpgrErr, and create ConnectionEvent::DialTimeout
for the situations where that upgrading an outbound substream to the given protocol has expired its timeout.
@jxs jxs force-pushed the split-listen-upgrade-error branch from 70ddded to c56bcc0 Compare January 6, 2023 10:11
@jxs jxs force-pushed the split-listen-upgrade-error branch from c56bcc0 to fd29912 Compare January 6, 2023 10:18
@jxs jxs force-pushed the split-listen-upgrade-error branch from fd29912 to d3291d7 Compare January 6, 2023 10:26
and DialUpgradeError updates.

gossipsub/handler: Move dial upgrade error logic from poll to
on_dial_upgrade_error and rename pending_errors to pending_events, so
that handler flow becomes more similar to other protocols.
 ListenUpgradeError and DialUpgradeError updates.
@jxs jxs force-pushed the split-listen-upgrade-error branch from d3291d7 to 021502c Compare January 6, 2023 10:59
@jxs jxs changed the title refactor(swarm): refactor ListenUpgradeError and DialUpgradeError refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError Jan 7, 2023
Copy link
Contributor

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the simplifications coming from this!

I don't quite agree with reporting a kind of error (timeout) next to other errors. I think that should be unified or not reported at all. I am tending towards the latter for a lot of protocols where we can make a decision on what that timeout means.

swarm/CHANGELOG.md Outdated Show resolved Hide resolved
continue;
}
ConnectionHandlerUpgrErr::Timeout => {
log::debug!("Timeout expired during an inbound substream negotiation")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log::debug!("Timeout expired during an inbound substream negotiation")
log::debug!("Failed to negotiate protocol on inbound stream within {seconds}s")

That would be more useful IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconds doesn't seem to exist in this scope, we probably can make ConnectionHandlerUpgrErr::Timeout have it . I will implement after clarifying #3307 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seconds doesn't seem to exist in this scope, we probably can make ConnectionHandlerUpgrErr::Timeout have it . I will implement after clarifying #3307 (comment)

Yes, it was only meant to be illustrative on what I'd like to have printed!

Comment on lines 208 to +211
/// Informs the handler that upgrading an outbound substream to the given protocol has failed.
DialUpgradeError(DialUpgradeError<OOI, OP>),
/// Informs the handler that upgrading an outbound substream to the given protocol has expired its timeout.
DialTimeout(DialTimeout<OOI>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense for DialUpgradeError to be an enum with a Timeout and Upgrade variant?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3268 is related to this. If we agree to remove the OpenInfo from ConnectionHandler a lot of this code would be simpler because we wouldn't need to track info here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first sight, making DialUpgradeError an enum with a Timeout and Upgrade variant sounded similar to what we currently have, where ConnectionHandlerUpgrErr then has the variants :

pub struct DialUpgradeError<OOI, OP: OutboundUpgradeSend> {
pub info: OOI,
pub error: ConnectionHandlerUpgrErr<OP::Error>,
}

But what you are suggesting seems to be in fact renaming ConnectionHandlerUpgradeError to DialUpgradeError and using it in ConnectionEvent::DialUpgradeError right?

#[derive(Debug)]
pub enum ConnectionHandlerUpgrErr<TUpgrErr> {
/// The opening attempt timed out before the negotiation was fully completed.
Timeout,
/// Error while upgrading the substream to the protocol we want.
Upgrade(UpgradeError<TUpgrErr>),
}

In fact it probably makes sense (we are in fact reducing one indirection layer) even more sense than having Timeout as root variants where I agree with you that they are errors by themselves. Do you also agree @mxinden?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Off the top of my head this sounds good to me. I will have to give this more thought with a clear mind to give a good answer.

swarm/src/lib.rs Show resolved Hide resolved
}

#[derive(Debug, Error)]
pub enum Error {
#[error("Failed to dial peer.")]
Dial,
#[error("Failed to establish substream: {0}.")]
Handler(ConnectionHandlerUpgrErr<void::Void>),
Handler(UpgradeError<void::Void>),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we emitted an error (and thus closed the connection) when we ran into a timeout. Why are we changing this behaviour now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the behaviour the same? In the sense that the connection was closed because of a timeout and we are bubbling it up? But after reading and replying on #3307 (comment) I now see that Timeout makes sense being a variant on the Error itself of DirectConnectionUpgradeFailed rather than a variant of Event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the behaviour the same? In the sense that the connection was closed because of a timeout and we are bubbling it up?

Unless I am mistaken, the timeout refers to a stream upgrade. The connection can/should still be intact at this stage. Something like PendingUpgrade should trigger this timeout on the remote but the connection underneath works perfectly fine.

It does raise the question whether we should close an entire connection at all just because one protocol failed to upgrade a stream. Just disabling that protocol (and handler) and letting the collaborative keep-alive eventually close the connection seems like a better idea at first sight.

We may want to do this in a separate PR though, either before or after we merge this.

Comment on lines 468 to 469
/// Timeout expired while attempting to identify the remote.
Timeout,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A timeout is an error so it seems odd that we have an Error variant and then this. I'd suggest we either log this internally and don't report it or create a dedicated error enum ourselves.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are already reporting it up I'd go as we do in dcutr and expose it as an Error, updated to do that! Does it make sense to you to also do the same for relay?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to you to also do the same for relay?

Yes I think we should apply that thinking to all protocols.

Copy link
Member Author

@jxs jxs Jan 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks done, ptal! Also removed what seems to be unrequired inbound_hop::UpgradeError and shortcircuit to inbound_hop::FatalUpgradeError see c2eda37 and see if you agree.

instead of as a variant of Event.
and Timeout as part of Error instead of as a variant of Event.
and Timeout as part of Error instead of as a variant of behaviour::Event.
Introduce `client::InboundError` and Timeout as part of Error instead of as a variant of client::Event.
@jxs jxs force-pushed the split-listen-upgrade-error branch from d4672d3 to c2eda37 Compare January 10, 2023 12:36
@thomaseizinger
Copy link
Contributor

Sorry for only mentioning this only now but it just came to me that once we have #2863, all of these errors will go away because there won't be any "upgrade" mechanism anymore for protocols defined through ConnectionHandlers.

Is it worth doing this refactoring here at all or should we just work towards #2863 instead? Personally, that would be my preference but it is also more work.

Thoughts?

@jxs
Copy link
Member Author

jxs commented Jan 10, 2023

Sorry for only mentioning this only now but it just came to me that once we have #2863, all of these errors will go away because there won't be any "upgrade" mechanism anymore for protocols defined through ConnectionHandlers.

Is it worth doing this refactoring here at all or should we just work towards #2863 instead? Personally, that would be my preference but it is also more work.

Thoughts?

No worries Thomas. Good question, no strong opinion though I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

@thomaseizinger
Copy link
Contributor

thomaseizinger commented Jan 11, 2023

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right?

Yeah I thought so too which is why I brought it up. I updated the issue with a few more thoughts on how we could move forward there.

@thomaseizinger
Copy link
Contributor

Whilst hacking on #3201, I got very confused in how the error handling for connection and upgrade errors is implemented across our protocols.

At the highest level, we have ConnectionHandlerUpgrErr which can be emitted for two different cases:

  • Inbound streams
  • Outbound streams

Confusingly enough, we call those ListenUpgradeError and DialUpgradeError. Usually we only use the listen and dial terminology for connections but not for streams.

ConnectionHandlerUpgrErr has three variants:

/// Error that can happen on an outbound substream opening attempt.
#[derive(Debug)]
pub enum ConnectionHandlerUpgrErr<TUpgrErr> {
/// The opening attempt timed out before the negotiation was fully completed.
Timeout,
/// There was an error in the timer used.
Timer,
/// Error while upgrading the substream to the protocol we want.
Upgrade(UpgradeError<TUpgrErr>),
}

As you @jxs discovered in here, the Timer variant is not actually used.

The Timeout variant will go away once we ship #2863 because we won't be upgrading streams anymore.

What is left here is UpgradeError:

/// Error that can happen when upgrading a connection or substream to use a protocol.
#[derive(Debug)]
pub enum UpgradeError<E> {
/// Error during the negotiation process.
Select(NegotiationError),
/// Error during the post-negotiation handshake.
Apply(E),
}

Once we ship #2863, the Apply variant here should also go away because we will directly hand the stream to the user and not perform any processing on it.

Meaning the only error that can actually happen after #2863 is NegotiationError. This one is a also a bit odd because it combines the Failed variant with ProtocolError.

  • ProtocolError is fatal, it means some really bad error happened during our handshake
  • Failed means we couldn't agree on any protocol. This one is not necessarily fatal, it just means the remote does not support the protocol we requested.

I am drawing three conclusions from this:

  • We should prioritize Simplify ConnectionHandler trait by removing as many associated types as possible #2863 as it will make the entire error handling situation much clearer.
  • We are currently mixing fatal and non-fatal errors. Most of our protocols close the entire connection on any error. That is not very nice. A protocol should simply disable itself if the remote doesn't support it. I think the best way to achieve this is by splitting the "not supported" part into a separate error that users can just ignore for the most part.
  • Fatal errors should perhaps close the connection unconditionally? If we fail to negotiate protocols with an IO error or something similar, there isn't much a user can do about that. Why report it to the handlers?

@mxinden
Copy link
Member

mxinden commented Jan 18, 2023

Agreed with everything in the latest comment from @thomaseizinger. Thanks for looking so deeply into this.

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

I don't have an opinion on how to proceed, e.g. whether to first finish here, do #2863, or clean-up fatal and non-fatal errors. Up to you @jxs and @thomaseizinger.

@thomaseizinger
Copy link
Contributor

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

I don't have an opinion on how to proceed, e.g. whether to first finish here, do #2863, or clean-up fatal and non-fatal errors. Up to you @jxs and @thomaseizinger.

Any work that pushes #2863 forward would be much appreciated from my end!

@thomaseizinger thomaseizinger deleted the branch libp2p:3264-remove-inject-calls April 26, 2023 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants