refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

jxs · 2023-01-05T18:41:53Z

Description

Addresses #2894

Notes

In the protocols, moved the TimedOut variants from nested UpgradeError's to Event variants themselves. Like identify dcutr and relay.
Per commit review is suggested.

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

from ConnectionHandlerUpgrErr

instead of ConnectionHandlerUpgrErr, on case of timeout expiration we just log the error as we don't know which ConnectionHandler should handle the error.

instead of ConnectionHandlerUpgrErr, and create ConnectionEvent::DialTimeout for the situations where that upgrading an outbound substream to the given protocol has expired its timeout.

and DialUpgradeError updates

and DialUpgradeError updates. gossipsub/handler: Move dial upgrade error logic from poll to on_dial_upgrade_error and rename pending_errors to pending_events, so that handler flow becomes more similar to other protocols.

and DialUpgradeError updates.

ListenUpgradeError and DialUpgradeError updates.

thomaseizinger

I love the simplifications coming from this!

I don't quite agree with reporting a kind of error (timeout) next to other errors. I think that should be unified or not reported at all. I am tending towards the latter for a lot of protocols where we can make a decision on what that timeout means.

swarm/CHANGELOG.md

thomaseizinger · 2023-01-09T03:27:32Z

swarm/src/connection.rs

+                        continue;
+                    }
+                    ConnectionHandlerUpgrErr::Timeout => {
+                        log::debug!("Timeout expired during an inbound substream negotiation")


Suggested change

log::debug!("Timeout expired during an inbound substream negotiation")

log::debug!("Failed to negotiate protocol on inbound stream within {seconds}s")

That would be more useful IMO.

seconds doesn't seem to exist in this scope, we probably can make ConnectionHandlerUpgrErr::Timeout have it . I will implement after clarifying #3307 (comment)

seconds doesn't seem to exist in this scope, we probably can make ConnectionHandlerUpgrErr::Timeout have it . I will implement after clarifying #3307 (comment)

Yes, it was only meant to be illustrative on what I'd like to have printed!

thomaseizinger · 2023-01-09T03:31:43Z

swarm/src/handler.rs

    /// Informs the handler that upgrading an outbound substream to the given protocol has failed.
    DialUpgradeError(DialUpgradeError<OOI, OP>),
+    /// Informs the handler that upgrading an outbound substream to the given protocol has expired its timeout.
+    DialTimeout(DialTimeout<OOI>),


Would it make sense for DialUpgradeError to be an enum with a Timeout and Upgrade variant?

#3268 is related to this. If we agree to remove the OpenInfo from ConnectionHandler a lot of this code would be simpler because we wouldn't need to track info here.

At first sight, making DialUpgradeError an enum with a Timeout and Upgrade variant sounded similar to what we currently have, where ConnectionHandlerUpgrErr then has the variants :

rust-libp2p/swarm/src/handler.rs

Lines 340 to 343 in 5b8c3d2

pub struct DialUpgradeError<OOI, OP: OutboundUpgradeSend> {

pub info: OOI,

pub error: ConnectionHandlerUpgrErr<OP::Error>,

}

But what you are suggesting seems to be in fact renaming ConnectionHandlerUpgradeError to DialUpgradeError and using it in ConnectionEvent::DialUpgradeError right?

rust-libp2p/swarm/src/handler.rs

Lines 442 to 448 in 021502c

#[derive(Debug)]

pub enum ConnectionHandlerUpgrErr<TUpgrErr> {

/// The opening attempt timed out before the negotiation was fully completed.

Timeout,

/// Error while upgrading the substream to the protocol we want.

Upgrade(UpgradeError<TUpgrErr>),

}

In fact it probably makes sense (we are in fact reducing one indirection layer) even more sense than having Timeout as root variants where I agree with you that they are errors by themselves. Do you also agree @mxinden?

Off the top of my head this sounds good to me. I will have to give this more thought with a clear mind to give a good answer.

swarm/src/lib.rs

thomaseizinger · 2023-01-09T03:36:45Z

protocols/dcutr/src/behaviour_impl.rs

 }

 #[derive(Debug, Error)]
 pub enum Error {
    #[error("Failed to dial peer.")]
    Dial,
    #[error("Failed to establish substream: {0}.")]
-    Handler(ConnectionHandlerUpgrErr<void::Void>),
+    Handler(UpgradeError<void::Void>),


Previously, we emitted an error (and thus closed the connection) when we ran into a timeout. Why are we changing this behaviour now?

Isn't the behaviour the same? In the sense that the connection was closed because of a timeout and we are bubbling it up? But after reading and replying on #3307 (comment) I now see that Timeout makes sense being a variant on the Error itself of DirectConnectionUpgradeFailed rather than a variant of Event.

Isn't the behaviour the same? In the sense that the connection was closed because of a timeout and we are bubbling it up?

Unless I am mistaken, the timeout refers to a stream upgrade. The connection can/should still be intact at this stage. Something like PendingUpgrade should trigger this timeout on the remote but the connection underneath works perfectly fine.

It does raise the question whether we should close an entire connection at all just because one protocol failed to upgrade a stream. Just disabling that protocol (and handler) and letting the collaborative keep-alive eventually close the connection seems like a better idea at first sight.

We may want to do this in a separate PR though, either before or after we merge this.

thomaseizinger · 2023-01-09T03:40:28Z

protocols/identify/src/behaviour.rs

+    /// Timeout expired while attempting to identify the remote.
+    Timeout,


A timeout is an error so it seems odd that we have an Error variant and then this. I'd suggest we either log this internally and don't report it or create a dedicated error enum ourselves.

Since we are already reporting it up I'd go as we do in dcutr and expose it as an Error, updated to do that! Does it make sense to you to also do the same for relay?

Does it make sense to you to also do the same for relay?

Yes I think we should apply that thinking to all protocols.

thanks done, ptal! Also removed what seems to be unrequired inbound_hop::UpgradeError and shortcircuit to inbound_hop::FatalUpgradeError see c2eda37 and see if you agree.

instead of as a variant of Event.

and Timeout as part of Error instead of as a variant of Event.

and Timeout as part of Error instead of as a variant of behaviour::Event. Introduce `client::InboundError` and Timeout as part of Error instead of as a variant of client::Event.

thomaseizinger · 2023-01-10T13:28:20Z

Sorry for only mentioning this only now but it just came to me that once we have #2863, all of these errors will go away because there won't be any "upgrade" mechanism anymore for protocols defined through ConnectionHandlers.

Is it worth doing this refactoring here at all or should we just work towards #2863 instead? Personally, that would be my preference but it is also more work.

Thoughts?

jxs · 2023-01-10T19:12:21Z

Sorry for only mentioning this only now but it just came to me that once we have #2863, all of these errors will go away because there won't be any "upgrade" mechanism anymore for protocols defined through ConnectionHandlers.

Is it worth doing this refactoring here at all or should we just work towards #2863 instead? Personally, that would be my preference but it is also more work.

Thoughts?

No worries Thomas. Good question, no strong opinion though I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

thomaseizinger · 2023-01-11T03:12:18Z

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right?

Yeah I thought so too which is why I brought it up. I updated the issue with a few more thoughts on how we could move forward there.

thomaseizinger · 2023-01-17T06:15:50Z

Whilst hacking on #3201, I got very confused in how the error handling for connection and upgrade errors is implemented across our protocols.

At the highest level, we have ConnectionHandlerUpgrErr which can be emitted for two different cases:

Inbound streams
Outbound streams

Confusingly enough, we call those ListenUpgradeError and DialUpgradeError. Usually we only use the listen and dial terminology for connections but not for streams.

ConnectionHandlerUpgrErr has three variants:

rust-libp2p/swarm/src/handler.rs

Lines 433 to 442 in e3c7023

    
           /// Error that can happen on an outbound substream opening attempt. 
        
           #[derive(Debug)] 
        
           pub enum ConnectionHandlerUpgrErr<TUpgrErr> { 
        
               /// The opening attempt timed out before the negotiation was fully completed. 
        
               Timeout, 
        
               /// There was an error in the timer used. 
        
               Timer, 
        
               /// Error while upgrading the substream to the protocol we want. 
        
               Upgrade(UpgradeError<TUpgrErr>), 
        
           }

As you @jxs discovered in here, the Timer variant is not actually used.

The Timeout variant will go away once we ship #2863 because we won't be upgrading streams anymore.

What is left here is UpgradeError:

rust-libp2p/core/src/upgrade/error.rs

Lines 24 to 31 in e3c7023

    
           /// Error that can happen when upgrading a connection or substream to use a protocol. 
        
           #[derive(Debug)] 
        
           pub enum UpgradeError<E> { 
        
               /// Error during the negotiation process. 
        
               Select(NegotiationError), 
        
               /// Error during the post-negotiation handshake. 
        
               Apply(E), 
        
           }

Once we ship #2863, the Apply variant here should also go away because we will directly hand the stream to the user and not perform any processing on it.

Meaning the only error that can actually happen after #2863 is NegotiationError. This one is a also a bit odd because it combines the Failed variant with ProtocolError.

ProtocolError is fatal, it means some really bad error happened during our handshake
Failed means we couldn't agree on any protocol. This one is not necessarily fatal, it just means the remote does not support the protocol we requested.

I am drawing three conclusions from this:

We should prioritize Simplify ConnectionHandler trait by removing as many associated types as possible #2863 as it will make the entire error handling situation much clearer.
We are currently mixing fatal and non-fatal errors. Most of our protocols close the entire connection on any error. That is not very nice. A protocol should simply disable itself if the remote doesn't support it. I think the best way to achieve this is by splitting the "not supported" part into a separate error that users can just ignore for the most part.
Fatal errors should perhaps close the connection unconditionally? If we fail to negotiate protocols with an IO error or something similar, there isn't much a user can do about that. Why report it to the handlers?

mxinden · 2023-01-18T16:40:50Z

Agreed with everything in the latest comment from @thomaseizinger. Thanks for looking so deeply into this.

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

I don't have an opinion on how to proceed, e.g. whether to first finish here, do #2863, or clean-up fatal and non-fatal errors. Up to you @jxs and @thomaseizinger.

thomaseizinger · 2023-01-18T21:09:56Z

I think it probably doesn't makes sense to introduce this and release it, make libp2p users upgrade to this to then upgrade again with #2863 right? So I'd rather just jump to #2863 CC @mxinden wdyt?

I don't have an opinion on how to proceed, e.g. whether to first finish here, do #2863, or clean-up fatal and non-fatal errors. Up to you @jxs and @thomaseizinger.

Any work that pushes #2863 forward would be much appreciated from my end!

jxs requested review from mxinden and thomaseizinger January 5, 2023 18:41

This was referenced Jan 5, 2023

swarm/handler: Split inject_listen_upgrade_error into two #2894

Closed

refactor(swarm): remove deprecated inject calls #3264

Merged

swarm: remove unconstructed Timer variant

e63d5c7

from ConnectionHandlerUpgrErr

jxs force-pushed the split-listen-upgrade-error branch from c64f931 to 70ddded Compare January 6, 2023 10:08

jxs added 2 commits January 6, 2023 10:10

swarm: make ListenUpgradeError::error an UpgradeError

fbc6b82

instead of ConnectionHandlerUpgrErr, on case of timeout expiration we just log the error as we don't know which ConnectionHandler should handle the error.

swarm: make DialUpgradeError::error an UpgradeError

11db523

instead of ConnectionHandlerUpgrErr, and create ConnectionEvent::DialTimeout for the situations where that upgrading an outbound substream to the given protocol has expired its timeout.

jxs force-pushed the split-listen-upgrade-error branch from 70ddded to c56bcc0 Compare January 6, 2023 10:11

jxs added 4 commits January 6, 2023 10:13

dcutr: update to libp2p-swarm ListenUpgradeError

a793fb0

and DialUpgradeError updates

identify: update to libp2p-swarm ListenUpgradeError

05c5fd0

and DialUpgradeError updates

relay: update to libp2p-swarm ListenUpgradeError

6758ccd

and DialUpgradeError updates

metrics: update to libp2p-swarm ListenUpgradeError

24d4c9d

and DialUpgradeError updates

jxs force-pushed the split-listen-upgrade-error branch from c56bcc0 to fd29912 Compare January 6, 2023 10:18

jxs added 3 commits January 6, 2023 10:20

kad: update to libp2p-swarm ListenUpgradeError

fb51e22

and DialUpgradeError updates

rendezvous: update to libp2p-swarm ListenUpgradeError

a03e26c

and DialUpgradeError updates

ping: update to libp2p-swarm ListenUpgradeError

a403407

and DialUpgradeError updates

jxs force-pushed the split-listen-upgrade-error branch from fd29912 to d3291d7 Compare January 6, 2023 10:26

jxs added 3 commits January 6, 2023 10:59

gossipsub: update to libp2p-swarm ListenUpgradeError

bc2e6bb

and DialUpgradeError updates. gossipsub/handler: Move dial upgrade error logic from poll to on_dial_upgrade_error and rename pending_errors to pending_events, so that handler flow becomes more similar to other protocols.

request-response: update to libp2p-swarm ListenUpgradeError

4a0e5ee

and DialUpgradeError updates.

examples: update file-sharing example to libp2p-swarm

021502c

ListenUpgradeError and DialUpgradeError updates.

jxs force-pushed the split-listen-upgrade-error branch from d3291d7 to 021502c Compare January 6, 2023 10:59

jxs changed the title ~~refactor(swarm): refactor ListenUpgradeError and DialUpgradeError~~ refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError Jan 7, 2023

thomaseizinger reviewed Jan 9, 2023

View reviewed changes

jxs added 5 commits January 9, 2023 18:07

dcutr: report Timeout as a variant of Error

ec1cb53

instead of as a variant of Event.

identify: introduce Error enum

1bc39e2

and Timeout as part of Error instead of as a variant of Event.

metrics: fix removal of identify Event::Timeout

d70280a

relay: introduce behaviour::OutboundError enum

3a0db97

and Timeout as part of Error instead of as a variant of behaviour::Event. Introduce `client::InboundError` and Timeout as part of Error instead of as a variant of client::Event.

relay: remove unneeded inbound_hop::UpgradeError.

c2eda37

jxs force-pushed the split-listen-upgrade-error branch from d4672d3 to c2eda37 Compare January 10, 2023 12:36

thomaseizinger mentioned this pull request Jan 17, 2023

feat(swarm)!: Remove ConnectionHandler::Error in favor of CloseReason #3201

Closed

4 tasks

thomaseizinger deleted the branch libp2p:3264-remove-inject-calls April 26, 2023 16:39

thomaseizinger closed this Apr 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

jxs commented Jan 5, 2023 •

edited

Loading

thomaseizinger left a comment

thomaseizinger Jan 9, 2023

jxs Jan 9, 2023

thomaseizinger Jan 9, 2023

thomaseizinger Jan 9, 2023

thomaseizinger Jan 9, 2023

jxs Jan 9, 2023

mxinden Jan 9, 2023

thomaseizinger Jan 9, 2023

jxs Jan 9, 2023

thomaseizinger Jan 9, 2023

thomaseizinger Jan 9, 2023

jxs Jan 9, 2023

thomaseizinger Jan 9, 2023

jxs Jan 10, 2023 •

edited

Loading

thomaseizinger commented Jan 10, 2023

jxs commented Jan 10, 2023

thomaseizinger commented Jan 11, 2023 •

edited

Loading

thomaseizinger commented Jan 17, 2023

mxinden commented Jan 18, 2023

thomaseizinger commented Jan 18, 2023

	log::debug!("Timeout expired during an inbound substream negotiation")
	log::debug!("Failed to negotiate protocol on inbound stream within {seconds}s")

	pub struct DialUpgradeError<OOI, OP: OutboundUpgradeSend> {
	pub info: OOI,
	pub error: ConnectionHandlerUpgrErr<OP::Error>,
	}

	#[derive(Debug)]
	pub enum ConnectionHandlerUpgrErr<TUpgrErr> {
	/// The opening attempt timed out before the negotiation was fully completed.
	Timeout,
	/// Error while upgrading the substream to the protocol we want.
	Upgrade(UpgradeError<TUpgrErr>),
	}

		/// Timeout expired while attempting to identify the remote.
		Timeout,

refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

refactor(swarm)!: refactor ListenUpgradeError and DialUpgradeError #3307

Conversation

jxs commented Jan 5, 2023 • edited Loading

Description

Notes

Change checklist

thomaseizinger left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jxs Jan 10, 2023 • edited Loading

Choose a reason for hiding this comment

thomaseizinger commented Jan 10, 2023

jxs commented Jan 10, 2023

thomaseizinger commented Jan 11, 2023 • edited Loading

thomaseizinger commented Jan 17, 2023

mxinden commented Jan 18, 2023

thomaseizinger commented Jan 18, 2023

jxs commented Jan 5, 2023 •

edited

Loading

jxs Jan 10, 2023 •

edited

Loading

thomaseizinger commented Jan 11, 2023 •

edited

Loading