
feat(dcutr): keep connection alive while we are using it #3960

Merged (19 commits, Jun 4, 2023)

Conversation

tcoratger (Contributor)

Description

Similar to #3876, we now compute connection_keep_alive based on whether we are still using the connection, applied to the dcutr protocol.

Related: #3844.

Notes & open questions

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

@thomaseizinger (Contributor) left a comment

Thanks for tackling this!

The tests are failing because the handler has nothing to do, right after we established the connection.

We can fix this by passing in "to be done work" in the constructor of the handler. Check the event handler for established connections in the associated behaviour. We are queuing an event there using NotifyHandler (ToSwarm::NotifyHandler { .. }). Instead of passing that event to the handler via NotifyHandler, we should instead pass it in via the constructor of the handler.

That way, it immediately has something to do on startup and returns KeepAlive::Yes. Does that make sense?
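In sketch form (using illustrative stand-in types, not the real libp2p-dcutr API), the suggestion amounts to a handler whose connection_keep_alive is driven by its event queue, and whose constructor can seed that queue so there is work to do immediately after the connection is established:

```rust
use std::collections::VecDeque;

// Illustrative stand-ins for the real libp2p types.
#[derive(Debug, PartialEq)]
enum KeepAlive {
    Yes,
    No,
}

#[derive(Debug)]
enum Command {
    Connect, // simplified: the real command carries observed addresses
}

struct Handler {
    queued_events: VecDeque<Command>,
}

impl Handler {
    // Seeding the queue in the constructor means the handler has work
    // to do immediately on startup, before any NotifyHandler arrives.
    fn new(initial: Option<Command>) -> Self {
        let mut queued_events = VecDeque::new();
        if let Some(cmd) = initial {
            queued_events.push_back(cmd);
        }
        Handler { queued_events }
    }

    // Keep the connection alive for as long as there is pending work.
    fn connection_keep_alive(&self) -> KeepAlive {
        if self.queued_events.is_empty() {
            KeepAlive::No
        } else {
            KeepAlive::Yes
        }
    }
}

fn main() {
    let idle = Handler::new(None);
    assert_eq!(idle.connection_keep_alive(), KeepAlive::No);

    let busy = Handler::new(Some(Command::Connect));
    assert_eq!(busy.connection_keep_alive(), KeepAlive::Yes);
}
```

With this model, a handler created without initial work would be torn down immediately, which is exactly the failure mode the tests were hitting.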

@tcoratger (Contributor, Author)

> Thanks for tackling this!
>
> The tests are failing because the handler has nothing to do, right after we established the connection.
>
> We can fix this by passing in "to be done work" in the constructor of the handler. Check the event handler for established connections in the associated behaviour. We are queuing an event there using NotifyHandler (ToSwarm::NotifyHandler { .. }). Instead of passing that event to the handler via NotifyHandler, we should instead pass it in via the constructor of the handler.
>
> That way, it immediately has something to do on startup and returns KeepAlive::Yes. Does that make sense?

@thomaseizinger Yes, I get your point here; you mean replacing

self.queued_events.extend([
    ToSwarm::NotifyHandler {
        peer_id,
        handler: NotifyHandler::One(connection_id),
        event: Either::Left(handler::relayed::Command::Connect {
            obs_addrs: self.observed_addreses(),
        }),
    },
    ToSwarm::GenerateEvent(Event::InitiatedDirectConnectionUpgrade {
        remote_peer_id: peer_id,
        local_relayed_addr: match connected_point {
            ConnectedPoint::Listener { local_addr, .. } => local_addr.clone(),
            ConnectedPoint::Dialer { .. } => unreachable!("Due to outer if."),
        },
    }),
]);

by something like (roughly):

handler::relayed::Handler::new(connected_point, true);
self.queued_events.extend([
    ToSwarm::GenerateEvent(Event::InitiatedDirectConnectionUpgrade {
        remote_peer_id: peer_id,
        local_relayed_addr: match connected_point {
            ConnectedPoint::Listener { local_addr, .. } => local_addr.clone(),
            ConnectedPoint::Dialer { .. } => unreachable!("Due to outer if."),
        },
    }),
]);

where true is a boolean value representing "to be done work". Then I add this condition to connection_keep_alive to trigger KeepAlive::Yes just after the connection is established, right?

@thomaseizinger (Contributor) commented May 18, 2023

Almost! Instead of adding an extra boolean, have a look at what the ConnectionHandler does when it receives the message from the behaviour:

fn on_behaviour_event(&mut self, event: Self::FromBehaviour) {
    match event {
        Command::Connect { obs_addrs } => {
            self.queued_events
                .push_back(ConnectionHandlerEvent::OutboundSubstreamRequest {
                    protocol: SubstreamProtocol::new(
                        protocol::outbound::Upgrade::new(obs_addrs),
                        (),
                    ),
                });
        }
        Command::AcceptInboundConnect {
            inbound_connect,
            obs_addrs,
        } => {
            if self
                .inbound_connect
                .replace(inbound_connect.accept(obs_addrs).boxed())
                .is_some()
            {
                log::warn!(
                    "New inbound connect stream while still upgrading previous one. \
                     Replacing previous with new.",
                );
            }
        }
        Command::UpgradeFinishedDontKeepAlive => {
            self.keep_alive = KeepAlive::No;
        }
    }
}

It adds an event to queued_events. If this queue is not empty, you are already returning KeepAlive::Yes in this PR! Thus, if you extend the constructor to take in the event itself and directly add it to the queue, all should be fine :)

Plus, this should allow us to remove the Connect message, making the implementation overall simpler!

@thomaseizinger (Contributor)

> Plus, this should allow us to remove the Connect message, making the implementation overall simpler!

Unfortunately we can't because we still need it to trigger more attempts.

@thomaseizinger (Contributor)

We do something similar in the relay itself already:

fn handle_established_inbound_connection(
    &mut self,
    connection_id: ConnectionId,
    peer: PeerId,
    local_addr: &Multiaddr,
    remote_addr: &Multiaddr,
) -> Result<THandler<Self>, ConnectionDenied> {
    if local_addr.is_relayed() {
        return Ok(Either::Right(dummy::ConnectionHandler));
    }

    let mut handler = Handler::new(self.local_peer_id, peer, remote_addr.clone());
    if let Some(event) = self.pending_handler_commands.remove(&connection_id) {
        handler.on_behaviour_event(event)
    }

    Ok(Either::Left(handler))
}

@thomaseizinger (Contributor) commented May 19, 2023

@tcoratger Did the comment above clear things up on how to proceed here?

@tcoratger (Contributor, Author)

> @tcoratger Did the comment above clear things up on how to proceed here?

@thomaseizinger Thank you for your explanations. I understand a little better now, but unfortunately I don't yet have a full picture of how these functions nest together; I need to discover the library in more detail (I'm quite new to this).

  1. I understand what must happen: when connection_keep_alive is called right after the connection is set up, self.queued_events must not be empty, otherwise it returns KeepAlive::No, which would close the connection when it shouldn't.

  2. Since with ToSwarm::NotifyHandler the event is already pushed to the handler, I still wonder why that basic implementation doesn't work: I guess it's because it's just a notification, so the event hasn't yet been pushed into the queue by on_behaviour_event when connection_keep_alive runs; it reads an empty queue and therefore terminates the connection.

  3. Based on this, I understand your approach of integrating the event directly into the constructor of the handler. I propose the following (or similar) approach:

// constructor with default values
let mut handler = handler::relayed::Handler::new(connected_point.clone());

// push the event (another approach could be adopted here)
handler.on_behaviour_event(handler::relayed::Command::Connect {
    obs_addrs: self.observed_addresses(),
});

self.queued_events.extend([
    ToSwarm::GenerateEvent(Event::InitiatedDirectConnectionUpgrade {
        remote_peer_id: peer_id,
        local_relayed_addr: match connected_point {
            ConnectedPoint::Listener { local_addr, .. } => local_addr.clone(),
            ConnectedPoint::Dialer { .. } => unreachable!("Due to outer if."),
        },
    }),
]);

But with this approach, I don't understand how to hand over the handler I've created so that it ends up receiving the events transmitted from the behaviour afterwards. It's not exactly the same as what handle_established_inbound_connection does for the relay (as I read it), because there the handler is returned by the function and then used by the swarm.

In summary, I have trouble seeing how to link the behaviour implementation with what happens in the handler afterwards, probably because I lack experience with the library (that's why I try to pick topics that are not too complicated, to slowly familiarize myself with the codebase :)).
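As a rough model of that missing link (all names here are illustrative stand-ins, not the real swarm signatures): the behaviour does not "push" the handler anywhere. It returns the handler from handle_established_*_connection, and the swarm then uses that very handler for the connection. Feeding it the command before returning has the same effect as NotifyHandler, just earlier:

```rust
use std::collections::VecDeque;

// Illustrative model (not the real libp2p signatures) of how a behaviour
// hands "to be done work" to a handler it constructs itself.
#[derive(Debug, Clone, PartialEq)]
enum Command {
    Connect,
}

struct Handler {
    queued_events: VecDeque<Command>,
}

impl Handler {
    fn new() -> Self {
        Handler {
            queued_events: VecDeque::new(),
        }
    }

    // Same entry point the behaviour would otherwise reach via NotifyHandler.
    fn on_behaviour_event(&mut self, cmd: Command) {
        self.queued_events.push_back(cmd);
    }
}

struct Behaviour;

impl Behaviour {
    // The swarm calls this when a connection is established and uses the
    // *returned* handler for that connection; no separate push is needed.
    fn handle_established_outbound_connection(&mut self) -> Handler {
        let mut handler = Handler::new();
        handler.on_behaviour_event(Command::Connect);
        handler
    }
}

fn main() {
    let mut behaviour = Behaviour;
    let handler = behaviour.handle_established_outbound_connection();
    // The handler starts life with work already queued.
    assert_eq!(handler.queued_events.front(), Some(&Command::Connect));
}
```

Because the swarm owns the handler from the moment it is returned, anything queued in the constructor path is visible to the very first connection_keep_alive call.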

@thomaseizinger (Contributor)

No worries at all, the life-cycles are tricky to understand!

I'll try to push some follow-up patches so you can see what I mean :)

@thomaseizinger (Contributor)

@mxinden I pushed some follow-up commits here. The trickiest bit is that we need to keep the connection alive between the upgrade attempts. However, the ProbeLab data showed that multiple upgrade attempts don't really increase the success rate.

We could simplify this a bit more if we remove the upgrade attempts. What do you think?

@thomaseizinger (Contributor)

> @mxinden I pushed some follow-up commits here. The trickiest bit is that we need to keep the connection alive between the upgrade attempts. However, the ProbeLab data showed that multiple upgrade attempts don't really increase the success rate.
>
> We could simplify this a bit more if we remove the upgrade attempts. What do you think?

Another thing to discuss: Do we really need the events that notify us about a dcutr attempt? Those add code to the communication between handler and behaviour that is otherwise not needed.

@mxinden (Member) commented May 29, 2023

> Another thing to discuss: Do we really need the events that notify us about a dcutr attempt? Those add code to the communication between handler and behaviour that is otherwise not needed.

Initially I added them with the goal of exposing corresponding Prometheus metrics in libp2p-metrics. I think that is still a worthy goal.

@mxinden (Member) commented May 29, 2023

> @mxinden I pushed some follow-up commits here. The trickiest bit is that we need to keep the connection alive between the upgrade attempts. However, the ProbeLab data showed that multiple upgrade attempts don't really increase the success rate.
>
> We could simplify this a bit more if we remove the upgrade attempts. What do you think?

I would expect that we will re-introduce the retry logic if we decide to implement libp2p/specs#487. Thus my preference, unless it adds a significant amount of complexity, is to keep the retry logic.

@mxinden (Member) left a comment

Thank you for the work here.

Shall we land #3982 first?

Review thread on the diff lines:

    remote_peer_id: event_source,
    remote_relayed_addr: remote_addr,

A Member commented:

Why no longer provide the remote_relayed_addr with the RemoteInitiatedDirectConnectionUpgrade event? That would remove the need for explicit state tracking in the NetworkBehaviour implementation and instead store the state close to its source, namely the connection.

A Member commented:

Friendly ping. Am I missing something? This should eliminate the necessity for maintaining the state of Behaviour::remote_relayed_addr, correct?

A Contributor replied:

Yes-ish. The original idea was to not pass data back and forth. The behaviour learns about the address first so it is kind of redundant to pass it to the handler only to then pass it to the behaviour again.

On the flip side, not needing the hashmap removes a few error cases and simplifies the behaviour, so I think it is a good idea.

@tcoratger Mind taking a look at this?

The idea would be to remove the peer_addresses hashmap and instead pass the address in the InboundConnectRequest enum from the handler to the behaviour.
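A minimal sketch of that idea (with a stand-in Multiaddr type, not the real one): the remote address becomes a field of the event the handler emits, so the behaviour reads it from the event itself instead of looking it up in a per-connection hashmap:

```rust
// Stand-in for the real multiaddr type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Multiaddr(String);

// The address travels with the event from handler to behaviour, so the
// behaviour no longer needs to track it per connection.
#[derive(Debug)]
enum HandlerEvent {
    InboundConnectRequest { remote_addr: Multiaddr },
}

fn main() {
    let event = HandlerEvent::InboundConnectRequest {
        remote_addr: Multiaddr("/ip4/203.0.113.1/tcp/4001".into()),
    };

    // The behaviour destructures the event and gets the address directly.
    match event {
        HandlerEvent::InboundConnectRequest { remote_addr } => {
            assert_eq!(remote_addr.0, "/ip4/203.0.113.1/tcp/4001");
        }
    }
}
```

Storing the state in the event rather than a map trades a small amount of per-event payload for the removal of map insert/remove bookkeeping and its failure cases.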

The PR author replied:

@thomaseizinger @mxinden Done. I hope I have understood everything correctly.

As I understand it, the peers_addresses hashmap of the Behaviour structure was used to list peer addresses. It was therefore updated whenever a connection was established or closed.

To replace this and avoid some bug-prone cases, we simplify the behaviour and place the logic in the handler instead. So I started by completely removing peers_addresses, which is no longer useful. Then I added the remote_addr inside InboundConnectRequest to pass the address information from the handler, closer to the source of the connection.

Don't hesitate if there's something tricky that I didn't catch.

@mergify (bot) commented May 31, 2023

This pull request has merge conflicts. Could you please resolve them @tcoratger? 🙏

@mxinden (Member) commented May 31, 2023

The merging of #3982 introduced a tricky merge conflict here. @tcoratger I took a go at resolving it. Mind giving the latest merge commit a review?

@tcoratger (Contributor, Author)

> The merging of #3982 introduced a tricky merge conflict here. @tcoratger I took a go at resolving it. Mind giving the latest merge commit a review?

@mxinden Yes, right. If I understand your merge correctly:

  • You added the handler in handle_established_inbound_connection via its constructor and pushed the corresponding event to queued_events, mimicking the previous pattern inside the match for establishing the connection.
  • You put a direct_connections call in both handle_established_inbound_connection and handle_established_outbound_connection to record direct (non-relayed) connections.

It sounds good to me. As I didn't write the most technical part of this PR, maybe @thomaseizinger can take a look to check that everything is compatible with what you implemented?

@mergify (bot) commented Jun 1, 2023

This pull request has merge conflicts. Could you please resolve them @tcoratger? 🙏

@mxinden (Member) left a comment

Looks good to me overall. Nice to see these simplifications!


@mxinden (Member) left a comment

Looks good to me. Thank you for the follow-ups! @thomaseizinger let us know if you feel strongly about moving away from the const.

@thomaseizinger (Contributor) left a comment

I think it would be cleaner for the const to be private to the behaviour, but this is already a nice improvement overall, so let's get it in!

@mergify mergify bot merged commit a4450d4 into libp2p:master Jun 4, 2023
@tcoratger tcoratger deleted the keepalive-dcutr branch June 4, 2023 12:19