Task pool connection is not dropped in case of error #2661

Closed

stormshield-pj50 opened this issue May 20, 2022 · 13 comments

@stormshield-pj50
Contributor

Summary

In swarm/src/connection/pool/task.rs:240, the closing_muxer future is never awaited, so the connection is never dropped and is ultimately leaked.

Expected behaviour

The closing_muxer future should be awaited.

Actual behaviour

The closing_muxer future is not awaited. However, the bug is not triggered with the usual transport layers; I am able to reproduce it with a UDP transport layer I am currently developing.

Possible Solution

let error = closing_muxer.await.err().map(ConnectionError::IO);
...

But I don't really know which error should be passed to EstablishedConnectionEvent::Closed below.
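
To make the idea concrete, here is a minimal, self-contained sketch of the proposed pattern, awaiting the close future and mapping its error, using hypothetical stand-in types rather than the actual task.rs code:

use futures::executor::block_on;

// Hypothetical stand-in for swarm's ConnectionError::IO variant.
#[derive(Debug)]
enum ConnectionError {
    IO(std::io::Error),
}

// Hypothetical stand-in for the closing future returned by the connection.
async fn closing_muxer() -> Result<(), std::io::Error> {
    // A real muxer would flush and close the underlying transport here.
    Ok(())
}

fn main() {
    // Proposed change: drive the close future to completion and keep its
    // error (if any) instead of dropping the future unawaited.
    let error = block_on(closing_muxer()).err().map(ConnectionError::IO);
    println!("connection closed, error: {:?}", error);
}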

Version

Master

Would you like to work on fixing this bug?

The bug report should be enough.

@mxinden
Member

mxinden commented May 23, 2022

In swarm/src/connection/pool/task.rs:240 the call to the future closing_muxer is missing,

If I recall correctly, the reasoning why _closing_muxer is not awaited is that an error already happened on the connection, so there is no use in properly closing it. That said, I do need to do more research in order to tell whether trying to close it anyway does any harm.

thus the connection is never dropped and finally leaked.

What do you mean by leaked? The task is terminating given the return statement.

The connection should be owned by the task and thus dropped as well. Am I missing something @Pj50?

@stormshield-pj50
Contributor Author

If you don't close the connection, how can a transport layer be notified in case of an error? A TCP connection will eventually be closed on timeout or by the kernel, but what if the underlying transport does not support that (in our case, UDP)? You're right that the connection should be dropped; however, in our UDP transport layer the drop on the connection is never called. We don't understand why and are still investigating.

@mxinden
Member

mxinden commented May 29, 2022

If you don't close the connection how can a transport layer be notified in case of error ?

In this case it is the transport layer notifying the muxer that an error occurred, right? If one ignores the layers, there would be no need to notify the notifier, right?

however in our UDP transport layer the drop on the connection is never called, we don't understand why and are still investigating.

Thank you!

@stormshield-pj50
Contributor Author

Tested with the QUIC v4 attempt. In case of a connection error the muxer is not awaited, so the poll_close function of the StreamMuxer trait on QuicMuxer is never called. Isn't that an issue? The QUIC connection is not closed by libp2p, but will eventually be closed later by Quinn on a timeout.
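
To illustrate why this matters, here is a self-contained sketch (not libp2p code): a Rust future that is dropped instead of awaited is never polled, so any cleanup it would perform, such as driving a muxer's poll_close, simply never runs:

use futures::executor::block_on;

// Stand-in for a muxer close future: prints when its cleanup actually runs.
async fn close_connection() {
    // In a real muxer this is where poll_close would tell the transport
    // to shut the connection down.
    println!("cleanup ran");
}

fn main() {
    // Dropped without being awaited: the body never executes,
    // so "cleanup ran" is not printed.
    let unawaited = close_connection();
    drop(unawaited);

    // Awaited: the cleanup runs and "cleanup ran" is printed once.
    block_on(close_connection());
}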

@mxinden
Member

mxinden commented Oct 3, 2022

Thanks for debugging. @elenaf9 can you expand on how this is handled in #2289?

@elenaf9
Contributor

elenaf9 commented Oct 4, 2022

Thanks for debugging. @elenaf9 can you expand on how this is handled in #2289?

Thanks for the ping. Until now it was also handled by a timeout, i.e. the endpoint was not able to forward it to the connection and thus no ack was sent to the remote, eventually leading to a timeout. But I have changed it now so that when the connection is dropped, the endpoint directly gets a Drained event injected, which leads to a reset of the connection. See 1a7b0ff and 4d0d143.
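
For illustration only, with hypothetical types rather than the actual quic-v3 code: the general pattern is that the connection handle's Drop impl notifies the still-running endpoint task, which can then reset the connection immediately instead of waiting for a timeout:

use std::sync::mpsc;

// Hypothetical message from a connection handle to the endpoint driver.
enum ToEndpoint {
    // The handle was dropped; the endpoint can drain its state and
    // reset the connection towards the remote.
    ConnectionDrained { connection_id: usize },
}

struct ConnectionHandle {
    id: usize,
    to_endpoint: mpsc::Sender<ToEndpoint>,
}

impl Drop for ConnectionHandle {
    fn drop(&mut self) {
        // Even if no close future was awaited, dropping the handle still
        // informs the endpoint, which keeps running and can send a reset.
        let _ = self
            .to_endpoint
            .send(ToEndpoint::ConnectionDrained { connection_id: self.id });
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    drop(ConnectionHandle { id: 1, to_endpoint: tx });

    // The endpoint task observes the drained notification.
    if let Ok(ToEndpoint::ConnectionDrained { connection_id }) = rx.recv() {
        println!("resetting connection {connection_id}");
    }
}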

@mxinden
Member

mxinden commented Oct 5, 2022

@elenaf9 thanks for the quick and detailed response.

@Pj50 do you have some time to validate the above patches with #2289?

@stormshield-pj50
Contributor Author

@elenaf9 sorry, but the patches don't apply on either the v3 or the v4 branch.

@elenaf9
Contributor

elenaf9 commented Oct 10, 2022

Thanks for testing @Pj50! Could you please share the code with which you are testing, and the debug output that shows this?

@stormshield-pj50
Contributor Author

@elenaf9 I cannot share the code I am working on, but you can reproduce the issue with two instances of libp2p using QUIC, one as a listener and the other as a dialer. I use ping + identify as the behaviour. Just kill the dialer, and then you will see that the connection is not closed on the swarm side, but the connection times out in Quinn. The following log should trigger on timeout:

2022-10-10 11:09:40 DEBUG libp2p_swarm                     - Connection closed with error IO(Custom { kind: Other, error: Quinn(TimedOut) }): Connected { endpoint: Listener { local_addr: "/ip4/127.0.0.1/udp/4001/quic", send_back_addr: 

@elenaf9
Contributor

elenaf9 commented Oct 10, 2022

Just kill the dialer, and then you will see that the connection is not closed on the swarm side, but the connection times out in Quinn.

There might be a misunderstanding here. What is implemented (in quic-v3) is that if a connection is dropped on one side, we send a RESET frame to the remote for that connection. That's possible because our QUIC endpoint is still polled after an individual connection was dropped and can continue sending packets to the remote (e.g. the RESET for the dropped connection) and handle / ack incoming ones. If the whole dialer is killed, then not only the connection but also the endpoint is dropped, so we cannot send a RESET to the remote or ACK any packets anymore. This then results in a timeout at the remote, which imo makes sense.
If I understood your original question correctly, it was about individual connections being dropped while the application itself is still running. Am I missing something?

@thomaseizinger
Contributor

Related: #1682

@thomaseizinger
Contributor

I think this is now fixed with the new connection management.
