
protocols/mdns infinite loop? #3390

Open
michalkucharczyk opened this issue Jan 26, 2023 · 12 comments

@michalkucharczyk

michalkucharczyk commented Jan 26, 2023

Summary

There seems to be a busy loop that never terminates when the process is being shut down. I observe the problem in a Polkadot node.

Expected behaviour

The loop should terminate when it encounters an error condition.

Actual behaviour

With RUST_LOG=trace, I see logs flooding the terminal. Parts of the log are here. (However, the problem is not strictly related to this PR; it can also be reproduced by pressing Ctrl-C during startup.)
A single iteration of the loop in the log looks as follows:

2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling Connection    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: reading incoming messages    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_read_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling socket    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::codecs: NetlinkCodec: decoding next message    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::codecs: buffer is empty    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_sys::tokio: poll_recv_from called    
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker netlink_proto::framed: failed to read from netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }    
2023-01-25 12:51:28.215  WARN tokio-runtime-worker netlink_proto::connection: netlink socket stream shut down    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwarding unsolicited messages to the connection handle    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages done    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwaring responses to previous requests to the connection handle    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses done    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: handling requests    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_requests called    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: sending messages    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages done    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_flush called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::framed: flushing frame; length=24    
2023-01-25 12:51:28.215  WARN tokio-runtime-worker netlink_proto::connection: error flushing netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: done polling Connection    
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker libp2p_mdns::behaviour: if watch returned an error: rtnetlink socket closed    

I believe the problematic busy loop is here.
When Tokio is being shut down, rtnetlink keeps producing a stream of errors, and the loop never terminates.
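For illustration, here is a minimal sketch of the pattern that produces the busy loop (hypothetical names such as drive_if_watch and IfEvent, not the actual libp2p-mdns code): each error is only logged and the stream is polled again immediately, so a stream that yields errors forever never lets the loop end.

```rust
use futures::{Stream, StreamExt};
use std::io;

// Hypothetical stand-in for if_watch::IfEvent.
struct IfEvent;

// Sketch only: errors are logged and the stream is polled again right away,
// so a stream that keeps yielding `Err` (as rtnetlink does while Tokio shuts
// down) turns into a busy loop.
async fn drive_if_watch<S>(mut if_watch: S)
where
    S: Stream<Item = io::Result<IfEvent>> + Unpin,
{
    loop {
        match if_watch.next().await {
            Some(Ok(_event)) => { /* update the list of interface addresses */ }
            Some(Err(e)) => {
                eprintln!("if watch returned an error: {e}");
                // No break here: an endless error stream means an endless loop.
                continue;
            }
            // The if-watch stream currently never returns `None`,
            // so this exit is never taken.
            None => return,
        }
    }
}
```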

Version

  • libp2p version (version number, commit, or branch): 0.50.0

Would you like to work on fixing this bug?

Maybe.

@thomaseizinger
Contributor

This sounds like we need to patch if-watch.

From an interface perspective, a stream ends when it returns None. It is not wrong per se for a stream to return Result::Err and continue emitting events later, although abstractions like TryStream assume a stream is dead once it has returned its first error. See https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect for an example.
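For instance, TryStreamExt::try_collect stops consuming a stream at its first error; here is a minimal, standalone illustration (not code from this repository):

```rust
use futures::executor::block_on;
use futures::stream::{self, TryStreamExt};

fn main() {
    // A stream that yields Ok, then Err, then another Ok.
    let s = stream::iter(vec![Ok::<u32, &str>(1), Err("transient"), Ok(2)]);

    // try_collect stops at the first Err and never observes the trailing Ok(2):
    // TryStream-style consumers treat the stream as dead after one error.
    let collected: Result<Vec<u32>, &str> = block_on(s.try_collect());
    assert_eq!(collected, Err("transient"));
}
```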

The implementation, however, never returns None; see https://github.com/mxinden/if-watch/blob/master/src/linux.rs for example. I think this is the bug: for some reason, we return the error here if the underlying connection is_ready but we don't have any messages.

@mxinden Do you know why that was implemented this way? It looks like there are a few subtle things going on there that aren't immediately clear.

@mxinden
Member

mxinden commented Feb 24, 2023

Thanks @michalkucharczyk for reporting.

@mxinden Do you know why that was implemented this way?

I don't unfortunately.

I am not sure what the best next step is.

  • In case all if-watch errors are fatal, I would suggest returning None after the first Error in if-watch and terminating in libp2p-mdns on the first if-watch error (see the sketch after this list).
  • In case not all if-watch errors are fatal, we could at least wake our own waker and yield (Poll::Pending) in libp2p-mdns. Though intuitively, that sounds like a hack.
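For the first option, a minimal sketch of what fusing the stream after an error could look like (an illustrative wrapper with made-up names, not the actual if-watch internals):

```rust
use futures::{ready, Stream};
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

// Wraps an inner stream of io::Result items and terminates it (returns `None`)
// once the inner stream has yielded an error, so consumers such as libp2p-mdns
// can stop polling instead of spinning on an endless error stream.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, T> Stream for FuseOnError<S>
where
    S: Stream<Item = io::Result<T>> + Unpin,
{
    type Item = io::Result<T>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.errored {
            // A previous poll produced an error: report the stream as finished.
            return Poll::Ready(None);
        }
        match ready!(Pin::new(&mut self.inner).poll_next(cx)) {
            Some(Err(e)) => {
                // Emit the error once, then end the stream on the next poll.
                self.errored = true;
                Poll::Ready(Some(Err(e)))
            }
            other => Poll::Ready(other),
        }
    }
}
```

With a wrapper like this, the loop in libp2p-mdns would see None after the first error and could drop the interface watcher instead of polling it forever.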

@cafedetal

Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped and re-added automatically in a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, in a loop, every couple of seconds, but only for local addresses.

@thomaseizinger
Contributor

Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped and re-added automatically in a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, in a loop, every couple of seconds, but only for local addresses.

Would you be up to patching if-watch the way @mxinden described above and deploying that to see whether it helps?

@cafedetal

I haven't done a deep dive into the if-watch code, so I don't yet understand the issue well enough to patch it easily, but I can certainly try. A couple of pointers would be appreciated, but are not essential.

@thomaseizinger
Contributor

I haven't done a deep dive into the if-watch code, so I don't yet understand the issue well enough to patch it easily, but I can certainly try. A couple of pointers would be appreciated, but are not essential.

I think compiling for Android will result in the fallback code being active. As you can see here, this Stream never ends:

https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L73-L86

The only error can be emitted from here: https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L51, which is defined in https://github.com/messense/if-addrs/blob/947c6342681b047b48b5f53eb75049881d2dfa20/src/lib.rs#L143-L212.

We'd have to audit this function to see whether the errors are permanent or not. If they are, we should be returning None from our Stream implementation after the first error.

@cafedetal

@thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but it would have taken me much longer to find it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.

@thomaseizinger
Contributor

@thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but it would have taken me much longer to find it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.

No worries at all, glad we could identify another bug! :)

@bkchr

bkchr commented May 22, 2023

Is there any update on this?

@mxinden
Member

mxinden commented May 22, 2023

I am not aware of any updates. I am fine with either of the two solutions suggested in #3390 (comment). @bkchr are you still seeing this in the wild?

@bkchr

bkchr commented May 22, 2023

Yes, it is very easy to reproduce. I needed to kill my process twice today with kill -9 (we wait 60 seconds for stuck futures before we force-shutdown the node).

  • In case all if-watch errors are fatal, I would suggest returning None after the first Error in if-watch and terminating in libp2p-mdns on the first if-watch error.

@michalkucharczyk maybe something you could do?

@michalkucharczyk
Author

@michalkucharczyk maybe something you could do?

I'll put it on my todo-list.
