
protocols/mdns infinite loop? #3390

Open
michalkucharczyk opened this issue Jan 26, 2023 · 12 comments

@michalkucharczyk

michalkucharczyk commented Jan 26, 2023

Summary

There seems to be a busy loop that never terminates when the process is being shut down. I observe the problem in a Polkadot node.

Expected behaviour

The loop should terminate when it encounters an error condition.

Actual behaviour

With RUST_LOG=trace, I see logs flooding the terminal. Parts of the log are here. (However, the problem is not strictly related to this PR; it can also be reproduced by pressing Ctrl-C during startup.)
A single iteration of the loop in the log looks as follows:

2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling Connection    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: reading incoming messages    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_read_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: polling socket    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::codecs: NetlinkCodec: decoding next message    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::codecs: buffer is empty    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_sys::tokio: poll_recv_from called    
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker netlink_proto::framed: failed to read from netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }    
2023-01-25 12:51:28.215  WARN tokio-runtime-worker netlink_proto::connection: netlink socket stream shut down    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwarding unsolicited messages to the connection handle    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_unsolicited_messages done    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: forwaring responses to previous requests to the connection handle    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: forward_responses done    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: handling requests    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_requests called    
2023-01-25 12:51:28.215 DEBUG tokio-runtime-worker netlink_proto::connection: sending messages    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_send_messages done    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: poll_flush called    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::framed: flushing frame; length=24    
2023-01-25 12:51:28.215  WARN tokio-runtime-worker netlink_proto::connection: error flushing netlink socket: Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }    
2023-01-25 12:51:28.215 TRACE tokio-runtime-worker netlink_proto::connection: done polling Connection    
2023-01-25 12:51:28.215 ERROR tokio-runtime-worker libp2p_mdns::behaviour: if watch returned an error: rtnetlink socket closed    

I believe the problematic busy loop is here.
When Tokio is being shut down, rtnetlink keeps producing a stream of errors, and the loop never terminates.
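For illustration, here is a minimal sketch of the pattern that produces the busy loop (hypothetical names such as drive_if_watch and IfEvent, not the actual libp2p-mdns code): each error is only logged and the stream is polled again immediately, so a stream that yields errors forever never lets the loop end.

```rust
use futures::{Stream, StreamExt};
use std::io;

// Hypothetical stand-in for if_watch::IfEvent.
struct IfEvent;

// Sketch only: errors are logged and the stream is polled again right away,
// so a stream that keeps yielding `Err` (as rtnetlink does while Tokio shuts
// down) turns into a busy loop.
async fn drive_if_watch<S>(mut if_watch: S)
where
    S: Stream<Item = io::Result<IfEvent>> + Unpin,
{
    loop {
        match if_watch.next().await {
            Some(Ok(_event)) => { /* update the list of interface addresses */ }
            Some(Err(e)) => {
                eprintln!("if watch returned an error: {e}");
                // No break here: an endless error stream means an endless loop.
                continue;
            }
            // The if-watch stream currently never returns `None`,
            // so this exit is never taken.
            None => return,
        }
    }
}
```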

Version

  • libp2p version (version number, commit, or branch): 0.50.0

Would you like to work on fixing this bug?

Maybe.

@thomaseizinger
Contributor

This sounds like we need to patch if-watch.

From an interface perspective, a stream ends when it returns None. It is not wrong per se for a stream to return Result::Err and continue emitting events later, although abstractions like TryStream assume a stream is dead once it has returned its first error. See https://docs.rs/futures/latest/futures/stream/trait.TryStreamExt.html#method.try_collect for an example.
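For instance, TryStreamExt::try_collect stops consuming a stream at its first error; here is a minimal, standalone illustration (not code from this repository):

```rust
use futures::executor::block_on;
use futures::stream::{self, TryStreamExt};

fn main() {
    // A stream that yields Ok, then Err, then another Ok.
    let s = stream::iter(vec![Ok::<u32, &str>(1), Err("transient"), Ok(2)]);

    // try_collect stops at the first Err and never observes the trailing Ok(2):
    // TryStream-style consumers treat the stream as dead after one error.
    let collected: Result<Vec<u32>, &str> = block_on(s.try_collect());
    assert_eq!(collected, Err("transient"));
}
```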

The implementation, however, never returns None; see https://github.com/mxinden/if-watch/blob/master/src/linux.rs for example. I think this is the bug: for some reason, we return the error here if the underlying connection is_ready but we don't have any messages.

@mxinden Do you know why that was implemented this way? It looks like there are a few subtle things going on there that aren't immediately clear.

@mxinden
Member

mxinden commented Feb 24, 2023

Thanks @michalkucharczyk for reporting.

@mxinden Do you know why that was implemented this way?

I don't unfortunately.

I am not sure what the best next step is.

  • In case all if-watch errors are fatal, I would suggest returning None after the first Error in if-watch and terminating in libp2p-mdns on the first if-watch error (see the sketch after this list).
  • In case not all if-watch errors are fatal, we could at least wake our own waker and yield (Poll::Pending) in libp2p-mdns. Though intuitively, that sounds like a hack.
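For the first option, a minimal sketch of what fusing the stream after an error could look like (an illustrative wrapper with made-up names, not the actual if-watch internals):

```rust
use futures::{ready, Stream};
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

// Wraps an inner stream of io::Result items and terminates it (returns `None`)
// once the inner stream has yielded an error, so consumers such as libp2p-mdns
// can stop polling instead of spinning on an endless error stream.
struct FuseOnError<S> {
    inner: S,
    errored: bool,
}

impl<S, T> Stream for FuseOnError<S>
where
    S: Stream<Item = io::Result<T>> + Unpin,
{
    type Item = io::Result<T>;

    fn poll_next(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        if self.errored {
            // A previous poll produced an error: report the stream as finished.
            return Poll::Ready(None);
        }
        match ready!(Pin::new(&mut self.inner).poll_next(cx)) {
            Some(Err(e)) => {
                // Emit the error once, then end the stream on the next poll.
                self.errored = true;
                Poll::Ready(Some(Err(e)))
            }
            other => Poll::Ready(other),
        }
    }
}
```

With a wrapper like this, the loop in libp2p-mdns would see None after the first error and could drop the interface watcher instead of polling it forever.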

@cafedetal

Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped and re-added automatically in a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, in a loop, every couple of seconds, but only for local addresses.

@thomaseizinger
Contributor

Could this be in any way related to an issue I'm seeing on Android where interfaces seem to be dropped and re-added automatically in a loop? Basically a SwarmEvent::ExpiredListenAddr followed by a SwarmEvent::NewListenAddr for the same addresses, in a loop, every couple of seconds, but only for local addresses.

Would you be up to patching if-watch the way @mxinden described above and deploying that to see whether it helps?

@cafedetal

I haven't done a deep dive into the if-watch code, so I don't yet understand the issue well enough to patch it easily, but I can certainly try. A couple of pointers would be appreciated, but are not essential.

@thomaseizinger
Contributor

I haven't done a deep dive into the if-watch code, so I don't yet understand the issue well enough to patch it easily, but I can certainly try. A couple of pointers would be appreciated, but are not essential.

I think compiling for Android will result in the fallback code being active. As you can see here, this Stream never ends:

https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L73-L86

The only error can be emitted from here: https://github.com/mxinden/if-watch/blob/31c4780d803431914aab95ec9139aa8e5795f80c/src/fallback.rs#L51, which is defined in https://github.com/messense/if-addrs/blob/947c6342681b047b48b5f53eb75049881d2dfa20/src/lib.rs#L143-L212.

We'd have to audit this function to see whether the errors are permanent or not. If they are, we should be returning None from our Stream implementation after the first error.

@cafedetal

@thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but it would have taken me much longer to find it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.

@thomaseizinger
Contributor

@thomaseizinger turns out the issue I'm seeing on Android is unrelated (mxinden/if-watch#30), but it would have taken me much longer to find it if you hadn't pointed me in the right direction - many thanks for that, and sorry for inadvertently hijacking this issue.

No worries at all, glad we could identify another bug! :)

@bkchr

bkchr commented May 22, 2023

Is there any update on this?

@mxinden
Member

mxinden commented May 22, 2023

I am not aware of any updates. I am fine with either of the two solutions suggested in #3390 (comment). @bkchr are you still seeing this in the wild?

@bkchr

bkchr commented May 22, 2023

Yes, it is very easy to reproduce. I needed to kill my process twice today with kill -9 (we wait 60 seconds for stuck futures before we force-shutdown the node).

  • In case all if-watch errors are fatal, I would suggest returning None after the first Error in if-watch and terminating in libp2p-mdns on the first if-watch error.

@michalkucharczyk maybe something you could do?

@michalkucharczyk
Author

@michalkucharczyk maybe something you could do?

I'll put it on my todo-list.
