Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need help debugging potential libp2p issue #3205

Closed
vivekvpandya opened this issue Dec 7, 2022 · 22 comments
Closed

Need help debugging potential libp2p issue #3205

vivekvpandya opened this issue Dec 7, 2022 · 22 comments
Labels
need/author-input Needs input from the original author

Comments

@vivekvpandya
Copy link

Summary

Hello, libp2p community I work on substrate based project and when I am trying to upgrade to substrate version from 0.9.23 to 0.9.26 or 0.9.29 the substrate node does not connect to parachain peer to sync data. Is there any issue with libp2p version for backward compatibility? as production deployed version uses 0.9.23 substrate and libp2p version must be old according to that.
Here is relevant issue on substrate repo Client side connection issues from polkadot v0.9.23 to v0.9.26 · Issue #12704 · paritytech/substrate · GitHub
However I need more help understanding debug log generated when executing the node.
Here is complete log filtered for “Parachain” gist:352ae47d5c79aef149033aea71cbd9a7 · GitHub

And below is last few relevant lines to quick glance. Kindly please help what is going wrong.

022-12-01 11:29:56.055 DEBUG tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p <= Starting random Kademlia request for PeerId("1AjEYGcGPLv9mwn3viLg3KZg8zaWRL9Du65G46RKukotq4")    
2022-12-01 11:29:56.057 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Addresses of PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR"): ["/dns/bootnode-0.zeitgeist.pm/tcp/30333", "/dns/bootnode-0.zeitgeist.pm/tcp/30333/ws", "/dns/bootnode-0.zeitgeist.pm/tcp/30333", "/dns/bootnode-0.zeitgeist.pm/tcp/30333/ws"]    
2022-12-01 11:29:56.058 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Dialing(PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR"))    
2022-12-01 11:29:56.058 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Addresses of PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7"): ["/dns/bootnode-1.zeitgeist.pm/tcp/30333", "/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws", "/dns/bootnode-1.zeitgeist.pm/tcp/30333", "/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws"]    
2022-12-01 11:29:56.059 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Dialing(PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7"))    
2022-12-01 11:29:56.069 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Reach failure for Some(PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7")) through "/dns/bootnode-1.zeitgeist.pm/tcp/30333/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7": Other(Custom { kind: Other, error: Timeout })    
2022-12-01 11:29:56.070 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Reach failure for Some(PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7")) through "/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7": Other(Custom { kind: Other, error: Timeout })    
2022-12-01 11:29:56.070 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Dial failure for PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7")    
2022-12-01 11:29:56.070 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM <= Dropped(12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7, SetId(0))    
2022-12-01 11:29:56.070 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Failed to reach PeerId("12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bootnode-1.zeitgeist.pm/tcp/30333/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7", Other(Custom { kind: Other, error: Timeout })), ("/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7", Other(Custom { kind: Other, error: Timeout }))].    
2022-12-01 11:29:56.071 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM => Connect(12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7, SetId(0)): Will start to connect at until Instant { tv_sec: 206462, tv_nsec: 344932773 }    
2022-12-01 11:29:56.071 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Reach failure for Some(PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR")) through "/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR": Other(Custom { kind: Other, error: Timeout })    
2022-12-01 11:29:56.071 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Reach failure for Some(PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR")) through "/dns/bootnode-0.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR": Other(Custom { kind: Other, error: Timeout })    
2022-12-01 11:29:56.072 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Dial failure for PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR")    
2022-12-01 11:29:56.072 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM <= Dropped(12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR, SetId(0))    
2022-12-01 11:29:56.072 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Failed to reach PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR", Other(Custom { kind: Other, error: Timeout })), ("/dns/bootnode-0.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR", Other(Custom { kind: Other, error: Timeout }))].    
2022-12-01 11:29:56.072 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM => Connect(12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR, SetId(0)): Will start to connect at until Instant { tv_sec: 206462, tv_nsec: 346486503 }    
2022-12-01 11:29:56.073 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Query for 0020bf77554404ab51602c16cd1cc1303e818b93ed0d3ec7fd092948018ead25ddef yielded 0 results    
2022-12-01 11:29:58.104  INFO tokio-runtime-worker substrate: [🌔 Zeitgeist Parachain] 💤 Idle (0 peers), best: #0 (0x1bf2…e060), finalized #0 (0x1bf2…e060), ⬇ 0 ⬆ 0

Expected behaviour

Expected behaviour is that client connects fine with old version node which uses substrate 0.9.23.

Actual behaviour

As shown on logs above it gives timeout error.

Detailed Log

Here is complete log filtered for “Parachain” gist:352ae47d5c79aef149033aea71cbd9a7 · GitHub

Version

  • libp2p version (version number, commit, or branch):
    0.46.1

Would you like to work on fixing this bug?

No

@thomaseizinger
Copy link
Contributor

Without having the logs of the node you are connecting to, it is hard to say why the connection times out. Do you know what libp2p version the remote node is running?

@vivekvpandya
Copy link
Author

vivekvpandya commented Dec 8, 2022

Without having the logs of the node you are connecting to, it is hard to say why the connection times out.

@thomaseizinger could please provide exact command to get logs you need? I will try get it from remote node.

Do you know what libp2p version the remote node is running?

Based on https://github.com/paritytech/substrate/blob/polkadot-v0.9.23/client/network/Cargo.toml it must be 0.44.0

@Chralt98
Copy link

Chralt98 commented Dec 8, 2022

The jump was from libp2p version 0.44.0 to 0.46.1. I think this issue is helpful for you @thomaseizinger paritytech/substrate#12704 (comment)

I created a local bootnode for polkadot version 0.9.23 (with libp2p 0.44.0) there AND a local client node with polkadot version 0.9.26 (with libp2p 0.46.1) and couldn't connect. I used debug level, so feel free to reach out when you need trace level.

@vivekvpandya
Copy link
Author

bootnode.log

@vivekvpandya
Copy link
Author

Above attached node seems to be from bootnode-2, following two errors are interesring in the log

  2 Dec 08 05:49:41 zeitgeist-main-bootnode-2 zeitgeist[73543]: 2022-12-08 05:49:41 [🌔 Zeitgeist Parachain] ❌ Error while dialing /dns/telemetry.zeitgeist.pm/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Other(A(Handshake("server rejected handshake; status code = 502"))) }
    1 Dec 08 05:49:41 zeitgeist-main-bootnode-2 zeitgeist[73543]: 2022-12-08 05:49:41 [Relaychain] 💤 Idle (13 peers), best: #15664544 (0x261f…644f), finalized #15664541 (0x06f5…690d), ⬇ 216.5kiB/s ⬆ 354.7kiB/s
9     Dec 08 05:49:41 zeitgeist-main-bootnode-2 zeitgeist[73543]: 2022-12-08 05:49:41 [🌔 Zeitgeist Parachain] 💔 The bootnode you want to connect provided a different peer ID than the one you expect: `12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7` with `12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7`:`D      ialer { address: "/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7", role_override: Dialer }`.
[3:40 PM]

seems like bootnode-1 has different PeerID then expected

I am trying to get logs from other bootnodes too.

@thomaseizinger
Copy link
Contributor

seems like bootnode-1 has different PeerID then expected

Both of these PeerIds are the same though 😄
That sounds like either a bug in the log message or in the comparison that comes before that! I'll have a look but initially, nothing comes to my mind for that changeset that would affect this.

@vivekvpandya
Copy link
Author

full_parachain.zip
Attached zipped full log
of local client which has ID 12D3KooWAd5NrFfuCWYr2hqhpJqKWoQb7cbNJB6xh3u9ikmHvxNb

log_bootnode-1.txt
also attaching full log from bootnode which has ID 12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR.

In bootnode log I don't see anything related to 12D3KooWAd5NrFfuCWYr2hqhpJqKWoQb7cbNJB6xh3u9ikmHvxNb
which makes me wonder if message from my local client never reaches bootnode.

@thomaseizinger could you please shed some light?

@thomaseizinger
Copy link
Contributor

So the first thing that is suspicious in full_parachain.txt is:

2022-12-09 13:30:01.508 DEBUG main peerset: [🌔 Zeitgeist Parachain] Duplicate bootnode in config: PeerId("12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR")

In bootnode log I don't see anything related to 12D3KooWAd5NrFfuCWYr2hqhpJqKWoQb7cbNJB6xh3u9ikmHvxNb
which makes me wonder if message from my local client never reaches bootnode.

Can you show the code of the bootnode? My suspicion is that not everything that would be interesting is logged there.

I also tried to connect to the bootnode via https://github.com/mxinden/libp2p-lookup and it also fails with a timeout:

❯ RUST_LOG=debug cargo run -- direct --address=/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR
    Finished dev [unoptimized + debuginfo] target(s) in 0.18s
     Running `target/debug/libp2p-lookup direct --address=/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR`
[2022-12-20T05:09:20Z DEBUG libp2p_dns] Dialing /ip4/172.105.158.248/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR
[2022-12-20T05:09:20Z DEBUG libp2p_tcp] dialing 172.105.158.248:30333
[2022-12-20T05:09:21Z DEBUG multistream_select::dialer_select] Dialer: Proposed protocol: /noise
[2022-12-20T05:09:40Z ERROR libp2p_lookup] Lookup failed: Timeout.

This makes me think that something is definitely wrong with the configuration of the bootnode.

@vivekvpandya
Copy link
Author

vivekvpandya commented Dec 20, 2022

Thanks @thomaseizinger for spending time one this!

Can you show the code of the bootnode? My suspicion is that not everything that would be interesting is logged there.

In substrate libp2p is used at https://github.com/paritytech/substrate/tree/polkadot-v0.9.23/client/network
And above is the specific version which is being used in bootnode.

This makes me think that something is definitely wrong with the configuration of the bootnode.

Please let me know if you any experiment in your mind, I will collect data/logs etc for our analysis.

@vivekvpandya
Copy link
Author

One thing also makes me curious is following

2022-12-20 14:34:49 [🌔 Zeitgeist Parachain] 🔍 Discovered new external address for our node: /ip6/2405:201:202e:9031:73e4:f247:95b2:467e/tcp/30333/ws/p2p/12D3KooWBMH5hPmRRZaYkrQDH5jRwYRybaZxUzRxEzEKVZWMFJga    
2022-12-20 14:34:49 [Relaychain] 🔍 Discovered new external address for our node: /ip6/2405:201:202e:9031:73e4:f247:95b2:467e/tcp/30334/ws/p2p/12D3KooWF5YZDRDBELL7AiWa96m78erELrRY2F6foNrcVE4MVkEQ    
2022-12-20 14:34:49 [Relaychain] 🔍 Discovered new external address for our node: /ip4/49.36.71.42/tcp/30334/ws/p2p/12D3KooWF5YZDRDBELL7AiWa96m78erELrRY2F6foNrcVE4MVkEQ   

Here for relay chain I can see both ip4 and ip6 addresses but for para-chain only ip6 , why ?

@vivekvpandya
Copy link
Author

Update:
with following parameters it syncs

--tmp --chain=battery_station --reserved-only --reserved-nodes /ip4/139.144.177.248/tcp/30337/ws//p2p/12D3KooWRPRTKzddhmbeBNSRuJXiWK6crxeL7StGeVzDCf7hto99

if I remove ws/ from MultiAddress then it does not sync.

@thomaseizinger
Copy link
Contributor

One thing also makes me curious is following

2022-12-20 14:34:49 [🌔 Zeitgeist Parachain] 🔍 Discovered new external address for our node: /ip6/2405:201:202e:9031:73e4:f247:95b2:467e/tcp/30333/ws/p2p/12D3KooWBMH5hPmRRZaYkrQDH5jRwYRybaZxUzRxEzEKVZWMFJga    
2022-12-20 14:34:49 [Relaychain] 🔍 Discovered new external address for our node: /ip6/2405:201:202e:9031:73e4:f247:95b2:467e/tcp/30334/ws/p2p/12D3KooWF5YZDRDBELL7AiWa96m78erELrRY2F6foNrcVE4MVkEQ    
2022-12-20 14:34:49 [Relaychain] 🔍 Discovered new external address for our node: /ip4/49.36.71.42/tcp/30334/ws/p2p/12D3KooWF5YZDRDBELL7AiWa96m78erELrRY2F6foNrcVE4MVkEQ   

Here for relay chain I can see both ip4 and ip6 addresses but for para-chain only ip6 , why ?

I am not at all familiar with that code. Are these two different Swarms or only just a single one?

if I remove ws/ from MultiAddress then it does not sync.

I assume the double slash in there is a typo? That is not a valid address to begin with.

If a node only listens on the websocket interface, then removing that bit from the address will have it fail to connect.

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

The jump was from libp2p version 0.44.0 to 0.46.1. I think this issue is helpful for you @thomaseizinger paritytech/substrate#12704 (comment)

I created a local bootnode for polkadot version 0.9.23 (with libp2p 0.44.0) there AND a local client node with polkadot version 0.9.26 (with libp2p 0.46.1) and couldn't connect. I used debug level, so feel free to reach out when you need trace level.

Responded here: paritytech/substrate#12704 (comment)

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

💔 The bootnode you want to connect provided a different peer ID than the one you expect: 12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7 with 12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7:D ialer { address: "/dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7", role_override: Dialer }

Looking at the libp2p codebase this can only happen in one case: when a node tries to dial itself (via external address) and fails because

if self.local_id == obtained_peer_id {
. This error is not intuitive and should've probably be called smth like LocalPeerId, but here we are. (cc @thomaseizinger)

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

Here for relay chain I can see both ip4 and ip6 addresses but for para-chain only ip6 , why ?

Here where? If you're referring to https://github.com/libp2p/rust-libp2p/files/10193919/full_parachain.zip log, then it contains this line:

2022-12-09 13:30:01.647 DEBUG tokio-runtime-worker libp2p_tcp: [🌔 Zeitgeist Parachain] New listen address: /ip4/172.18.0.1/tcp/30333
2022-12-09 13:30:01.647 DEBUG tokio-runtime-worker libp2p_websocket::framed: [🌔 Zeitgeist Parachain] Listening on /ip4/172.18.0.1/tcp/30333/ws
2022-12-09 13:30:01.647 TRACE             async-io async_io::driver: main_loop: waiting on I/O
2022-12-09 13:30:01.647 DEBUG tokio-runtime-worker libp2p_swarm: [🌔 Zeitgeist Parachain] Listener ListenerId(17562524388022475163); New address: "/ip4/172.18.0.1/tcp/30333/ws"

where the parachain binds to 172.18.0.1:30333.

@Chralt98
Copy link

#2441 (review)

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

I'd suggest to close this issue and move the discussion to paritytech/substrate#12704.

@thomaseizinger I'll let you and the team decide if you want to do anything with #2441 (review).

@mxinden
Copy link
Member

mxinden commented Jan 23, 2023

@thomaseizinger I'll let you and the team decide if you want to do anything with #2441 (review).

I would be in favor of a more elaborate error name. Please open an issue or provide a patch in case you feel strongly about this @melekes.

@thomaseizinger
Copy link
Contributor

thomaseizinger commented Jan 23, 2023

I am also okay with introducing another error variant for this.

melekes added a commit to melekes/rust-libp2p that referenced this issue Jan 24, 2023
to diff between wrong peer ID and local peer ID.

Refs libp2p#3205
mergify bot pushed a commit that referenced this issue Jan 26, 2023
#3377)

Previously, inbound connections that happened to resolve to our own `PeerId` were reported as `WrongPeerId`. With this patch, we now report those in a dedicated `LocalPeerId` error.

Related: #3205.
@thomaseizinger
Copy link
Contributor

Is this still an issue?

@thomaseizinger thomaseizinger added the need/author-input Needs input from the original author label Jun 15, 2023
@github-actions
Copy link

Oops, seems like we needed more information for this issue, please comment with more details or this issue will be closed in 7 days.

@github-actions
Copy link

This issue was closed because it is missing author input.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/author-input Needs input from the original author
Projects
None yet
Development

No branches or pull requests

5 participants