Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 #12704
Comments
Do you have any logs you could share from the bootnodes when a polkadot v0.9.26 peer tries to connect? Based on this alone it's hard to say why the connection is rejected. |
I spent more time looking into this. I don't have the bootnode logs for you, but I have the following information. Look at your test here. On the 0.9.26 version we never get the case that substrate/client/network/src/discovery.rs Lines 316 to 354 in 1cca061
We are only getting For the supported protocols we have the following information. What is interesting here, is that we identify with the Zeitgeist Parachain two types of peers: First type:
Second type:
The second one has the matching protocols that we need, but it does not have Why could In addition to that I get |
Yeah the protocol negotiation for the Zeitgeist protocols is not happening on the new version. Especially the protocol |
This above log happens on the old version, but not on the new one. |
We did change some DHT-related stuff a while back (such as removing support for multiple DHTs) but I'm not convinced this issue is related to Kademlia. The nodes are discovered through some means and an attempt at establishing a connection is made to both but they timeout so I think the question here is why do they timeout. If you don't have access to these bootnodes' logs, are you able to reproduce this issue by running the nodes locally and provide us with logs from both the "local bootnode" and local polkadot v0.9.26 node? |
@altonen Thanks for your quick answer. I currently don't know how to start a local bootnode for v0.9.23 to simulate the behaviour to connect a v0.9.23 bootnode with a new v0.9.26 node and then give you the log output of the v0.9.23 node. Can you please help me out there? Is it this tutorial ? Apart from that I think this issue seems related to our problem. Maybe it's better to use a newer polkadot version indeed. (I also experience the log |
Yeah, the article you linked should allow you to setup a local network of nodes and test whether this issue is reproducible and if so, see what happens on bootnode side from the logs. That error message may be related to some response channel getting dropped due to request timeout and may be related to this error you've posted earlier.
|
What I also found interesting was the following: On the 0.9.26 side parachain node:
On the 0.9.23 side bootnode:
|
After 1.5 hours the 0.9.26 node finally synced with other 0.9.23 nodes on the network (even without running a local 0.9.23 node). Do you know why we couldn't connect earlier? Thanks a lot for your answers here. EDIT: I tried the same setup again and after two and a half hours I still couldn't connect. So unfortunately the issue remains. |
Just to note, this issue persists after the upgrade to 0.9.29. |
@bkchr I want to investigate the following
I see that in client/network/src/discovery.rs it never gets to execute inject_new_external_addr() |
Cc @melekes @dmitry-markin can you help here please? |
I'd guess if |
In
This means another process (instance of polkadot) is already listening on |
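As a rough illustration of the "address already in use" situation described above, the following sketch uses only the Rust standard library. The helper name `address_in_use` is hypothetical and is not part of Substrate or libp2p; it simply tries to bind a TCP listener and reports whether the operating system rejects the bind because another socket already holds the address.

```rust
use std::io::ErrorKind;
use std::net::{SocketAddr, TcpListener};

/// Hypothetical helper (not a Substrate/libp2p API): returns `Ok(true)` if
/// `addr` is already held by another listening socket, `Ok(false)` if a
/// listener could be bound to it, and `Err` for any other I/O failure.
pub fn address_in_use(addr: SocketAddr) -> std::io::Result<bool> {
    match TcpListener::bind(addr) {
        // The bind succeeded; the temporary listener is dropped right away,
        // which frees the port again.
        Ok(_listener) => Ok(false),
        // The OS refused the bind because the address is taken.
        Err(err) if err.kind() == ErrorKind::AddrInUse => Ok(true),
        // Anything else (permissions, invalid address, ...) is passed on.
        Err(err) => Err(err),
    }
}
```

Calling it with, for example, `"127.0.0.1:30333".parse().unwrap()` before starting a second node on the same machine would show whether the port is still free. The port number here is only taken from the bootnode address used later in this thread.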
I will get you fresh logs, as the relay chain works fine anyway. So the above error seems to be the case you pointed out: I tried to spawn two nodes on the same machine. |
@melekes
However, I don't see the "Can't listen" warning:
    // Listen on multiaddresses.
    for addr in &params.network_config.listen_addresses {
        info!(target: "sub-libp2p", "Listen Address: {:?}", addr);
        if let Err(err) = Swarm::<Behaviour<B, Client>>::listen_on(&mut swarm, addr.clone()) {
            info!(target: "sub-libp2p", "Can't listen on {} because: {:?}", addr, err)
        }
    } |
That's good, right? |
Could you please post debug logs of the parachain node here? |
@bkchr is your review comment in libp2p/rust-libp2p#2441 (review) the cause of the problem here? It seems related to libp2p/rust-libp2p#3205 (comment) |
log_parachain.txt |
Do you happen to have logs of To summarise the original issue: https://github.com/zeitgeistpm/zeitgeist parachain node can't connect to the two bootnodes due to timeout. |
I did get those logs once, but could not see anything concerning. Is there a specific thing that should be logged there? |
Well, for starters, do they receive a connection request from |
@melekes I spawned two nodes on two different systems, which are on the same network via a router. @melekes, are you also on Discord or any other IM service? Then we could work together faster. |
If the nodes are running on two different systems, please specify one node as the bootnode of the other node. We have mDNS, but not as an ultra-polished feature, because you don't need it in real networks. |
So I did a small experiment on a local network where 192.168.29.90 has a node running version v0.9.23 (which syncs fine with the bootnodes)
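To make "specify one node as bootnode of the other" concrete, here is a small sketch of how the bootnode address string could be assembled. The function `bootnode_multiaddr` is a hypothetical helper for illustration only; it just formats the `/ip4/.../tcp/.../p2p/...` textual form that is passed to `--bootnodes` elsewhere in this thread, and does no validation of the peer ID.

```rust
use std::net::Ipv4Addr;

/// Hypothetical helper: formats a plain-TCP bootnode multiaddr string from
/// the bootnode's IPv4 address, its TCP port, and its peer ID.
/// The peer ID is copied verbatim and is not validated.
pub fn bootnode_multiaddr(ip: Ipv4Addr, tcp_port: u16, peer_id: &str) -> String {
    format!("/ip4/{}/tcp/{}/p2p/{}", ip, tcp_port, peer_id)
}
```

The resulting string is what the second node would receive as its `--bootnodes` argument.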
attached logs for both. |
why did you add |
That is how I can get a pass and failing case. |
So when you remove correct? |
Correct, but I have never tried /ws with 0.9.23, as it works without it. |
I've just tried to run In both cases I get timeout errors. |
Yes, my experiment was on a local-network setup. With DNS-based bootnodes, even the /ws trick does not produce a working case |
On the LAN, a node at 192.168.29.90 runs a zeitgeist node with polkadot version 0.9.23 (the last known working version). Logs for the non-working case are at: https://drive.google.com/file/d/1xdvskcLN1h9W4ixu27M58AnCPGQcDu7H/view?usp=sharing
./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/30333/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNg -l sub-libp2p=trace
The .zip file with the logs also contains logs from the node on 192.168.29.90 |
This is expected behavior, because the node listens on |
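The bootnode address used in the failing command carries a `/ws` component. As a sketch of how one might check which transport a multiaddr string asks for before dialing, the following standard-library-only example splits the string on `/` and looks for the `tcp` and `ws`/`wss` components. The `Transport` enum and `requested_transport` function are hypothetical names for illustration; this is not how libp2p parses multiaddrs internally.

```rust
/// Transport requested by a multiaddr string.
/// Hypothetical type for illustration, not libp2p's API.
#[derive(Debug, PartialEq, Eq)]
pub enum Transport {
    Tcp,
    WebSocket,
    Other,
}

/// Inspects the textual components of a multiaddr such as
/// `/ip4/192.168.29.90/tcp/30333/ws/p2p/<peer id>`.
pub fn requested_transport(multiaddr: &str) -> Transport {
    // Components are separated by '/'; the leading '/' yields an empty
    // first item, which is filtered out.
    let parts: Vec<&str> = multiaddr.split('/').filter(|p| !p.is_empty()).collect();
    let has_tcp = parts.iter().any(|p| *p == "tcp");
    let has_ws = parts.iter().any(|p| *p == "ws" || *p == "wss");
    match (has_tcp, has_ws) {
        // WebSocket is layered on top of TCP.
        (true, true) => Transport::WebSocket,
        (true, false) => Transport::Tcp,
        _ => Transport::Other,
    }
}
```

A dialer that requests WebSocket against a listener that only accepts plain TCP (or the reverse) will not complete the handshake, so comparing the requested transport with the bootnode's actual listen address is a cheap first check.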
Agreed, but with our public bootnodes I am not able to connect with /ws, or when using --dev without /ws in the bootnode address. |
I have captured packets for the following commands. There is also a working case, where I am running the node on LAN address 192.168.29.90. In the failing case I see an RST being sent to 172.105.158.248, the connection breaks, and this repeats in a loop. In the working case, however, an RST is never sent to 192.168.29.90 and the connection succeeds. |
So in the failing case we're terminating conn with a timeout (and send |
Reverse experiment: on 192.168.29.90 I started a node built from the tip of the main branch. From another machine, 192.168.29.161, I started the old working version (0.3.6) of the zeitgeist node with the following command:
I see both nodes find each other but they could not finalize any blocks. |
@vivekvpandya are you sure your bootnode is reachable? for example I can't ping it
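Since ICMP ping can be blocked even when the node's port is open, a TCP-level check is sometimes more informative. The sketch below uses only the Rust standard library; `tcp_reachable` is a hypothetical helper name. It attempts a plain TCP connection with a timeout and says nothing about whether the libp2p handshake on top of it would succeed.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP connection to `addr` can be
/// established within `timeout`. This only tests TCP reachability, not the
/// protocol negotiation that follows. `timeout` must be non-zero, otherwise
/// `connect_timeout` returns an error and this function returns false.
pub fn tcp_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

For a bootnode one would pass its IP address and P2P port (30333 in the addresses quoted in this thread) together with a timeout of a few seconds.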
|
are there any reasons you could see in the log as to why? |
The above is shown in the 0.9.23 version. This does not happen after the upgrade to 0.9.26. It's worth mentioning that this message informs us that a new external address was found for ip4. After that message, the 0.9.23 version did find peers and synced with them.
We experience client side connection issues after the dependency update to polkadot-v0.9.26 in this PR. We would like to see the local node successfully synchronise with other nodes in the network. This correctly happens with the polkadot-v0.9.23 version. I need to mention that we used Moonbeam dependencies of Substrate and a couple of others before. Now we use the official dependencies. When I started the 0.9.23 node (which does sync and find peers) and after that the 0.9.26 node, the updated version synced too. But when the 0.9.26 node ran alone, it did not sync with the external parachain nodes.
The following logs are noticeable with the 0.9.26 upgrade: