Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 #12704

Chralt98 · 2022-11-14T16:07:58Z

2022-11-14 16:54:27.169  INFO tokio-runtime-worker sub-libp2p: [Relaychain] 🔍 Discovered new external address for our node: /ip4/...  
2022-11-14 16:54:27.248  INFO tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] 🔍 Discovered new external address for our node: /ip4/...

The above is shown in the 0.9.23 version. This does not happen after the upgrade to 0.9.26. It's worth to mention, that this messages informs us, that there was a new external address found for ip4. After that message, the 0.9.23 version did found peers and sync with them.

We experience client side connection issues after the dependency update to polkadot-v0.9.26 in this PR. We would like to see, that the local node successfully synchronises with other nodes in the network. This correctly happens with the polkadot-v0.9.23 version. I need to mention, that we used Moonbeam dependencies of Substrate and a couple of others before. Now we use the official dependencies.

When I started the node for 0.9.23 (which does sync and found peers) and after that the 0.9.26 runtime, the updated version did sync too. But when the 0.9.26 run alone, it did not sync with the outer parachain nodes.

The following logs are noticeable with the 0.9.26 upgrade:

RUST_LOG=libp2p=trace ./target/debug/zeitgeist --chain=battery_station --tmp --detailed-log-output

2022-11-14 17:16:08.423 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Ignoring self-reported non-global address /ip4/127.0.0.1/tcp/30334/ws from 12D3KooWEbhud9TrHYtzK2K1fpKeC4PjDu7bKnHTn8QYzyrfBJsK.    
2022-11-14 17:16:08.424 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Ignoring self-reported non-global address /ip4/192.168.1.3/tcp/30334/ws from 12D3KooWEbhud9TrHYtzK2K1fpKeC4PjDu7bKnHTn8QYzyrfBJsK.
...

2022-11-14 17:17:30.895 TRACE tokio-runtime-worker sub-libp2p: [Relaychain] Libp2p => Failed to reach PeerId("12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bsr.zeitgeist.pm/tcp/30338/p2p/12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30338/ws/p2p/12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK", Other(Custom { kind: Other, error: Timeout }))].    

...

2022-11-14 17:23:01.013 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Failed to reach PeerId("12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bsr.zeitgeist.pm/tcp/30337/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30337/ws/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout }))].
...

2022-11-14 17:22:55.965 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM => Connect(12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt, SetId(0)): Will start to connect at until Instant { t: 1910374941204 }  

...

2022-11-14 17:10:14.498 DEBUG tokio-runtime-worker libp2p_swarm: [🌔 Zeitgeist Parachain] Connection attempt to PeerId("12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt") failed with Transport([("/dns/bsr.zeitgeist.pm/tcp/30337/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30337/ws/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout }))]).

The text was updated successfully, but these errors were encountered:

bkchr · 2022-11-14T20:19:44Z

CC @altonen @dmitry-markin

altonen · 2022-11-15T07:05:07Z

Do you have any logs you could share from the bootnodes when a polkadot v0.9.26 peer tries to connect? Based on this alone it's hard to say why the connection is rejected.

Chralt98 · 2022-11-16T11:09:46Z

I investigated more time into it. I don't have the bootnode logs for you, but I have the following information.

Look at your test here. On the 0.9.26 version we never get the case, that Adding self-reported address ... to Kademlia DHT /battery_station_v2/kad happens. So I assume, that the new 0.9.26 protocol is unsupported.

substrate/client/network/src/discovery.rs

Lines 316 to 354 in 1cca061

    
           	/// Add a self-reported address of a remote peer to the k-buckets of the supported 
        
           	/// DHTs (`supported_protocols`). 
        
           	/// 
        
           	/// **Note**: It is important that you call this method. The discovery mechanism will not 
        
           	/// automatically add connecting peers to the Kademlia k-buckets. 
        
           	pub fn add_self_reported_address( 
        
           		&mut self, 
        
           		peer_id: &PeerId, 
        
           		supported_protocols: impl Iterator<Item = impl AsRef<[u8]>>, 
        
           		addr: Multiaddr, 
        
           	) { 
        
           		if !self.allow_non_globals_in_dht && !self.can_add_to_dht(&addr) { 
        
           			trace!(target: "sub-libp2p", "Ignoring self-reported non-global address {} from {}.", addr, peer_id); 
        
           			return 
        
           		} 
        
           		let mut added = false; 
        
           		for protocol in supported_protocols { 
        
           			for kademlia in self.kademlias.values_mut() { 
        
           				if protocol.as_ref() == kademlia.protocol_name() { 
        
           					trace!( 
        
           						target: "sub-libp2p", 
        
           						"Adding self-reported address {} from {} to Kademlia DHT {}.", 
        
           						addr, peer_id, String::from_utf8_lossy(kademlia.protocol_name()), 
        
           					); 
        
           					kademlia.add_address(peer_id, addr.clone()); 
        
           					added = true; 
        
           				} 
        
           			} 
        
           		} 
        
           		if !added { 
        
           			trace!( 
        
           				target: "sub-libp2p", 
        
           				"Ignoring self-reported address {} from {} as remote node is not part of any \ 
        
           				 Kademlia DHTs supported by the local node.", addr, peer_id, 
        
           			); 
        
           		} 
        
           	}

We are only getting as remote node is not part of any Kademlia DHTs supported by the local node. or Ignoring self-reported non-global address for the 0.9.26 node.

For the supported protocols we have the following information.

What is interesting here, is that we identify with the Zeitgeist Parachain two types of peers:

First type:

2022-11-16 11:41:20.545 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Identified PeerId("12D3KooWDZq9BhTwew7Xp4mtwaBeN5s2TCm8TYTAqPo8YStFrgkv") => IdentifyInfo { public_key: Ed25519(PublicKey(compressed): 37b588bf639c3ebf5529dd2c6264e47c13315bdeca471066b52de8f014feee97), protocol_version: "/substrate/1.0", agent_version: "Zeitgeist Node/v0.3.6-882feb9d51d (whimsical-trucks-0112)", listen_addrs: ["/ip4/127.0.0.1/tcp/30334/ws", "/ip4/192.168.1.3/tcp/30334/ws", "/ip6/::1/tcp/30334/ws", "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/30334/ws", "/ip6/2a02:3032:e:20aa:534:acf1:ad35:6943/tcp/30334/ws"], protocols: ["/battery_station_relay_v3/block-announces/1", "/battery_station_relay_v3/transactions/1", "/fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1", "/paritytech/grandpa/1", "/polkadot/validation/1", "/polkadot/collation/1", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/battery_station_relay_v3/kad", "/battery_station_relay_v3/sync/2", "/polkadot/req_statement/1", "/polkadot/req_available_data/1", "/battery_station_relay_v3/state/2", "/polkadot/req_chunk/1", "/polkadot/req_pov/1", "/polkadot/send_dispute/1", "/polkadot/req_collation/1", "/battery_station_relay_v3/sync/warp", "/battery_station_relay_v3/light/2"], observed_addr: "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/51415/ws" }

Second type:

2022-11-16 11:43:15.300 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Identified PeerId("12D3KooWPaKTnZxZToLacZ5NR2myRmrHNCAgrsuaSuRYZrtJ8Tim") => IdentifyInfo { public_key: Ed25519(PublicKey(compressed): cc6a6cfed61859aebb1fefbdcef4a81bc6b1aaf564bffe7b26464dd7ea4ebc66), protocol_version: "/substrate/1.0", agent_version: "Zeitgeist Node/v0.3.6-31a123a84eb (abusive-mitten-3176)", listen_addrs: [], protocols: ["/battery_station_v2/block-announces/1", "/battery_station_v2/transactions/1", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/battery_station_v2/kad", "/battery_station_v2/light/2", "/battery_station_v2/sync/2", "/battery_station_v2/state/2"], observed_addr: "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/30333/ws/p2p/12D3KooWFuTZRftstqoDWF4SR69jswSFhvLMhuP1aBSGM9PXUWsq" }

The second one has the matching protocols that we need, but it does not have listen_addrs (So this is empty).

Why could listen_addrs be empty for a peer?

In addition to that I get Random Kademlia query has yielded empty results on the 0.9.26 version.

Chralt98 · 2022-11-16T13:07:35Z

Yeah the protocol negotiation for the Zeitgeist protocols is not happening on the new version. Especially the protocol kad is not negotiated correctly.

Chralt98 · 2022-11-16T13:09:30Z

INFO tokio-runtime-worker beefy: [Relaychain] 🥩 run BEEFY worker, best grandpa: #0.

This above log happens on the old version, but not on the new one.

Chralt98 · 2022-11-16T14:38:58Z

Maybe it has something to do with updating libp2p here and here for polkadot-v0.9.26?

Thanks so much for your support.

altonen · 2022-11-16T17:38:20Z

We did change some DHT-related stuff a while back (such as removing support for multiple DHTs) but I'm not convinced this issue is related to Kademlia. The nodes are discovered through some means and an attempt at establishing a connection is made to both but they timeout so I think the question here is why do they timeout.

If you don't have access to these bootnodes' logs, are you able to reproduce this issue by running the nodes locally and provide us with logs from both the "local bootnode" and local polkadot v0.9.26 node?

Chralt98 · 2022-11-17T07:30:49Z

@altonen Thanks for your quick answer. I currently don't know how to start a local bootnode for v0.9.23 to simulate the behaviour to connect a v0.9.23 bootnode with a new v0.9.26 node and then give you the log output of the v0.9.23 node. Can you please help me out there? Is it this tutorial ?

Apart from that I think this issue seems related to our problem. Maybe it's better to use a newer polkadot version indeed. (I also experience the log WARN async-std/runtime trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender)

altonen · 2022-11-17T11:34:39Z

Yeah, the article you linked should allow you to setup a local network of nodes and test whether this issue is reproducible and if so, see what happens on bootnode side from the logs.

That error message may be related to some response channel getting dropped due to request timeout and may be related to this error you've posted earlier.

... An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR", Other(Custom { kind: Other, error: Timeout })),

Chralt98 · 2022-11-17T16:27:45Z

What I also found interesting was the following:

On the 0.9.26 side parachain node:

2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/beefy/1
2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: confirming protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1    
2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: sent confirmed protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1

...

2022-11-17 12:34:03.203 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /paritytech/beefy/1  

...

2022-11-17 12:34:03.437 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/beefy/1

On the 0.9.23 side bootnode:

2022-11-17 12:32:36.257 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/beefy/1    
2022-11-17 12:32:36.257 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/grandpa/1

...

2022-11-17 12:32:37.233 DEBUG tokio-runtime-worker multistream_select::dialer_select: [Relaychain] Dialer: Received rejection of protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/beefy/1

Chralt98 · 2022-11-18T09:27:14Z

After 1,5 hours the 0.9.26 node finally syncs with other 0.9.23 nodes on the network (even without running a local 0.9.23 node). Do you know, why couldn't we connect earlier? Thanks a lot for your answers here.

EDIT: I tried the same setup and after two and a half hours I couldn't connect. So unfortunately the issue remains.

vivekvpandya · 2022-12-01T06:36:46Z

Just to note this issue still continues with upgrade to 0.9.29 .

vivekvpandya · 2023-01-17T09:57:38Z

@bkchr I want to investigate on following

But the thing is for relay chain it shows ` [Relaychain] 🔍 Discovered new external address for our node: /ip4/3.xx.xx.53/tcp/7676/ws/p2p/12D3KooWLhkaoFSGV2JfXea1BGTEAv2R5PE4v2b1nHZMo1ZGtxxx ` And but for parachain it shows nothing like this

I see that in client/network/src/discovery.rs it never gets to execute inject_new_external_addr()
Can someone from substrate team help me if ip4 interface gets created or not for para-chain on local node?

bkchr · 2023-01-17T12:47:16Z

Cc @melekes @dmitry-markin can you help here please?

melekes · 2023-01-18T04:54:17Z

Why could listen_addrs be empty for a peer?

I'd guess if listen_addresses in network config are empty. Otherwise, look for Can't listen on XXX because warning in logs.

melekes · 2023-01-18T05:05:12Z

debug_logs.zip

Thanks @altonen , that you look into it.

In debug_0.9.26_node.txt:

2022-11-17 13:28:11.310  WARN main sub-libp2p: [Relaychain] Can't listen on /ip4/0.0.0.0/tcp/30334/ws because: Other(Custom { kind: Other, error: Other(A(A(A(
A(Transport(A(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" })))))))) })

This means another process (instance of polkadot) is already listening on 30334. Were you trying to spawn two processes on the same machine?

vivekvpandya · 2023-01-18T06:27:15Z

I will get you a fresh logs, as Relaychain anyway works fine. So above error seems the case as you pointed, tried to spawn two nodes on same machine.

vivekvpandya · 2023-01-18T07:35:13Z

@melekes
so I am getting following things (changed logs slightly to help myself
)

2023-01-18 13:02:46 [Relaychain] Listen Address: "/ip6/::/tcp/30334/ws"    
2023-01-18 13:02:46 [Relaychain] Listen Address: "/ip4/0.0.0.0/tcp/30334/ws"  
..
38:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] 🏷  Local node identity is: 12D3KooWRpKYJKRDhAXMMrqZWMRBHLUmn92dGppkAG1tfBW3oY6w    
39:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] Listen Address: "/ip6/::/tcp/30333"    
40:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] Listen Address: "/ip4/0.0.0.0/tcp/30333"

However I don't see Can't Listen warning
here is changed part of code

     6         // Listen on multiaddresses.                                                                                                                                                                       
     5         for addr in &params.network_config.listen_addresses {                                                                                                                                              
+    4             info!(target: "sub-libp2p", "Listen Address: {:?}", addr);                                                                                                                                     
     3             if let Err(err) = Swarm::<Behaviour<B, Client>>::listen_on(&mut swarm, addr.clone()) {                                                                                                         ~    2                 info!(target: "sub-libp2p", "Can't listen on {} because: {:?}", addr, err)                                                                                                                 
     1             }                                                                                                                                                                                              
  435          }

melekes · 2023-01-18T07:38:52Z

However I don't see Can't Listen warning

That's good, right?

melekes · 2023-01-18T07:39:51Z

But the thing is for relay chain it shows [Relaychain] 🔍 Discovered new external address for our node: /ip4/3.xx.xx.53/tcp/7676/ws/p2p/12D3KooWLhkaoFSGV2JfXea1BGTEAv2R5PE4v2b1nHZMo1ZGtxxx And but for parachain it shows nothing like this

Could you please post debug logs of the parachain node here?

Chralt98 · 2023-01-18T07:41:03Z

@bkchr libp2p/rust-libp2p#2441 (review) your review comment is the cause of the problem here? related to this libp2p/rust-libp2p#3205 (comment)

vivekvpandya · 2023-01-18T08:04:07Z

log_parachain.txt
@melekes let me know if you need more detailed log or any specific thing you were looking for is not included.

melekes · 2023-01-18T08:14:02Z

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.

To summarise the original issue: https://github.com/zeitgeistpm/zeitgeist parachain node can't connect to the two bootnodes due to timeout.

vivekvpandya · 2023-01-18T08:46:02Z

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.

To summarise the original issue: after upgrading from Polkadot v0.9.23 to v0.9.26 https://github.com/zeitgeistpm/zeitgeist parachain node no longer can connect to the two bootnodes due to timeout.

I had get those logs once, but could not see anything which is concerning. Do you have specific thing that should be logged there?
then I can try restarting those nodes with certain flags to enable debugging and get logs.

melekes · 2023-01-18T08:51:50Z

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.
To summarise the original issue: after upgrading from Polkadot v0.9.23 to v0.9.26 https://github.com/zeitgeistpm/zeitgeist parachain node no longer can connect to the two bootnodes due to timeout.

I had get those logs once, but could not see anything which is concerning. Do you have specific thing that should be logged there? then I can try restarting those nodes with certain flags to enable debugging and get logs.

Well, for starters, do they receive a connection request from 12D3KooWMj1VrY7319iTfy7qhKxdSz3HEWEC2bvuNyNYdAGEQGdt (parachain node)? If yes, why it takes them longer than 20s to respond. If no, we'd have to go down to the networking layer (using Wire) and check that a) parachain node sends connection request b) bootnodes receive it.

vivekvpandya · 2023-01-18T09:45:27Z

@melekes I spawned two nodes on two different systems which are on same network via router.
They have same code built from main branch, however they are not able to find each other as peer.
The command I used:
./target/debug/zeitgeist --dev --discover-local

@melekes also are you on Discord or any other IM service? Then we can work together faster.
I am online on discord with user id simple_user#7550

bkchr · 2023-01-19T08:42:18Z

@melekes I spawned two nodes on two different systems which are on same network via router.
They have same code built from main branch, however they are not able to find each other as peer.
The command I used:
./target/debug/zeitgeist --dev --discover-local

If the nodes running on two different systems, please specify one node as bootnode of the other node. We have mdns, but not as an ultra polished feature because you don't need this in real networks.

vivekvpandya · 2023-01-20T12:09:01Z

So I did a small experiment with local network where 192.168.29.90 has node running version v0.9.23 (which sync fine with bootnodes)
and other system 192.168.29.161 I run node (main branch which has v0.9.26) with following two different commands

working case
./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
not working case
./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

attached logs for both.
failing_para_log.txt
working_para_log.txt

melekes · 2023-01-26T04:51:39Z

/ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
/ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

why did you add /ws protocol in the first case and did not in the second case?

vivekvpandya · 2023-01-26T05:33:56Z

/ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
/ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

why did you add /ws protocol in the first case and did not in the second case?

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

melekes · 2023-01-26T05:37:46Z

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

So when you remove /ws from addr, it does not work w/ 0.9.26 (master) BUT works with 0.9.23
with /ws it works in both cases

correct?

vivekvpandya · 2023-01-26T05:39:39Z

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

So when you remove /ws from addr, it does not work w/ 0.9.26 (master) BUT works with 0.9.23 with /ws it works in both cases

correct?

Correct but I have never tried/ws with 0.9.23 as it works without it.

melekes · 2023-01-26T05:50:44Z

I've just tried to run zeitgeist with both 0.9.23 and 0.9.26 using the following cmd: ./target/debug/zeitgeist -l sub-libp2p=trace --bootnodes /dns/bootnode0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR

In both cases I get timeout errors.

vivekvpandya · 2023-01-26T05:52:31Z

I've just tried to run zeitgeist with both 0.9.23 and 0.9.26 using the following cmd: ./target/debug/zeitgeist -l sub-libp2p=trace --bootnodes /dns/bootnode0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR

In both cases I get timeout errors.

Yes my experiment was on local network based setup. With dns based bootnodes even /ws tric does not get working case

vivekvpandya · 2023-01-26T12:05:23Z

On LAN a node at 192.168.29.90 runs zeitgiest node with polkadot version 0.9.23 (which last best working)
Now from other computer I start a node with following commands
For not working case
./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/30333/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNg -l sub-libp2p=trace

Logs are at for not working case: https://drive.google.com/file/d/1xdvskcLN1h9W4ixu27M58AnCPGQcDu7H/view?usp=sharing

./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/30333/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNg -l sub-libp2p=trace
Logs for working case: https://drive.google.com/file/d/13KRhsejykotXAwjxtMrjqEv73ik-8sZm/view?usp=sharing

The .zip file for logs also contain logs from node on 192.168.29.90

dmitry-markin · 2023-01-26T14:44:13Z

1. working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
2. not working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

This is expected behavior, because the node listens on /ws endpoint by default, and on bare tcp only when run with --dev or --validator. See the command line option help and the code.

vivekvpandya · 2023-01-27T09:50:52Z

1. working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
2. not working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
This is expected behavior, because the node listens on /ws endpoint by default, and on bare tcp only when run with --dev or --validator. See the command line option help and the code.

Agreed but with our public bootndes I am not able to connect with /ws , or using --dev and not using /ws in bootnode address.

vivekvpandya · 2023-01-27T09:57:45Z

I have captured packets for following commands
here 172.105.158.248 is external address for bootnode1-zeitgeist.pm
./target/debug/zeitgeist --bootnodes /ip4/172.105.158.248/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7

And a working case where I am running node on LAN address 192.168.29.90

I see RST is being send to 172.105.158.248 and connection breaks and this goes in loop. However in working case I see RST is never sent to 192.168.29.90 and it successfully works.
Not sure reason behind RST, is local settings or something from code?
Instead of sending RST it should have sent HTTP/ GET. (as seen in working case)
Here are wireshare packet_captures https://drive.google.com/file/d/1t3_-q0RHXeeLQyeE6-VQOCpcfTuNi_Vp/view?usp=sharing

melekes · 2023-01-30T06:45:23Z

So in the failing case we're terminating conn with a timeout (and send RST). In the working case we're successfully establishing the connection. Hmm 🤔

vivekvpandya · 2023-01-30T11:06:41Z

Reverse Experiment

on 192.168.29.90 I started node build with TOT main branch.

And from other machine 192.168.29.161 I start old working version (0.3.6) of zeitgeist node with following command:

~/Downloads/zeitgeist_parachain --reserved-only --reserved-nodes /ip4/192.168.29.90/tcp/30333/ws/p2p/12D3KooWA9wshwxeNyvPdRN9Y2QnaPQNkprbQLGFbSTe3XypqJek

I see both nodes find each other but they could not finalize any blocks.
For both nodes I see following log
2023-01-30 16:32:32 [🔮 Zeitgeist Parachain] ⚙️ Syncing 139.9 bps, target=#325423 (1 peers), best: #45977 (0xb31f…3a66), finalized #0 (0x1bf2…e060), ⬇ 437.2kiB/s ⬆ 0.6kiB/s

melekes · 2023-02-06T07:37:56Z

@vivekvpandya are you sure your bootnode is reachable? for example I can't ping it

PING 172.105.158.248 (172.105.158.248): 56 data bytes                          │
Request timeout for icmp_seq 0                                                 │
Request timeout for icmp_seq 1                                                 │
Request timeout for icmp_seq 2                                                 │
Request timeout for icmp_seq 3                                                 │
Request timeout for icmp_seq 4                                                 │
Request timeout for icmp_seq 5

melekes · 2023-02-06T07:45:33Z

I see both nodes find each other but they could not finalize any blocks.

are there any reasons you could see in the log as to why?

github-actions bot added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Nov 14, 2022

This comment was marked as off-topic.

Sign in to view

This comment was marked as outdated.

Sign in to view

This comment was marked as duplicate.

Sign in to view

This comment was marked as resolved.

Sign in to view

vivekvpandya mentioned this issue Dec 7, 2022

Need help debugging potential libp2p issue libp2p/rust-libp2p#3205

Closed

melekes changed the title ~~Client side connection issues from polkadot v0.9.23 to v0.9.26~~ Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 Jan 18, 2023

melekes self-assigned this Jan 18, 2023

melekes mentioned this issue Jan 18, 2023

Fix client connection issue introduced after upgrade to polkadot-v0.9.26. zeitgeistpm/zeitgeist#883

Closed

Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 #12704

Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 #12704

Comments

Chralt98 commented Nov 14, 2022 • edited Loading

bkchr commented Nov 14, 2022

altonen commented Nov 15, 2022

This comment was marked as off-topic.

This comment was marked as off-topic.

Chralt98 commented Nov 16, 2022 • edited Loading

This comment was marked as outdated.

This comment was marked as duplicate.

Chralt98 commented Nov 16, 2022 • edited Loading

Chralt98 commented Nov 16, 2022

Chralt98 commented Nov 16, 2022

altonen commented Nov 16, 2022 • edited Loading

Chralt98 commented Nov 17, 2022 • edited Loading

altonen commented Nov 17, 2022

This comment was marked as resolved.

Chralt98 commented Nov 17, 2022

Chralt98 commented Nov 18, 2022 • edited Loading

vivekvpandya commented Dec 1, 2022

vivekvpandya commented Jan 17, 2023

bkchr commented Jan 17, 2023

melekes commented Jan 18, 2023

melekes commented Jan 18, 2023 • edited Loading

vivekvpandya commented Jan 18, 2023

vivekvpandya commented Jan 18, 2023

melekes commented Jan 18, 2023 • edited Loading

melekes commented Jan 18, 2023 • edited Loading

Chralt98 commented Jan 18, 2023 • edited Loading

vivekvpandya commented Jan 18, 2023

melekes commented Jan 18, 2023 • edited Loading

vivekvpandya commented Jan 18, 2023

melekes commented Jan 18, 2023

vivekvpandya commented Jan 18, 2023 • edited Loading

bkchr commented Jan 19, 2023

vivekvpandya commented Jan 20, 2023 • edited Loading

melekes commented Jan 26, 2023

vivekvpandya commented Jan 26, 2023

melekes commented Jan 26, 2023

vivekvpandya commented Jan 26, 2023

melekes commented Jan 26, 2023

vivekvpandya commented Jan 26, 2023

vivekvpandya commented Jan 26, 2023

dmitry-markin commented Jan 26, 2023

vivekvpandya commented Jan 27, 2023

vivekvpandya commented Jan 27, 2023 • edited Loading

melekes commented Jan 30, 2023

vivekvpandya commented Jan 30, 2023

melekes commented Feb 6, 2023

melekes commented Feb 6, 2023

Chralt98 commented Nov 14, 2022 •

edited

Loading

Chralt98 commented Nov 16, 2022 •

edited

Loading

Chralt98 commented Nov 16, 2022 •

edited

Loading

altonen commented Nov 16, 2022 •

edited

Loading

Chralt98 commented Nov 17, 2022 •

edited

Loading

Chralt98 commented Nov 18, 2022 •

edited

Loading

melekes commented Jan 18, 2023 •

edited

Loading

melekes commented Jan 18, 2023 •

edited

Loading

melekes commented Jan 18, 2023 •

edited

Loading

Chralt98 commented Jan 18, 2023 •

edited

Loading

melekes commented Jan 18, 2023 •

edited

Loading

vivekvpandya commented Jan 18, 2023 •

edited

Loading

vivekvpandya commented Jan 20, 2023 •

edited

Loading

vivekvpandya commented Jan 27, 2023 •

edited

Loading