Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 #12704

Open
Chralt98 opened this issue Nov 14, 2022 · 47 comments
Open
Assignees
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.

Comments

@Chralt98
Copy link

Chralt98 commented Nov 14, 2022

2022-11-14 16:54:27.169  INFO tokio-runtime-worker sub-libp2p: [Relaychain] 🔍 Discovered new external address for our node: /ip4/...  
2022-11-14 16:54:27.248  INFO tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] 🔍 Discovered new external address for our node: /ip4/...

The above is shown in the 0.9.23 version. This does not happen after the upgrade to 0.9.26. It's worth to mention, that this messages informs us, that there was a new external address found for ip4. After that message, the 0.9.23 version did found peers and sync with them.

We experience client side connection issues after the dependency update to polkadot-v0.9.26 in this PR. We would like to see, that the local node successfully synchronises with other nodes in the network. This correctly happens with the polkadot-v0.9.23 version. I need to mention, that we used Moonbeam dependencies of Substrate and a couple of others before. Now we use the official dependencies.

When I started the node for 0.9.23 (which does sync and found peers) and after that the 0.9.26 runtime, the updated version did sync too. But when the 0.9.26 run alone, it did not sync with the outer parachain nodes.

The following logs are noticeable with the 0.9.26 upgrade:

RUST_LOG=libp2p=trace ./target/debug/zeitgeist --chain=battery_station --tmp --detailed-log-output
2022-11-14 17:16:08.423 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Ignoring self-reported non-global address /ip4/127.0.0.1/tcp/30334/ws from 12D3KooWEbhud9TrHYtzK2K1fpKeC4PjDu7bKnHTn8QYzyrfBJsK.    
2022-11-14 17:16:08.424 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Ignoring self-reported non-global address /ip4/192.168.1.3/tcp/30334/ws from 12D3KooWEbhud9TrHYtzK2K1fpKeC4PjDu7bKnHTn8QYzyrfBJsK.
...

2022-11-14 17:17:30.895 TRACE tokio-runtime-worker sub-libp2p: [Relaychain] Libp2p => Failed to reach PeerId("12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bsr.zeitgeist.pm/tcp/30338/p2p/12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30338/ws/p2p/12D3KooWRuwKV1yt6fqPL6hsfPvJQ28pqmmTxNSDxEC4eLKagzjK", Other(Custom { kind: Other, error: Timeout }))].    

...

2022-11-14 17:23:01.013 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Libp2p => Failed to reach PeerId("12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt"): An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bsr.zeitgeist.pm/tcp/30337/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30337/ws/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout }))].
...

2022-11-14 17:22:55.965 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] PSM => Connect(12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt, SetId(0)): Will start to connect at until Instant { t: 1910374941204 }  

...

2022-11-14 17:10:14.498 DEBUG tokio-runtime-worker libp2p_swarm: [🌔 Zeitgeist Parachain] Connection attempt to PeerId("12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt") failed with Transport([("/dns/bsr.zeitgeist.pm/tcp/30337/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout })), ("/dns/bsr.zeitgeist.pm/tcp/30337/ws/p2p/12D3KooWSBj4SXAz1ETTurW5PRAF1abyxXb9APAVs4vqBVr2NjRt", Other(Custom { kind: Other, error: Timeout }))]).
@github-actions github-actions bot added the J2-unconfirmed Issue might be valid, but it’s not yet known. label Nov 14, 2022
@bkchr
Copy link
Member

bkchr commented Nov 14, 2022

CC @altonen @dmitry-markin

@altonen
Copy link
Contributor

altonen commented Nov 15, 2022

Do you have any logs you could share from the bootnodes when a polkadot v0.9.26 peer tries to connect? Based on this alone it's hard to say why the connection is rejected.

@vivekvpandya

This comment was marked as off-topic.

@vivekvpandya

This comment was marked as off-topic.

@Chralt98
Copy link
Author

Chralt98 commented Nov 16, 2022

I investigated more time into it. I don't have the bootnode logs for you, but I have the following information.

Look at your test here. On the 0.9.26 version we never get the case, that Adding self-reported address ... to Kademlia DHT /battery_station_v2/kad happens. So I assume, that the new 0.9.26 protocol is unsupported.

/// Add a self-reported address of a remote peer to the k-buckets of the supported
/// DHTs (`supported_protocols`).
///
/// **Note**: It is important that you call this method. The discovery mechanism will not
/// automatically add connecting peers to the Kademlia k-buckets.
pub fn add_self_reported_address(
&mut self,
peer_id: &PeerId,
supported_protocols: impl Iterator<Item = impl AsRef<[u8]>>,
addr: Multiaddr,
) {
if !self.allow_non_globals_in_dht && !self.can_add_to_dht(&addr) {
trace!(target: "sub-libp2p", "Ignoring self-reported non-global address {} from {}.", addr, peer_id);
return
}
let mut added = false;
for protocol in supported_protocols {
for kademlia in self.kademlias.values_mut() {
if protocol.as_ref() == kademlia.protocol_name() {
trace!(
target: "sub-libp2p",
"Adding self-reported address {} from {} to Kademlia DHT {}.",
addr, peer_id, String::from_utf8_lossy(kademlia.protocol_name()),
);
kademlia.add_address(peer_id, addr.clone());
added = true;
}
}
}
if !added {
trace!(
target: "sub-libp2p",
"Ignoring self-reported address {} from {} as remote node is not part of any \
Kademlia DHTs supported by the local node.", addr, peer_id,
);
}
}

We are only getting as remote node is not part of any Kademlia DHTs supported by the local node. or Ignoring self-reported non-global address for the 0.9.26 node.

For the supported protocols we have the following information.

What is interesting here, is that we identify with the Zeitgeist Parachain two types of peers:

First type:

2022-11-16 11:41:20.545 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Identified PeerId("12D3KooWDZq9BhTwew7Xp4mtwaBeN5s2TCm8TYTAqPo8YStFrgkv") => IdentifyInfo { public_key: Ed25519(PublicKey(compressed): 37b588bf639c3ebf5529dd2c6264e47c13315bdeca471066b52de8f014feee97), protocol_version: "/substrate/1.0", agent_version: "Zeitgeist Node/v0.3.6-882feb9d51d (whimsical-trucks-0112)", listen_addrs: ["/ip4/127.0.0.1/tcp/30334/ws", "/ip4/192.168.1.3/tcp/30334/ws", "/ip6/::1/tcp/30334/ws", "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/30334/ws", "/ip6/2a02:3032:e:20aa:534:acf1:ad35:6943/tcp/30334/ws"], protocols: ["/battery_station_relay_v3/block-announces/1", "/battery_station_relay_v3/transactions/1", "/fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1", "/paritytech/grandpa/1", "/polkadot/validation/1", "/polkadot/collation/1", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/battery_station_relay_v3/kad", "/battery_station_relay_v3/sync/2", "/polkadot/req_statement/1", "/polkadot/req_available_data/1", "/battery_station_relay_v3/state/2", "/polkadot/req_chunk/1", "/polkadot/req_pov/1", "/polkadot/send_dispute/1", "/polkadot/req_collation/1", "/battery_station_relay_v3/sync/warp", "/battery_station_relay_v3/light/2"], observed_addr: "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/51415/ws" }    

Second type:

2022-11-16 11:43:15.300 TRACE tokio-runtime-worker sub-libp2p: [🌔 Zeitgeist Parachain] Identified PeerId("12D3KooWPaKTnZxZToLacZ5NR2myRmrHNCAgrsuaSuRYZrtJ8Tim") => IdentifyInfo { public_key: Ed25519(PublicKey(compressed): cc6a6cfed61859aebb1fefbdcef4a81bc6b1aaf564bffe7b26464dd7ea4ebc66), protocol_version: "/substrate/1.0", agent_version: "Zeitgeist Node/v0.3.6-31a123a84eb (abusive-mitten-3176)", listen_addrs: [], protocols: ["/battery_station_v2/block-announces/1", "/battery_station_v2/transactions/1", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0", "/ipfs/id/push/1.0.0", "/battery_station_v2/kad", "/battery_station_v2/light/2", "/battery_station_v2/sync/2", "/battery_station_v2/state/2"], observed_addr: "/ip6/2a02:3032:e:20aa:c91:d30b:90b5:c845/tcp/30333/ws/p2p/12D3KooWFuTZRftstqoDWF4SR69jswSFhvLMhuP1aBSGM9PXUWsq" }  

The second one has the matching protocols that we need, but it does not have listen_addrs (So this is empty).

Why could listen_addrs be empty for a peer?

In addition to that I get Random Kademlia query has yielded empty results on the 0.9.26 version.

@vivekvpandya

This comment was marked as outdated.

@vivekvpandya

This comment was marked as duplicate.

@Chralt98
Copy link
Author

Chralt98 commented Nov 16, 2022

Yeah the protocol negotiation for the Zeitgeist protocols is not happening on the new version. Especially the protocol kad is not negotiated correctly.

@Chralt98
Copy link
Author

INFO tokio-runtime-worker beefy: [Relaychain] 🥩 run BEEFY worker, best grandpa: #0.

This above log happens on the old version, but not on the new one.

@Chralt98
Copy link
Author

Maybe it has something to do with updating libp2p here and here for polkadot-v0.9.26?

Thanks so much for your support.

@altonen
Copy link
Contributor

altonen commented Nov 16, 2022

We did change some DHT-related stuff a while back (such as removing support for multiple DHTs) but I'm not convinced this issue is related to Kademlia. The nodes are discovered through some means and an attempt at establishing a connection is made to both but they timeout so I think the question here is why do they timeout.

If you don't have access to these bootnodes' logs, are you able to reproduce this issue by running the nodes locally and provide us with logs from both the "local bootnode" and local polkadot v0.9.26 node?

@Chralt98
Copy link
Author

Chralt98 commented Nov 17, 2022

@altonen Thanks for your quick answer. I currently don't know how to start a local bootnode for v0.9.23 to simulate the behaviour to connect a v0.9.23 bootnode with a new v0.9.26 node and then give you the log output of the v0.9.23 node. Can you please help me out there? Is it this tutorial ?

Apart from that I think this issue seems related to our problem. Maybe it's better to use a newer polkadot version indeed. (I also experience the log WARN async-std/runtime trust_dns_proto::xfer::dns_exchange: failed to associate send_message response to the sender)

@altonen
Copy link
Contributor

altonen commented Nov 17, 2022

Yeah, the article you linked should allow you to setup a local network of nodes and test whether this issue is reproducible and if so, see what happens on bootnode side from the logs.

That error message may be related to some response channel getting dropped due to request timeout and may be related to this error you've posted earlier.

... An error occurred while negotiating the transport protocol(s) on a connection: [("/dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR", Other(Custom { kind: Other, error: Timeout })),

@Chralt98

This comment was marked as resolved.

@Chralt98
Copy link
Author

What I also found interesting was the following:

On the 0.9.26 side parachain node:

2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/beefy/1
2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: confirming protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1    
2022-11-17 12:33:05.965 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: sent confirmed protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/grandpa/1

...

2022-11-17 12:34:03.203 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /paritytech/beefy/1  

...

2022-11-17 12:34:03.437 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/beefy/1 

On the 0.9.23 side bootnode:

2022-11-17 12:32:36.257 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/beefy/1    
2022-11-17 12:32:36.257 DEBUG tokio-runtime-worker multistream_select::listener_select: [Relaychain] Listener: rejecting protocol: /0xfc54…56fc/grandpa/1

...

2022-11-17 12:32:37.233 DEBUG tokio-runtime-worker multistream_select::dialer_select: [Relaychain] Dialer: Received rejection of protocol: /fc54ce994a915e03f2b2b58206a8939b1c4e5fc6e58810517b4d09858fde56fc/beefy/1

@Chralt98
Copy link
Author

Chralt98 commented Nov 18, 2022

After 1,5 hours the 0.9.26 node finally syncs with other 0.9.23 nodes on the network (even without running a local 0.9.23 node). Do you know, why couldn't we connect earlier? Thanks a lot for your answers here.

EDIT: I tried the same setup and after two and a half hours I couldn't connect. So unfortunately the issue remains.

@vivekvpandya
Copy link
Contributor

Just to note this issue still continues with upgrade to 0.9.29 .

@vivekvpandya
Copy link
Contributor

@bkchr I want to investigate on following

But the thing is for relay chain it shows ` [Relaychain] 🔍 Discovered new external address for our node: /ip4/3.xx.xx.53/tcp/7676/ws/p2p/12D3KooWLhkaoFSGV2JfXea1BGTEAv2R5PE4v2b1nHZMo1ZGtxxx ` And but for parachain it shows nothing like this

I see that in client/network/src/discovery.rs it never gets to execute inject_new_external_addr()
Can someone from substrate team help me if ip4 interface gets created or not for para-chain on local node?

@bkchr
Copy link
Member

bkchr commented Jan 17, 2023

Cc @melekes @dmitry-markin can you help here please?

@melekes melekes changed the title Client side connection issues from polkadot v0.9.23 to v0.9.26 Client side connection issues after upgrading Polkadot v0.9.23 to v0.9.26 Jan 18, 2023
@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

Why could listen_addrs be empty for a peer?

I'd guess if listen_addresses in network config are empty. Otherwise, look for Can't listen on XXX because warning in logs.

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

debug_logs.zip

Thanks @altonen , that you look into it.

In debug_0.9.26_node.txt:

2022-11-17 13:28:11.310  WARN main sub-libp2p: [Relaychain] Can't listen on /ip4/0.0.0.0/tcp/30334/ws because: Other(Custom { kind: Other, error: Other(A(A(A(
A(Transport(A(Transport(Os { code: 48, kind: AddrInUse, message: "Address already in use" })))))))) })

This means another process (instance of polkadot) is already listening on 30334. Were you trying to spawn two processes on the same machine?

@vivekvpandya
Copy link
Contributor

I will get you a fresh logs, as Relaychain anyway works fine. So above error seems the case as you pointed, tried to spawn two nodes on same machine.

@melekes melekes self-assigned this Jan 18, 2023
@vivekvpandya
Copy link
Contributor

@melekes
so I am getting following things (changed logs slightly to help myself
)

2023-01-18 13:02:46 [Relaychain] Listen Address: "/ip6/::/tcp/30334/ws"    
2023-01-18 13:02:46 [Relaychain] Listen Address: "/ip4/0.0.0.0/tcp/30334/ws"  
..
38:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] 🏷  Local node identity is: 12D3KooWRpKYJKRDhAXMMrqZWMRBHLUmn92dGppkAG1tfBW3oY6w    
39:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] Listen Address: "/ip6/::/tcp/30333"    
40:2023-01-18 12:56:04 [🌔 Zeitgeist Parachain] Listen Address: "/ip4/0.0.0.0/tcp/30333"   

However I don't see Can't Listen warning
here is changed part of code

     6         // Listen on multiaddresses.                                                                                                                                                                       
     5         for addr in &params.network_config.listen_addresses {                                                                                                                                              
+    4             info!(target: "sub-libp2p", "Listen Address: {:?}", addr);                                                                                                                                     
     3             if let Err(err) = Swarm::<Behaviour<B, Client>>::listen_on(&mut swarm, addr.clone()) {                                                                                                         ~    2                 info!(target: "sub-libp2p", "Can't listen on {} because: {:?}", addr, err)                                                                                                                 
     1             }                                                                                                                                                                                              
  435          }  

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

However I don't see Can't Listen warning

That's good, right?

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

But the thing is for relay chain it shows [Relaychain] 🔍 Discovered new external address for our node: /ip4/3.xx.xx.53/tcp/7676/ws/p2p/12D3KooWLhkaoFSGV2JfXea1BGTEAv2R5PE4v2b1nHZMo1ZGtxxx And but for parachain it shows nothing like this

Could you please post debug logs of the parachain node here?

@Chralt98
Copy link
Author

Chralt98 commented Jan 18, 2023

@bkchr libp2p/rust-libp2p#2441 (review) your review comment is the cause of the problem here? related to this libp2p/rust-libp2p#3205 (comment)

@vivekvpandya
Copy link
Contributor

log_parachain.txt
@melekes let me know if you need more detailed log or any specific thing you were looking for is not included.

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.


To summarise the original issue: https://github.com/zeitgeistpm/zeitgeist parachain node can't connect to the two bootnodes due to timeout.

@vivekvpandya
Copy link
Contributor

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.

To summarise the original issue: after upgrading from Polkadot v0.9.23 to v0.9.26 https://github.com/zeitgeistpm/zeitgeist parachain node no longer can connect to the two bootnodes due to timeout.

I had get those logs once, but could not see anything which is concerning. Do you have specific thing that should be logged there?
then I can try restarting those nodes with certain flags to enable debugging and get logs.

@melekes
Copy link
Contributor

melekes commented Jan 18, 2023

Do you happen to have logs of /dns/bootnode-0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR or /dns/bootnode-1.zeitgeist.pm/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqc ndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7? Looks like we're failing to connect to both due to timeout.
To summarise the original issue: after upgrading from Polkadot v0.9.23 to v0.9.26 https://github.com/zeitgeistpm/zeitgeist parachain node no longer can connect to the two bootnodes due to timeout.

I had get those logs once, but could not see anything which is concerning. Do you have specific thing that should be logged there? then I can try restarting those nodes with certain flags to enable debugging and get logs.

Well, for starters, do they receive a connection request from 12D3KooWMj1VrY7319iTfy7qhKxdSz3HEWEC2bvuNyNYdAGEQGdt (parachain node)? If yes, why it takes them longer than 20s to respond. If no, we'd have to go down to the networking layer (using Wire) and check that a) parachain node sends connection request b) bootnodes receive it.

@vivekvpandya
Copy link
Contributor

vivekvpandya commented Jan 18, 2023

@melekes I spawned two nodes on two different systems which are on same network via router.
They have same code built from main branch, however they are not able to find each other as peer.
The command I used:
./target/debug/zeitgeist --dev --discover-local

@melekes also are you on Discord or any other IM service? Then we can work together faster.
I am online on discord with user id simple_user#7550

@bkchr
Copy link
Member

bkchr commented Jan 19, 2023

@melekes I spawned two nodes on two different systems which are on same network via router.
They have same code built from main branch, however they are not able to find each other as peer.
The command I used:
./target/debug/zeitgeist --dev --discover-local

If the nodes running on two different systems, please specify one node as bootnode of the other node. We have mdns, but not as an ultra polished feature because you don't need this in real networks.

@vivekvpandya
Copy link
Contributor

vivekvpandya commented Jan 20, 2023

So I did a small experiment with local network where 192.168.29.90 has node running version v0.9.23 (which sync fine with bootnodes)
and other system 192.168.29.161 I run node (main branch which has v0.9.26) with following two different commands

  1. working case
    ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

  2. not working case
    ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

attached logs for both.
failing_para_log.txt
working_para_log.txt

@melekes
Copy link
Contributor

melekes commented Jan 26, 2023

/ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
/ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

why did you add /ws protocol in the first case and did not in the second case?

@vivekvpandya
Copy link
Contributor

/ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
/ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

why did you add /ws protocol in the first case and did not in the second case?

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

@melekes
Copy link
Contributor

melekes commented Jan 26, 2023

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

So when you remove /ws from addr, it does not work w/ 0.9.26 (master) BUT works with 0.9.23
with /ws it works in both cases

correct?

@vivekvpandya
Copy link
Contributor

That is how I can get a pass and failing case.
by default there is not /ws added, but when I add it manually it works like it used to work before.

So when you remove /ws from addr, it does not work w/ 0.9.26 (master) BUT works with 0.9.23 with /ws it works in both cases

correct?

Correct but I have never tried/ws with 0.9.23 as it works without it.

@melekes
Copy link
Contributor

melekes commented Jan 26, 2023

I've just tried to run zeitgeist with both 0.9.23 and 0.9.26 using the following cmd: ./target/debug/zeitgeist -l sub-libp2p=trace --bootnodes /dns/bootnode0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR

In both cases I get timeout errors.

@vivekvpandya
Copy link
Contributor

I've just tried to run zeitgeist with both 0.9.23 and 0.9.26 using the following cmd: ./target/debug/zeitgeist -l sub-libp2p=trace --bootnodes /dns/bootnode0.zeitgeist.pm/tcp/30333/p2p/12D3KooWNpDxYCicWvNrR18QvkkAgVvBdsbh2woHfUGmKijj59aR

In both cases I get timeout errors.

Yes my experiment was on local network based setup. With dns based bootnodes even /ws tric does not get working case

@vivekvpandya
Copy link
Contributor

On LAN a node at 192.168.29.90 runs zeitgiest node with polkadot version 0.9.23 (which last best working)
Now from other computer I start a node with following commands
For not working case
./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/30333/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNg -l sub-libp2p=trace

Logs are at for not working case: https://drive.google.com/file/d/1xdvskcLN1h9W4ixu27M58AnCPGQcDu7H/view?usp=sharing

./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/30333/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNg -l sub-libp2p=trace
Logs for working case: https://drive.google.com/file/d/13KRhsejykotXAwjxtMrjqEv73ik-8sZm/view?usp=sharing

The .zip file for logs also contain logs from node on 192.168.29.90

@dmitry-markin
Copy link
Contributor

1. working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
2. not working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

This is expected behavior, because the node listens on /ws endpoint by default, and on bare tcp only when run with --dev or --validator. See the command line option help and the code.

@vivekvpandya
Copy link
Contributor

1. working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/ws/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg
2. not working case
   ./target/debug/zeitgeist --bootnodes /ip4/192.168.29.90/tcp/3033/p2p/12D3KooWNH2kJ6j61mVnpbVW1Tsyt7tNmaRqECaqbCKCPDhLYeNgg

This is expected behavior, because the node listens on /ws endpoint by default, and on bare tcp only when run with --dev or --validator. See the command line option help and the code.

Agreed but with our public bootndes I am not able to connect with /ws , or using --dev and not using /ws in bootnode address.

@vivekvpandya
Copy link
Contributor

vivekvpandya commented Jan 27, 2023

I have captured packets for following commands
here 172.105.158.248 is external address for bootnode1-zeitgeist.pm
./target/debug/zeitgeist --bootnodes /ip4/172.105.158.248/tcp/30333/ws/p2p/12D3KooWGLidEdRqggqcndA3Nsf6KcL3HbW7BdSGww6QNDkNnYV7

And a working case where I am running node on LAN address 192.168.29.90

packet_capture

I see RST is being send to 172.105.158.248 and connection breaks and this goes in loop. However in working case I see RST is never sent to 192.168.29.90 and it successfully works.
Not sure reason behind RST, is local settings or something from code?
Instead of sending RST it should have sent HTTP/ GET. (as seen in working case)
Here are wireshare packet_captures https://drive.google.com/file/d/1t3_-q0RHXeeLQyeE6-VQOCpcfTuNi_Vp/view?usp=sharing

@melekes
Copy link
Contributor

melekes commented Jan 30, 2023

So in the failing case we're terminating conn with a timeout (and send RST). In the working case we're successfully establishing the connection. Hmm 🤔

@vivekvpandya
Copy link
Contributor

Reverse Experiment

on 192.168.29.90 I started node build with TOT main branch.

And from other machine 192.168.29.161 I start old working version (0.3.6) of zeitgeist node with following command:

~/Downloads/zeitgeist_parachain --reserved-only --reserved-nodes /ip4/192.168.29.90/tcp/30333/ws/p2p/12D3KooWA9wshwxeNyvPdRN9Y2QnaPQNkprbQLGFbSTe3XypqJek

I see both nodes find each other but they could not finalize any blocks.
For both nodes I see following log
2023-01-30 16:32:32 [🔮 Zeitgeist Parachain] ⚙️ Syncing 139.9 bps, target=#325423 (1 peers), best: #45977 (0xb31f…3a66), finalized #0 (0x1bf2…e060), ⬇ 437.2kiB/s ⬆ 0.6kiB/s

@melekes
Copy link
Contributor

melekes commented Feb 6, 2023

@vivekvpandya are you sure your bootnode is reachable? for example I can't ping it

PING 172.105.158.248 (172.105.158.248): 56 data bytes                          │
Request timeout for icmp_seq 0                                                 │
Request timeout for icmp_seq 1                                                 │
Request timeout for icmp_seq 2                                                 │
Request timeout for icmp_seq 3                                                 │
Request timeout for icmp_seq 4                                                 │
Request timeout for icmp_seq 5

@melekes
Copy link
Contributor

melekes commented Feb 6, 2023

I see both nodes find each other but they could not finalize any blocks.

are there any reasons you could see in the log as to why?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
J2-unconfirmed Issue might be valid, but it’s not yet known.
Projects
None yet
Development

No branches or pull requests

6 participants