Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

sc_network::service::tests::notifications_back_pressure is flaky #6766

Closed
tomaka opened this issue Jul 30, 2020 · 6 comments · Fixed by #6923
Closed

sc_network::service::tests::notifications_back_pressure is flaky #6766

tomaka opened this issue Jul 30, 2020 · 6 comments · Fixed by #6923

Comments

@tomaka
Copy link
Contributor

tomaka commented Jul 30, 2020

This test seems to fail on CI.

 thread 'service::tests::notifications_back_pressure' panicked at 'assertion failed: `(left == right)`
  left: `b"hello paritytech/substrate#9"`,
 right: `"hello #0"`', client/network/src/service/tests.rs:367:25

It seems that notifications are lost. My guess is that the CPU is busy and the test gets put on hold for more than 30 seconds, which causes the connection to get killed for inactivity.

@tomaka
Copy link
Contributor Author

tomaka commented Jul 30, 2020

I don't know how to debug this, as I can't manage to make it fail locally.

I suspect that it's related to this TODO:

// TODO: Notifications might have been lost as a result of the previous
// connection being dropped, and as a result it would be preferable to notify
// the users of this fact by simulating the substream being closed then
// reopened.
// The code below doesn't compile because `role` is unknown. Propagating the
// handshake of the secondary connections is quite an invasive change and
// would conflict with https://github.com/paritytech/substrate/issues/6403.
// Considering that dropping notifications is generally regarded as
// acceptable, this bug is at the moment intentionally left there and is
// intended to be fixed at the same time as
// https://github.com/paritytech/substrate/issues/6403.

It's also not certain whether the test itself is correct, because of paritytech/polkadot-sdk#552

@romanb
Copy link
Contributor

romanb commented Jul 31, 2020

Could this not also be related to the fact discussed here that notifications may start out being sent over the legacy substream due to timing? That should also imply a potential loss of ordering for the receiver, as there is none across different substreams, each having their own buffers and being polled independently. An indicator for that may also be if the message numbers in the failed assertion are always low, since such reordering for the receiver should only occur at the beginning, as later on all notifications are sent over the same substream.

@tomaka
Copy link
Contributor Author

tomaka commented Jul 31, 2020

That's indeed more likely, I didn't think of this.

Then hopefully #5670 should fix that.

@tomaka
Copy link
Contributor Author

tomaka commented Aug 5, 2020

#6826 fixes a problem we were discarding notifications rather than sending them on the legacy substream. This might fix the test as well.

@romanb
Copy link
Contributor

romanb commented Aug 18, 2020

Is this still an issue?

@tomaka
Copy link
Contributor Author

tomaka commented Aug 18, 2020

#6821 fixes this issue, but introduces another test failure which I haven't investigated yet.

This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants