Add changes to fix issues running the Spacewalk client in Kubernetes #519

ebma · 2024-05-02T15:04:04Z

Removes the backoff crate. On the assumption that the retry() function introduces some side effects, the retry logic is now implemented manually. The existing RETRY_TIMEOUT and RETRY_INTERVAL constants keep their values, but they are now only used to derive the number of retries.
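
To illustrate how a retry count can be derived from the two constants, here is a minimal sketch. The constant values and the names `max_attempts` and `retry_fixed` are assumptions made for this example; they are not the values or function names used in the runner.

```rust
use std::thread::sleep;
use std::time::Duration;

// Assumed values for illustration only; the real constants live in the runner.
const RETRY_TIMEOUT: Duration = Duration::from_millis(60_000);
const RETRY_INTERVAL: Duration = Duration::from_millis(1_000);

/// Number of attempts that fit into `timeout` when waiting `interval`
/// between them (always at least one).
fn max_attempts(timeout: Duration, interval: Duration) -> u32 {
    let interval_ms = interval.as_millis().max(1);
    ((timeout.as_millis() / interval_ms) as u32).max(1)
}

/// Calls `op` until it succeeds or `attempts` calls have failed, sleeping
/// `interval` after each failed call except the last. Returns the last error.
fn retry_fixed<T, E, F>(attempts: u32, interval: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
    E: std::fmt::Display,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Budget used up: hand the error to the caller instead of looping.
            Err(err) if attempt >= attempts => return Err(err),
            Err(err) => {
                eprintln!("attempt {attempt}/{attempts} failed: {err}. Retrying...");
                sleep(interval);
                attempt += 1;
            }
        }
    }
}
```

A caller would pass `max_attempts(RETRY_TIMEOUT, RETRY_INTERVAL)` as the first argument, so the total waiting time stays close to the old timeout without any timer bookkeeping.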

Retrying for Subxt RPC error due to closed connection

Sometimes, the runner encounters the following error:

[2024-05-07T13:56:14Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...
[2024-05-07T13:56:15Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

I found this PR which adds a new experimental implementation of an RPC client that automatically reconnects. This implementation is only available in subxt v0.35 or later. I tried bumping the subxt dependencies we use in Spacewalk to that version but I encountered conflicts because our Polkadot dependencies are too outdated.
-> I created #521 as a follow-up and we'll ignore this issue for now.

Related to https://github.com/pendulum-chain/tasks/issues/207.

ebma · 2024-05-07T17:27:59Z

@pendulum-chain/devs this is now ready for review

gianfra-t

Looks good to me! An even simpler and more transparent solution than the previous retry.

Regarding the error due to the closed connection, could we retry with a new connection after a failure here by simply creating a new client instance with Ok(OnlineClient::from_url(url).await?)? Similar to what we do in the testing service in node.

Of course this would be a less robust solution than the native subxt solution.

ebma · 2024-05-08T09:24:05Z

> Regarding the error due to the closed connection, could we retry with a new connection after a failure here by simply creating a new client instance with Ok(OnlineClient::from_url(url).await?)?

Good point. I looked into this again and realized that this logic was more or less in place before: in this loop, the runner is restarted if try_get_release() returns an error. The problem was that my changes were missing the maximum retry timeout that the backoff crate provides (max_elapsed_time). With the manual implementation we can only specify the number of retries rather than a maximum duration, but I now derive that number here so the result is similar. With this, the retry_with_log() functions return an error in time, which is either caught inside the loop I linked to above or, if it happens elsewhere, makes the runner stop entirely. Once stopped, they can be automatically restarted in the infrastructure.

gianfra-t · 2024-05-08T11:23:48Z

> Once stopped, they can be automatically restarted in the infrastructure.

Okay understood, I didn't consider this behavior, makes sense!

bogdanS98

Great changes! 👍🏼

ebma · 2024-05-13T08:48:12Z

Over the weekend, the runner client again encountered the error.

[2024-05-13T08:43:43Z INFO  runner::runner] Error reading chain storage for release: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

Because I slightly changed the phrasing of the log messages, I was able to pin it down to this line, which means that the runner was only 'hanging' in this loop statement. I then noticed that the call to maybe_restart_client doesn't do anything in this case because the child process (the actual vault client binary) is still running and working fine. I tested it: it is still able to process issue and redeem requests, and the RPC connection of the vault client also works. My assumption is that the client's RPC connection is fine because of its periodic restart every 3 hours, which keeps the connection fresh, whereas the runner never restarts or refreshes its RPC connection. That's why I added some logic that simply tries to create a new RPC client in the runner.
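
A rough sketch of this reconnect idea, written independently of subxt: the client type is left generic, and `read_with_reconnect`, `read` and `connect` are hypothetical names chosen for this example. In the runner, `connect` would presumably wrap something like OnlineClient::from_url(url), which is not shown here, and a delay between attempts is omitted for brevity.

```rust
/// Tries `read` on the current client. After a failed read the client is
/// replaced with a fresh one from `connect` (if that succeeds) before the
/// next attempt. Gives up after `max_attempts` failed reads and returns
/// the last read error.
fn read_with_reconnect<C, T, E>(
    client: &mut C,
    max_attempts: u32,
    mut read: impl FnMut(&mut C) -> Result<T, E>,
    mut connect: impl FnMut() -> Result<C, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match read(&mut *client) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                // The old connection may be dead: swap in a new one if it
                // can be opened. If reconnecting fails as well, keep the old
                // client and simply try again in the next round.
                if let Ok(fresh) = connect() {
                    *client = fresh;
                }
                attempt += 1;
            }
        }
    }
}
```

Because the client is passed as `&mut C`, the caller keeps using the replaced connection for all later calls, which is the point of the change: the runner no longer depends on a restart to get a fresh connection.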

ebma · 2024-05-15T08:20:41Z

Something was off with the iterator returned by the exponential-backoff crate and it kept on returning values. I removed that crate completely and now we do everything manually. I also added some small tests for the retry_with_log_... functions so that we can now be sure that they don't retry more often than they should.
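
One way to make "never retries more often than it should" checkable is to drive the retry loop from a delay schedule that is finite by construction. The sketch below is an illustration under that assumption, not the code from this PR; `bounded_delays` and `retry_with_schedule` are hypothetical names, and the doubling schedule is only one possible choice.

```rust
use std::time::Duration;

/// A delay schedule that starts at `initial`, doubles each step, is capped at
/// `max_delay`, and ends after `max_retries` items.
fn bounded_delays(
    initial: Duration,
    max_delay: Duration,
    max_retries: usize,
) -> impl Iterator<Item = Duration> {
    std::iter::successors(Some(initial.min(max_delay)), move |prev| {
        Some((*prev * 2).min(max_delay))
    })
    .take(max_retries)
}

/// Runs `op` once, then once more per scheduled delay while it keeps failing.
/// When the schedule is exhausted, the last error is returned.
fn retry_with_schedule<T, E>(
    mut delays: impl Iterator<Item = Duration>,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) => match delays.next() {
                Some(delay) => std::thread::sleep(delay),
                // No delays left: stop instead of retrying forever.
                None => return Err(err),
            },
        }
    }
}
```

A test can then count the calls made by an always-failing operation and assert that the count equals `max_retries + 1`.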

b-yap · 2024-05-15T10:47:39Z

Cargo.toml

+
+# We need to patch this to https://github.com/tkaitchuck/aHash/releases/tag/v0.8.11 to prevent a build error
+# 'error[E0635]: unknown feature `stdsimd`' that occurs because this feature was removed in the latest nightly versions
+ahash = { git = "https://github.com/tkaitchuck/aHash", rev = "db36e4c4f0606b786bc617eefaffbe4ae9100762" }


b-yap · 2024-05-15

I also experienced this, that's why I moved up to nightly-2024-04-18. 😔

Pendulum has 2 ahash dependencies:
https://github.com/pendulum-chain/pendulum/pull/463/files#diff-13ee4b2252c9e516a0547f2891aa2105c3ca71c6d7a1e682c69be97998dfc87eR150-R172

ebma · 2024-05-15

Do you think nightly-2024-04-18 would work in Spacewalk too? Or was there another issue? I don't remember.

b-yap · 2024-05-15

I have tried to test with +nightly and it was fine.

ebma · 2024-05-16

Should we then define nightly-2024-04-18 in ~~the rust-toolchain file~~ all references (README/GitHub Actions) and remove this patch for ahash from the Cargo.toml file? What do you think, @b-yap?

ebma · 2024-05-21

@b-yap any thoughts on this?

b-yap · 2024-05-22

Hmm, I think the patch isn't needed?
I did a cargo update in my previous PR; the ahash in Cargo.lock should be ok for now.
I mentioned the minimum nightly version in the README, but I did not explain why.
https://github.com/pendulum-chain/pendulum?tab=readme-ov-file#how-to-run-tests
We could add the reason over there.

ebma · 2024-05-22

I removed the patch statement again and updated all references (also in the CI file) to point to the new nightly version. Let's see if the CI passes and then I'll merge.

b-yap · 2024-05-24T09:58:50Z

@ebma you can squash and merge this now. I reverted to the older version, as 2024-04-18 has some problems.

Pendulum's CI is using 2024-04-18, but it depends on the stable Spacewalk release from when the toolchain was still 2024-02-09, and there have been no problems so far.

ebma added 7 commits April 18, 2024 11:34

Change/remove backoff::retry() usage in runner

3ab512a

Remove backoff from error.rs

2e6cbc4

Manually implement backoff behaviour

8132db0

Merge branch 'main' into connection-issues-investigation

4c52142

Fix build error

88c108e

Formatting

e92bb7c

Refactor error messages

60d63e9

ebma requested a review from a team May 7, 2024 16:51

ebma marked this pull request as ready for review May 7, 2024 17:06

gianfra-t approved these changes May 7, 2024

ebma added 2 commits May 8, 2024 11:07

Restore original logic with max retry interval

a6fe253

Remove import

bc4c6fc

Run cargo +nightly-2024-02-09 fmt --all

32b8d0a

bogdanS98 approved these changes May 8, 2024

ebma added 3 commits May 8, 2024 17:38

Run cargo fmt with nightly

93c9bb2

Add logic to reopen the runner websocket

8784c23

Add debug logs

e24496d

ebma added 5 commits May 13, 2024 11:19

Fix compile issues

d553ed7

Add test for the backoff/retry logic

585c267

Simplify retry logic

2bf4dc2

Use test for both

8c2f5cb

Remove unfinished test case and refactor

11504ed

b-yap reviewed May 15, 2024

b-yap and others added 2 commits May 16, 2024 16:33

fix trailing semicolons

110dceb

Refactor error handling around try_get_release()

4f3fed9

ebma added 7 commits May 22, 2024 17:29

Merge branch 'main' into connection-issues-investigation

c9dcd4a

Remove patch statement for ahash again

18d8746

Remove patch statement for ahash again

4e8b516

Replace references to nightly-02-09 with nightly-04-18

287116c

Don't log message for each retry

1409157

Change max-parallel to 2

64c7c9a

as there is no reason not to parallelize the jobs anymore

Add some 'allow' statements for clippy

cabd7f0

ebma mentioned this pull request May 23, 2024

Use Foucoco runtime with instant seal pendulum-chain/pendulum#462

Merged

revert version to 2024-02-09

dc2407f

ebma merged commit dae9ddd into main May 24, 2024
2 checks passed

ebma deleted the connection-issues-investigation branch May 24, 2024 15:47

ebma restored the connection-issues-investigation branch May 28, 2024 15:43

ebma deleted the connection-issues-investigation branch July 4, 2024 12:44
