Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add changes to fix issues running the Spacewalk client in Kubernetes #519

Merged
merged 28 commits into from
May 24, 2024

Conversation

ebma
Copy link
Member

@ebma ebma commented May 2, 2024

Removes the backoff crate. Assuming that the retry() function introduces some side effects, we manually implement the retry logic now. The existing constants for RETRY_TIMEOUT and RETRY_INTERVAL are kept the same way, except now they are only used to derive the number of retries.

Retrying for Subxt RPC error due to closed connection

Sometimes, the runner encounters the following error:

[2024-05-07T13:56:14Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...
[2024-05-07T13:56:15Z INFO  runner::runner] Error fetching executable: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

I found this PR which adds a new experimental implementation of an RPC client that automatically reconnects. This implementation is only available in subxt v0.35 or later. I tried bumping the subxt dependencies we use in Spacewalk to that version but I encountered conflicts because our Polkadot dependencies are too outdated.
-> I created #521 as a follow-up and we'll ignore this issue for now.

Related to https://github.com/pendulum-chain/tasks/issues/207.

@ebma ebma requested a review from a team May 7, 2024 16:51
@ebma ebma marked this pull request as ready for review May 7, 2024 17:06
@ebma
Copy link
Member Author

ebma commented May 7, 2024

@pendulum-chain/devs this is now ready for review

Copy link
Contributor

@gianfra-t gianfra-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Even a simpler and more transparent solution than the previous retry.

Regarding the error due to close connection, could we retry a new one after a failure here by simply creating a new client instance with Ok(OnlineClient::from_url(url).await?)? Similar to what we do in the testing service in node.

Of course this would be a less robust solution than the native subxt solution.

@ebma
Copy link
Member Author

ebma commented May 8, 2024

Regarding the error due to close connection, could we retry a new one after a failure here by simply creating a new client instance with Ok(OnlineClient::from_url(url).await?)?

Good point. I looked into this again and realized that this logic was kind of in place before. In this loop, the runner is restarted if try_get_release() returns an error. The problem was that the changes I made were missing the maximum retry timeout that exists in the backoff crate (max_elapsed_time). Instead of defining a maximum duration for which the logic is retried, we can only specify the number of retries, but I now derive that here so the result is similar. With this, the retry_with_log() functions will now return an error in time that is either caught inside the loop I linked to previously, or if it happens elsewhere will make the runner stop entirely. Once stopped, they can be automatically restarted in the infrastructure.

@gianfra-t
Copy link
Contributor

Once stopped, they can be automatically restarted in the infrastructure.

Okay understood, I didn't consider this behavior, makes sense!

Copy link
Contributor

@bogdanS98 bogdanS98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great changes! 👍🏼

@ebma
Copy link
Member Author

ebma commented May 13, 2024

Over the weekend, the runner client again encountered the error.

[2024-05-13T08:43:43Z INFO  runner::runner] Error reading chain storage for release: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...

Because I slightly changed the phrasing in the log messages, I was able to pin it down to this line which means that the runner was only 'hanging' in this loop statement. Now I noticed that the call to maybe_restart_client doesn't do anything in this case because the child process (the actual vault client binary) is still running and working fine. I tested it and it's still able to process issue and redeem requests, and the RPC connection of the vault client is also working. This is interesting but my assumption is that the RPC connection of client is fine because of its periodic restart every 3 hours, thus the connection is always kind of fresh, whereas the runner never restarts or refreshes its RPC connection. That's why I added some logic to just try and create a new RPC client in the runner.

@ebma
Copy link
Member Author

ebma commented May 15, 2024

Something was off with the iterator returned by the exponential-backoff crate and it kept on returning values. I removed that crate completely and now we do everything manually. I also added some small tests for the retry_with_log_... functions so that we can now be sure that they don't retry more often than they should.
Screenshot 2024-05-13 at 19 27 01

Cargo.toml Outdated

# We need to patch this to https://github.com/tkaitchuck/aHash/releases/tag/v0.8.11 to prevent a build error
# 'error[E0635]: unknown feature `stdsimd`' that occurs because this feature was removed in the latest nightly versions
ahash = { git = "https://github.com/tkaitchuck/aHash", rev = "db36e4c4f0606b786bc617eefaffbe4ae9100762" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think nightly-2024-04-18 would work in Spacewalk too? Or was there another issue? I don't remember.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to test with +nightly and it was fine.

Copy link
Member Author

@ebma ebma May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we define nightly-2024-04-18 in the rust-toolchain file in all references (README/github actions) then and remove this patch for ahash from the Cargo.toml file? Or what do you think @b-yap?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@b-yap any thoughts on this?

Copy link
Contributor

@b-yap b-yap May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, I think the patch isn't needed?
I did a cargo update in my previous PR; the ahash in cargo.lock should be ok for now.
I mentioned the minimum nightly version in the readme; but I did not explain why.
https://github.com/pendulum-chain/pendulum?tab=readme-ov-file#how-to-run-tests
We could add the reason over there.

Copy link
Member Author

@ebma ebma May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the patch statement again and updated all references (also in the CI file) to point to the new nightly version. Let's see if the CI passes and then I merge.

@b-yap
Copy link
Contributor

b-yap commented May 24, 2024

@ebma you can squash and merge this now. I reverted to older version, as 2024-04-18 has some problems.

Pendulum's CI is using 2024-04-18, but it's using the stable spacewalk when it was still 2024-02-09; and there were no problems so far.

@ebma ebma merged commit dae9ddd into main May 24, 2024
2 checks passed
@ebma ebma deleted the connection-issues-investigation branch May 24, 2024 15:47
@ebma ebma restored the connection-issues-investigation branch May 28, 2024 15:43
@ebma ebma deleted the connection-issues-investigation branch July 4, 2024 12:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants