452 vault client retries transaction submission for unrecoverable errors #453

Conversation

@gianfra-t gianfra-t commented Nov 24, 2023

Closes #452

Changes

  • Uses the string data field on the parsed error to detect recoverable errors (lack of funds, incorrect nonce); a minimal sketch of this check is shown after this list.
  • Otherwise, the service avoids retrying the same call, since it is impossible to obtain a different result without modifying it (except for the vault operator's input).
  • A new condition is added so that if the error is of type POOL_UNACTIONABLE or POOL_UNKNOWN_VALIDITY (which do not necessarily indicate an invalid transaction), the vault will retry the transaction.
  • If the service fails to start, it will send a shutdown signal to avoid waiting on a dead process.
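
The sketch referenced in the first item (the function name and matched error strings are assumptions for illustration, not the exact wording used in clients/runtime/src/error.rs):

```rust
// Minimal sketch of the string-based recoverability check described above.
// Hypothetical name and patterns; the real logic lives in clients/runtime/src/error.rs.
fn is_recoverable(error_data: &str) -> bool {
    // Recoverable errors are those where the same call can succeed later
    // without being modified, e.g. once the account is funded or the nonce
    // catches up.
    const RECOVERABLE_PATTERNS: &[&str] = &[
        "Priority is too low", // assumed wording for a stale/duplicate nonce
        "balance too low",     // assumed wording for lack of funds
    ];
    RECOVERABLE_PATTERNS.iter().any(|p| error_data.contains(*p))
}
```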

@gianfra-t gianfra-t linked an issue Nov 24, 2023 that may be closed by this pull request
@gianfra-t gianfra-t requested a review from a team November 27, 2023 14:15
@gianfra-t gianfra-t marked this pull request as ready for review November 27, 2023 14:16

@b-yap b-yap left a comment

is it possible to recreate POOL_UNACTIONABLE and/or POOL_UNKNOWN_VALIDITY and test it?

examples:
POOL_INVALID_TX
POOL_TOO_LOW_PRIORITY

Review thread on clients/runtime/src/error.rs (outdated, resolved)
Review thread on clients/runtime/src/error.rs (resolved)
@b-yap b-yap requested a review from ebma November 28, 2023 14:55
Review thread on clients/runtime/src/error.rs (outdated, resolved)

gianfra-t commented Nov 28, 2023

is it possible to recreate POOL_UNACTIONABLE and/or POOL_UNKNOWN_VALIDITY and test it?

examples: POOL_INVALID_TX POOL_TOO_LOW_PRIORITY

I am not sure we can test this since we would need to modify the pool logic. I could not find much information about these errors, but the assumption was that a pool error would not be a problem with the transaction itself.

I removed POOL_UNACTIONABLE since, looking a little more carefully, it can only occur when the transaction's propagate value is set to false and the local node is not producing blocks (see in code). In our case, propagate will always be true, since this is handled by subxt and true is the default value.

Update: I also removed POOL_UNKNOWN_VALIDITY

@gianfra-t gianfra-t requested a review from b-yap December 4, 2023 15:18

@b-yap b-yap left a comment

Ah, the error still occurred:

Dec 05 18:42:57.878  WARN runtime::retry: Subxt runtime error: Runtime error: Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }) - next retry in 0.904 s    
Dec 05 18:42:57.879  INFO vault::requests::structs: Successfully executed Redeem request #0x6ec8…9392
Dec 05 18:42:57.879  INFO vault::requests::execution: Successfully executed Redeem request #0x6ec8…9392
Dec 05 18:42:58.198  INFO vault::requests::structs: Successfully executed Redeem request #0xcb6f…4fbd
Dec 05 18:42:58.200  INFO vault::requests::execution: Successfully executed Redeem request #0xcb6f…4fbd
Dec 05 18:43:50.791  WARN runtime::retry: Subxt runtime error: Runtime error: Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }) - next retry in 1.691 s 

Let's take @ebma's assumption and go ahead and flag the module errors as "unrecoverable".

gianfra-t commented Dec 5, 2023

Thanks @b-yap for testing this. I found that there is a second call to notify_retry inside the vault module, which is most likely why you were still able to see the vault retrying an unrecoverable error.

As soon as I can test this and replicate that scenario I will update this comment.

Update: Running the vault locally, I now see that the vault is not retrying on Quorum errors and gives up on the request.

Dec 05 15:20:33.703 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:20:33.703  INFO vault::requests::execution: Performing retry #1 out of 3 retries for Redeem request #0x269b…2345
Dec 05 15:20:33.731  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 49015483
Dec 05 15:21:07.227 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:21:07.228  INFO vault::requests::execution: Performing retry #2 out of 3 retries for Redeem request #0x269b…2345
Dec 05 15:21:07.239  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 49015483
Dec 05 15:21:28.972 ERROR stellar_relay_lib::connection::services: connection_handler(): Timeout of deadline has elapsed seconds elapsed for reading messages from Stellar Node. Retry: #0    
Dec 05 15:21:28.972 ERROR stellar_relay_lib::connection::services: receiving_service(): proc_id: 2462. Timeout of deadline has elapsed seconds elapsed for reading messages from Stellar Node. Retry: #0    
Dec 05 15:21:44.907 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:21:44.907 ERROR vault::requests::execution: Exceeded max number of retries (3) to execute Redeem request #0x269b2b6df51ccd57f7f8e3ce609508df7a31b3192932a1376269934945be2345. Giving up...
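
For illustration, below is a rough sketch of this retry behaviour: back off and retry only while the error is recoverable, and return anything else immediately. The names (SubmissionError, submit_with_retry) and the use of tokio for the sleep are assumptions, not the actual notify_retry code.

```rust
// Rough sketch of the submission retry: back off and retry only while the
// error is classified as recoverable; return any other error immediately.
// Hypothetical names, not the actual notify_retry implementation.
use std::time::Duration;

#[derive(Debug)]
enum SubmissionError {
    Recoverable(String),   // e.g. lack of funds, outdated nonce
    Unrecoverable(String), // e.g. a runtime ModuleError like the quorum error above
}

async fn submit_with_retry<F, Fut, T>(mut submit: F) -> Result<T, SubmissionError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, SubmissionError>>,
{
    let mut delay = Duration::from_millis(500);
    loop {
        match submit().await {
            Ok(value) => return Ok(value),
            Err(SubmissionError::Recoverable(reason)) => {
                // Transient condition: wait and try the exact same call again.
                eprintln!("recoverable error: {reason} - next retry in {delay:?}");
                tokio::time::sleep(delay).await;
                delay *= 2; // simple exponential backoff
            }
            Err(err) => return Err(err), // unrecoverable: give up right away
        }
    }
}
```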

@gianfra-t gianfra-t requested a review from b-yap December 6, 2023 20:13

@b-yap b-yap left a comment

let's wait for @ebma

@ebma ebma left a comment

It looks good overall, though one remark that I noticed while testing it:
If the extrinsic to execute the request fails and returns an error, then we are still retrying it up to three times. I think we should not increment the retry_counter here but rather break and exit the loop, because we know that the request cannot be executed when we retry.
Or do you remember if there was a good reason to also add the retry here, @b-yap? It makes sense in this error block but not in the other one IMO.

@gianfra-t

Or do you remember if there was a good reason to also add the retry here @b-yap? It makes sense in this error block but not in the other one IMO.

But if there is a Skip and we keep retrying (in the inner retry loop), we could eventually hit here, which will also return an error, but not necessarily an unrecoverable one as we have defined it. Perhaps we could break unless it is a Timeout variant.

@ebma ebma left a comment

I think we can ignore the Timeout that is thrown in the backoff policy here.
Maybe I'm wrong, but the way I see it, with our current implementation we will stay here until we either hit an Unrecoverable error or the error is recoverable and has already been retried so many times that we encounter the Timeout error you mentioned. In both cases we wouldn't want to retry in this outer loop again: if it was Unrecoverable we don't want to retry, and if we get the Timeout then we already retried a lot anyway.

Review thread on clients/vault/src/requests/structs.rs (outdated, resolved)
@gianfra-t

Yes, you may be right about the outer retry loop. Either way, if we needed more retries, just changing the backoff policy would be enough. Let's wait for @b-yap's take on it? Otherwise we just remove the counter there and put a break, like you say.
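
Roughly, the result would be something like this hypothetical sketch (names made up, not the real vault code); in the actual loop the change just amounts to replacing the retry_count increment with a break:

```rust
// Hypothetical sketch of what the outer execution step reduces to once the
// retry_count is removed: a single attempt whose error means "give up",
// because the inner retry layer has already handled anything recoverable.
async fn process_request<F, Fut, E>(mut execute: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Debug,
{
    match execute().await {
        Ok(()) => println!("Successfully executed request"),
        // Previously this arm incremented retry_count and looped again (up to
        // three times); now it simply gives up, since by this point the error
        // is either unrecoverable or the backoff already timed out.
        Err(err) => eprintln!("Failed to execute request: {err:?}. Giving up..."),
    }
}
```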

b-yap commented Dec 7, 2023

@gianfra-t Yes you can remove the retry_count.

@ebma ebma left a comment

Great changes, good job @gianfra-t 👍

Tested it locally and it looks good to me. Ready for merge once CI passes.

@gianfra-t gianfra-t merged commit 2ee2f62 into main Dec 8, 2023
2 checks passed
@gianfra-t gianfra-t deleted the 452-vault-client-retries-transaction-submission-for-unrecoverable-errors branch December 8, 2023 13:39