452 vault client retries transaction submission for unrecoverable errors #453

Conversation

@gianfra-t gianfra-t commented Nov 24, 2023

Closes #452

Changes

  • Uses the string data field on the parsed error to detect recoverable errors (lack of funds, incorrect nonce); a minimal sketch of this check is shown after this list.
  • Otherwise, the service avoids retrying the same call, since it is impossible to obtain a different result without modifying it (except for the vault operator's input).
  • A new condition is added so that if the error is of type POOL_UNACTIONABLE or POOL_UNKNOWN_VALIDITY (which do not necessarily indicate an invalid transaction), the vault will retry the transaction.
  • If the service fails to start, it will send a shutdown signal to avoid waiting on a dead process.
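
The sketch referenced in the first item (the function name and matched error strings are assumptions for illustration, not the exact wording used in clients/runtime/src/error.rs):

```rust
// Minimal sketch of the string-based recoverability check described above.
// Hypothetical name and patterns; the real logic lives in clients/runtime/src/error.rs.
fn is_recoverable(error_data: &str) -> bool {
    // Recoverable errors are those where the same call can succeed later
    // without being modified, e.g. once the account is funded or the nonce
    // catches up.
    const RECOVERABLE_PATTERNS: &[&str] = &[
        "Priority is too low", // assumed wording for a stale/duplicate nonce
        "balance too low",     // assumed wording for lack of funds
    ];
    RECOVERABLE_PATTERNS.iter().any(|p| error_data.contains(*p))
}
```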

@gianfra-t gianfra-t linked an issue Nov 24, 2023 that may be closed by this pull request
@gianfra-t gianfra-t requested a review from a team November 27, 2023 14:15
@gianfra-t gianfra-t marked this pull request as ready for review November 27, 2023 14:16

@b-yap b-yap left a comment

is it possible to recreate POOL_UNACTIONABLE and/or POOL_UNKNOWN_VALIDITY and test it?

examples:
POOL_INVALID_TX
POOL_TOO_LOW_PRIORITY

Review thread on clients/runtime/src/error.rs (outdated, resolved)
Review thread on clients/runtime/src/error.rs (resolved)
@b-yap b-yap requested a review from ebma November 28, 2023 14:55
Review thread on clients/runtime/src/error.rs (outdated, resolved)

gianfra-t commented Nov 28, 2023

is it possible to recreate POOL_UNACTIONABLE and/or POOL_UNKNOWN_VALIDITY and test it?

examples: POOL_INVALID_TX POOL_TOO_LOW_PRIORITY

I am not sure we can test this since we would need to modify the pool logic. I could not find much information about these errors, but the assumption was that a pool error would not be a problem with the transaction itself.

I removed POOL_UNACTIONABLE since, looking a little more carefully, it can only occur when the transaction's propagate value is set to false and the local node is not producing blocks (see in code). In our case, propagate will always be true, since this is handled by subxt and true is the default value.

Update: I also removed POOL_UNKNOWN_VALIDITY

@gianfra-t gianfra-t requested a review from b-yap December 4, 2023 15:18

@b-yap b-yap left a comment

Ah, the error still occurred:

Dec 05 18:42:57.878  WARN runtime::retry: Subxt runtime error: Runtime error: Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }) - next retry in 0.904 s    
Dec 05 18:42:57.879  INFO vault::requests::structs: Successfully executed Redeem request #0x6ec8…9392
Dec 05 18:42:57.879  INFO vault::requests::execution: Successfully executed Redeem request #0x6ec8…9392
Dec 05 18:42:58.198  INFO vault::requests::structs: Successfully executed Redeem request #0xcb6f…4fbd
Dec 05 18:42:58.200  INFO vault::requests::execution: Successfully executed Redeem request #0xcb6f…4fbd
Dec 05 18:43:50.791  WARN runtime::retry: Subxt runtime error: Runtime error: Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }) - next retry in 1.691 s 

Let's take @ebma's assumption and go ahead and flag the module errors as "unrecoverable".

gianfra-t commented Dec 5, 2023

Thanks @b-yap for testing this. I found that there is a second call to notify_retry inside the vault module, which is most likely why you were still able to see the vault retrying an unrecoverable error.

As soon as I can test this and replicate that scenario I will update this comment.

Update: Running the vault locally, I now see that the vault is not retrying on Quorum errors and gives up on the request.

Dec 05 15:20:33.703 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:20:33.703  INFO vault::requests::execution: Performing retry #1 out of 3 retries for Redeem request #0x269b…2345
Dec 05 15:20:33.731  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 49015483
Dec 05 15:21:07.227 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:21:07.228  INFO vault::requests::execution: Performing retry #2 out of 3 retries for Redeem request #0x269b…2345
Dec 05 15:21:07.239  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 49015483
Dec 05 15:21:28.972 ERROR stellar_relay_lib::connection::services: connection_handler(): Timeout of deadline has elapsed seconds elapsed for reading messages from Stellar Node. Retry: #0    
Dec 05 15:21:28.972 ERROR stellar_relay_lib::connection::services: receiving_service(): proc_id: 2462. Timeout of deadline has elapsed seconds elapsed for reading messages from Stellar Node. Retry: #0    
Dec 05 15:21:44.907 ERROR vault::requests::execution: Failed to execute Redeem request #0x269b…2345 because of error: RuntimeError(SubxtRuntimeError(Runtime(Module(ModuleError { pallet: "StellarRelay", error: "InvalidQuorumSetNotEnoughValidators", description: [], error_data: ModuleErrorData { pallet_index: 68, error: [12, 0, 0, 0] } }))))
Dec 05 15:21:44.907 ERROR vault::requests::execution: Exceeded max number of retries (3) to execute Redeem request #0x269b2b6df51ccd57f7f8e3ce609508df7a31b3192932a1376269934945be2345. Giving up...
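
For illustration, below is a rough sketch of this retry behaviour: back off and retry only while the error is recoverable, and return anything else immediately. The names (SubmissionError, submit_with_retry) and the use of tokio for the sleep are assumptions, not the actual notify_retry code.

```rust
// Rough sketch of the submission retry: back off and retry only while the
// error is classified as recoverable; return any other error immediately.
// Hypothetical names, not the actual notify_retry implementation.
use std::time::Duration;

#[derive(Debug)]
enum SubmissionError {
    Recoverable(String),   // e.g. lack of funds, outdated nonce
    Unrecoverable(String), // e.g. a runtime ModuleError like the quorum error above
}

async fn submit_with_retry<F, Fut, T>(mut submit: F) -> Result<T, SubmissionError>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, SubmissionError>>,
{
    let mut delay = Duration::from_millis(500);
    loop {
        match submit().await {
            Ok(value) => return Ok(value),
            Err(SubmissionError::Recoverable(reason)) => {
                // Transient condition: wait and try the exact same call again.
                eprintln!("recoverable error: {reason} - next retry in {delay:?}");
                tokio::time::sleep(delay).await;
                delay *= 2; // simple exponential backoff
            }
            Err(err) => return Err(err), // unrecoverable: give up right away
        }
    }
}
```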

@gianfra-t gianfra-t requested a review from b-yap December 6, 2023 20:13

@b-yap b-yap left a comment

let's wait for @ebma

@ebma ebma left a comment

It looks good overall, though one remark that I noticed while testing it:
If the extrinsic to execute the request fails and returns an error, then we are still retrying it up to three times. I think we should not increment the retry_counter here but rather break and exit the loop, because we know that the request cannot be executed when we retry.
Or do you remember if there was a good reason to also add the retry here, @b-yap? It makes sense in this error block but not in the other one IMO.

@gianfra-t

Or do you remember if there was a good reason to also add the retry here @b-yap? It makes sense in this error block but not in the other one IMO.

But if there is a Skip and we keep retrying (in the inner retry loop), we could eventually hit here, which will also return an error, but not necessarily an unrecoverable one as we have defined it. Perhaps we could break unless it is a Timeout variant.

@ebma ebma left a comment

I think we can ignore the Timeout that is thrown in the backoff policy here.
Maybe I'm wrong, but the way I see it, with our current implementation we will stay here until we either hit an Unrecoverable error or the error is recoverable and has already been retried so many times that we encounter the Timeout error you mentioned. In both cases we wouldn't want to retry in this outer loop again: if it was Unrecoverable we don't want to retry, and if we get the Timeout then we already retried a lot anyway.

Review thread on clients/vault/src/requests/structs.rs (outdated, resolved)
@gianfra-t

Yes, you may be right about the outer retry loop. Either way, if we needed more retries, just changing the backoff policy would be enough. Let's wait for @b-yap's take on it? Otherwise we just remove the counter there and put a break, like you say.
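
Roughly, the result would be something like this hypothetical sketch (names made up, not the real vault code); in the actual loop the change just amounts to replacing the retry_count increment with a break:

```rust
// Hypothetical sketch of what the outer execution step reduces to once the
// retry_count is removed: a single attempt whose error means "give up",
// because the inner retry layer has already handled anything recoverable.
async fn process_request<F, Fut, E>(mut execute: F)
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Debug,
{
    match execute().await {
        Ok(()) => println!("Successfully executed request"),
        // Previously this arm incremented retry_count and looped again (up to
        // three times); now it simply gives up, since by this point the error
        // is either unrecoverable or the backoff already timed out.
        Err(err) => eprintln!("Failed to execute request: {err:?}. Giving up..."),
    }
}
```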

b-yap commented Dec 7, 2023

@gianfra-t Yes you can remove the retry_count.

@ebma ebma left a comment

Great changes, good job @gianfra-t 👍

Tested it locally and it looks good to me. Ready for merge once CI passes.

@gianfra-t gianfra-t merged commit 2ee2f62 into main Dec 8, 2023
2 checks passed
@gianfra-t gianfra-t deleted the 452-vault-client-retries-transaction-submission-for-unrecoverable-errors branch December 8, 2023 13:39