CI friendly fast mode for try state checks #13382

Ank4n · 2023-02-13T22:24:18Z

Reviving #13286
Should resolve paritytech/polkadot-sdk#234

Adds a new fast mode to TryStateSelect which skips slow-running state checks.

Adds two new options [fast-all | fast-try-state] to on-runtime-upgrade --checks.

Example usage:

 ./polkadot \
     try-runtime \
     --runtime runtime-try-runtime.wasm \
     -lruntime=debug \
     on-runtime-upgrade \
     --checks=fast-all \
     live \
     --uri ws://localhost:9944

frame/staking/src/pallet/mod.rs

frame/nomination-pools/src/lib.rs

kianenigma · 2023-02-14T13:23:22Z

frame/executive/src/lib.rs

@@ -353,7 +353,7 @@ where
 			let _guard = frame_support::StorageNoopGuard::default();
 			<AllPalletsWithSystem as frame_support::traits::TryState<System::BlockNumber>>::try_state(
 				frame_system::Pallet::<System>::block_number(),
-				frame_try_runtime::TryStateSelect::All,
+				frame_try_runtime::TryStateSelect::Fast,


It might complicate things too much, but how about:

pub enum UpgradeCheckSelect {
	/// Run no checks.
	None,
	/// Run the `try_state`, `pre_upgrade` and `post_upgrade` checks.
	All,
	/// Run the `pre_upgrade` and `post_upgrade` checks.
	PreAndPost,
	/// Run the `try_state` checks.
	TryState(TryStateSelect),
}

This will give us the ultimate power to customize this, and hopefully the default behavior remains the same.

@ggwpez wdyt?

So basically @Ank4n had asked:

Should we add another option in UpgradeCheckSelect to choose try-state-fast checks?

I'd say yes.

Selecting TryState won't run pre and post checks though.
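A minimal sketch of how the proposed enum could map to what actually runs, which makes the last point concrete: picking `TryState(..)` leaves the pre- and post-upgrade checks off. The `TryStateSelect` here is a reduced stand-in, and the helper names `runs_pre_and_post` and `try_state_select` are hypothetical, not taken from the PR.

```rust
/// Reduced stand-in for `TryStateSelect` (assumption: the real type has more variants).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TryStateSelect {
    None,
    All,
    Fast,
}

/// The enum proposed in the comment above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UpgradeCheckSelect {
    None,
    All,
    PreAndPost,
    TryState(TryStateSelect),
}

impl UpgradeCheckSelect {
    /// Hypothetical helper: should `pre_upgrade` / `post_upgrade` run?
    pub fn runs_pre_and_post(&self) -> bool {
        matches!(self, Self::All | Self::PreAndPost)
    }

    /// Hypothetical helper: which `try_state` selection should be used?
    pub fn try_state_select(&self) -> TryStateSelect {
        match self {
            Self::All => TryStateSelect::All,
            Self::TryState(select) => select.clone(),
            Self::None | Self::PreAndPost => TryStateSelect::None,
        }
    }
}
```

Under this reading, `TryState(TryStateSelect::Fast)` runs only the fast state checks, so a separate `fast-all` style option would still be needed to combine them with the pre- and post-upgrade checks.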

frame/bags-list/src/lib.rs

frame/nomination-pools/src/lib.rs

kianenigma · 2023-03-18T21:01:29Z

frame/support/src/traits/try_runtime.rs

@@ -34,6 +34,10 @@ pub enum Select {
 	///
 	/// Pallet names are obtained from [`super::PalletInfoAccess`].
 	Only(Vec<Vec<u8>>),
+	/// Run only fast running tests.


This will run all pallets, but in fast mode, right? Otherwise I am not sure how it works in that regard.

Yes, it is the same as All, but pallets can choose to skip some tests in fast mode.
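To illustrate the answer above, here is a simplified model, not the PR's code: `Fast` visits the same pallets as `All`, and each pallet decides for itself which checks to skip. The `Select` enum is cut down (other variants of the real type are left out), and `visits_pallet` and `toy_try_state` are hypothetical names.

```rust
/// Cut-down model of the `Select` enum from the diff above (assumption: other
/// variants of the real type are omitted here).
pub enum Select {
    None,
    All,
    Only(Vec<Vec<u8>>),
    Fast,
}

/// Hypothetical helper: is the pallet with this name visited under `select`?
pub fn visits_pallet(select: &Select, pallet_name: &[u8]) -> bool {
    match select {
        Select::None => false,
        // `Fast` selects every pallet, exactly like `All`.
        Select::All | Select::Fast => true,
        Select::Only(names) => names.iter().any(|n| n.as_slice() == pallet_name),
    }
}

/// Toy pallet hook: returns the names of the checks it ran.
pub fn toy_try_state(select: &Select) -> Vec<&'static str> {
    let mut ran = vec!["cheap_invariant"];
    // The pallet itself opts out of its expensive check in fast mode.
    if !matches!(select, Select::Fast) {
        ran.push("expensive_invariant");
    }
    ran
}
```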

frame/support/src/traits/hooks.rs

kianenigma

Needs a bit more work, looking forward to it + finally enabling these tests in CI.

This is also relevant for #13013, probably decoding the entire state is not something that we want to do in CI. Not sure.

All in all, my priority is to make these tests and the CI work. @Ank4n fwiw I would happily approve a temp solution that simply reduces the try_state of all pallets, including staking, to something sensible (perhaps feature gated by env!("CI_EXEC") as a quick hack), and then we can incrementally introduce them back, or think about how to do it.

In other words, in the case of this PR, getting something out there fast is more important than perfection.

Please check with @ggwpez about enabling CI checks, as a lot of the recent CI checks have been his courtesy.

stale · 2023-04-17T21:16:31Z

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

kianenigma · 2023-04-20T08:56:05Z

Updates here?

kianenigma · 2023-05-16T09:19:48Z

Updates here?

Repeat 😁

Ank4n · 2023-05-27T04:15:26Z

utils/frame/try-runtime/cli/src/commands/on_runtime_upgrade.rs

 	/// - `pre-and-post`: Perform pre- and post-upgrade checks.
 	/// - `try-state`: Perform the try-state checks.
+	/// - `fast-try-state`: Perform fast running state checks.


Maybe a better approach would be to add another option to the OnRuntimeUpgrade command, such as mode: [fast | normal].

liamaharon

This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.

- Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.
  - At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.
- The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.
- The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.

I'm very open to being shown why the cost is justified here, just want to make sure this has been considered

Ank4n · 2023-05-29T10:52:07Z

@liamaharon Thanks for raising those points. It's still all open to discussion, but I will write down how I was thinking about this.

This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.

Currently the try-state checks are disabled on CI (it only runs pre- and post-upgrade checks). So in a way what you are suggesting is what is happening currently.

Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.

Another way to think about it is that this would allow every pallet to write very thorough try-state tests. Some of these tests may take a really long time (probably hours); it would be great to run all of them on CI, but we also don't want CI to be stuck for hours. The pallet developer still believes the other tests cover 99.9% of scenarios, and this slow-running test is something they run occasionally outside CI to ensure all storage items are consistent with expectations.

To give a more concrete example (which is actually the reason why we wanted to introduce the fast mode), we can look at this try-state check in the staking pallet. It iterates over all nominators (~44k) and then, for each nominator, fetches all active validators (~300), gets their exposure, iterates over all of its stakers (could be up to 512), and checks that the nominator (in the first loop) is present only once in a validator's exposure. This is cubic time complexity and currently takes more than 2 hours to run.
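The shape of that check can be sketched with plain in-memory data. This is only an illustration of the triple loop described above, not the staking pallet's code: `AccountId`, `check_exposures`, and the `BTreeMap` layout are assumptions made for the example, and the real check reads from storage rather than from a map.

```rust
use std::collections::BTreeMap;

/// Toy account type (assumption for this sketch).
type AccountId = u32;

/// Returns an error if any nominator appears more than once in the exposure of
/// any validator. `exposures` maps a validator to the list of its stakers.
pub fn check_exposures(
    nominators: &[AccountId],
    exposures: &BTreeMap<AccountId, Vec<AccountId>>,
) -> Result<(), String> {
    // Outer loop: every nominator (~44k in the figures quoted above).
    for nominator in nominators {
        // Middle loop: every active validator (~300).
        for (validator, stakers) in exposures {
            // Inner loop: every staker in that validator's exposure (up to 512).
            let hits = stakers.iter().filter(|s| *s == nominator).count();
            if hits > 1 {
                return Err(format!(
                    "nominator {nominator} appears {hits} times in the exposure of validator {validator}"
                ));
            }
        }
    }
    Ok(())
}
```

At the quoted upper bounds this is 44,000 × 300 × 512 ≈ 6.8 billion inner comparisons, before counting the cost of the storage reads the real check performs.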

Also, most of the time, a failing try-state check may not have been introduced by the current PR but by a sneaky bug that we only found out about later. I think we might even want to set these tests to allow_failure, so that they are indications to investigate and not blockers to a PR.

To summarise, I think what this PR is trying to do is find a middle ground where we can run try-state checks in the CI that cover most inconsistent-state scenarios, but still enable pallet developers to write more thorough but expensive tests that they can run outside CI if they want to. It's also important to emphasise that try-state checks should only be our 3rd or 4th line of defence.

At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.

The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.

The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.

I agree we should look to improve this, make it more intuitive and better documented. By default every test should be marked as a fast test unless we notice it is taking really long to run in the CI. I do think, though, that running 99% of the try-state checks on CI and ignoring one is a better alternative than not running anything on CI. If there are complex changes to a pallet's logic, a developer should always run the full suite (maybe we can have a bot command that runs the full suite on demand).

gpestana

Neat!

gpestana · 2023-06-08T21:06:46Z

frame/support/src/traits/hooks.rs

@@ -306,9 +308,12 @@ pub trait Hooks<BlockNumber> {
 	/// It should focus on certain checks to ensure that the state is sensible. This is never
 	/// executed in a consensus code-path, therefore it can consume as much weight as it needs.
 	///
+	/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to


Suggested change

/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to

/// Takes the block number and `TryStateSelect` as a parameter. The `TryStateSelect` is used to

gpestana · 2023-06-08T21:14:21Z

frame/staking/src/mock.rs

@@ -550,7 +550,7 @@ impl ExtBuilder {
 		let mut ext = self.build();
 		ext.execute_with(test);
 		ext.execute_with(|| {
-			Staking::do_try_state(System::block_number()).unwrap();
+			Staking::do_try_state(System::block_number(), false).unwrap();


Maybe it would be clearer for the reader to pass one of [TryStateSelect::Fast, TryStateSelect::All] here instead of a bool.

kianenigma

I strongly think usage of TryStateSelect is not correct here.

kianenigma · 2023-06-12T12:49:35Z

frame/collective/src/lib.rs

-		fn try_state(_n: BlockNumberFor<T>) -> Result<(), TryRuntimeError> {
+		fn try_state(
+			_n: BlockNumberFor<T>,
+			_s: frame_support::traits::TryStateSelect,


This API is unfortunately not correct.

You are right to pass in something into the hook that helps it understand if it is fast or not, but TryStateSelect is not the right type here.

TryStateSelect is meant to identify which pallets to execute, not how much time they should each consume. It is only interpreted by the Executive and should not be exposed to the end user here at all.

What we instead want is a TryStateSpeed, which you can for now assume to be a bool or enum TryStateSpeed { Slow, Mid, Fast }.

Then, you are capable of selecting which pallets to run, and at what speed.

As it stands now, I see this flaw as well:

A pallet could possibly see RoundRobin(7) as its try-state select. How should it interpret this? Answer: it cannot, because it is not the right audience for it.
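A rough sketch of the separation being asked for, under stated assumptions: the executive-like function interprets the selection, and the pallet only ever sees a speed. `ToyPallet`, `run_try_state`, and the rule that only `Fast` skips the slow check are hypothetical, and `RoundRobin` is left out of this cut-down `TryStateSelect` to keep the example short.

```rust
/// Speed hint passed to each pallet, as suggested in the review comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TryStateSpeed {
    Slow,
    Mid,
    Fast,
}

/// Cut-down pallet selection (assumption: `RoundRobin` omitted for brevity).
pub enum TryStateSelect {
    None,
    All,
    Only(Vec<Vec<u8>>),
}

/// Hypothetical toy pallet used only for this illustration.
pub struct ToyPallet {
    pub name: &'static str,
    pub has_slow_check: bool,
}

impl ToyPallet {
    /// Returns how many checks ran. The pallet sees the speed, never the selection.
    pub fn try_state(&self, speed: TryStateSpeed) -> usize {
        let mut ran = 1; // the cheap check always runs
        // Assumption: only `Fast` skips the slow check; `Mid` behaves like `Slow` here.
        if self.has_slow_check && speed != TryStateSpeed::Fast {
            ran += 1;
        }
        ran
    }
}

/// Executive-like driver: interprets `select`, then forwards only `speed`.
pub fn run_try_state(
    pallets: &[ToyPallet],
    select: &TryStateSelect,
    speed: TryStateSpeed,
) -> Vec<(&'static str, usize)> {
    pallets
        .iter()
        .filter(|p| match select {
            TryStateSelect::None => false,
            TryStateSelect::All => true,
            TryStateSelect::Only(names) => {
                names.iter().any(|n| n.as_slice() == p.name.as_bytes())
            }
        })
        .map(|p| (p.name, p.try_state(speed)))
        .collect()
}
```

With this split, which pallets run and how thoroughly each one runs are independent choices, and a pallet never has to interpret a selection value that was not meant for it.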

Ank4n · 2023-06-12T22:33:30Z

bot rebase

paritytech-processbot · 2023-06-12T22:33:45Z

Rebased

stale · 2023-07-12T23:54:09Z

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Ank4n · 2023-07-13T10:23:22Z

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Yes, I will be reworking this.

Ank4n added 5 commits February 13, 2023 23:10

ignore slow running tests in CI

7be05b8

choose type of test from cli

63c1f43

add fast option to try state checks

17a9911

remove unused

2e5b962

fmt

2af73ee

Ank4n requested a review from kianenigma as a code owner February 13, 2023 22:24

Ank4n added 2 commits February 13, 2023 23:32

feature gate import

25843d9

fmt

adefb2d

kianenigma reviewed Feb 14, 2023

frame/staking/src/pallet/mod.rs (outdated)

kianenigma reviewed Feb 14, 2023

frame/nomination-pools/src/lib.rs (outdated)

kianenigma reviewed Feb 14, 2023

Ank4n requested a review from ggwpez February 20, 2023 13:49