Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

CI friendly fast mode for try state checks #13382

Draft
wants to merge 18 commits into
base: master
Choose a base branch
from
Draft

Conversation

Ank4n
Copy link
Contributor

@Ank4n Ank4n commented Feb 13, 2023

Reviving #13286
Should resolve paritytech/polkadot-sdk#234

Adds a new fast mode to TryStateSelect which skips slow running state checks.

Adds two new options [fast-all | fast-try-state] to on-runtime-upgrade --checks.

Example usage:

 ./polkadot \
     try-runtime \
     --runtime runtime-try-runtime.wasm \
     -lruntime=debug \
     on-runtime-upgrade \
     --checks=fast-all \
     live \
     --uri ws://localhost:9944

@Ank4n Ank4n added A0-please_review Pull request needs code review. C1-low PR touches the given topic and has a low impact on builders. D2-notlive 💤 PR contains changes in a runtime directory that is not deployed to a chain that requires an audit. B1-note_worthy Changes should be noted in the release notes B0-silent Changes should not be mentioned in any release notes and removed B1-note_worthy Changes should be noted in the release notes labels Feb 13, 2023
@@ -353,7 +353,7 @@ where
let _guard = frame_support::StorageNoopGuard::default();
<AllPalletsWithSystem as frame_support::traits::TryState<System::BlockNumber>>::try_state(
frame_system::Pallet::<System>::block_number(),
frame_try_runtime::TryStateSelect::All,
frame_try_runtime::TryStateSelect::Fast,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might complicate things too much, but how about:

pub enum UpgradeCheckSelect {
	/// Run no checks.
	None,
	/// Run the `try_state`, `pre_upgrade` and `post_upgrade` checks.
	All,
	/// Run the `pre_upgrade` and `post_upgrade` checks.
	PreAndPost,
	/// Run the `try_state` checks.
	TryState(TryStateSelect),
}

This will give us the ultimate power to customize this, and hopefully the default behavior remains the same.

@ggwpez wdyt?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically @Ank4n had asked:

Should we add another option in UpgradeCheckSelect to choose try-state-fast checks?

I'd say yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selecting TryState won't run pre and post checks though.

@Ank4n Ank4n requested a review from ggwpez February 20, 2023 13:49
@@ -34,6 +34,10 @@ pub enum Select {
///
/// Pallet names are obtained from [`super::PalletInfoAccess`].
Only(Vec<Vec<u8>>),
/// Run only fast running tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will run all pallets, but in fast mode right? Or else I am not sure how it work with that regards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is same as All but the pallets can choose to ignore some tests in the fast mode.

Copy link
Contributor

@kianenigma kianenigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a bit more work, looking forward to it + finally enabling these test in CI.

This is also relevant for #13013, probably decoding the entire state is not something that we want to do in CI. Not sure.

All in all, my priority is to make these test and the CI work. @Ank4n fwiw I would happily approve a temp solution that simply reduces the try_state of all pallets, including staking, to be something sensible (perhaps feature gated by env!("CI_EXEC") as a quick hack) and then we can incrementally introduce them back, or think about how to do it.

In other words, in the case of this PR, getting something out there fast is more important than perfection.

Please check with @ggwpez about enabling CI checks as a lot of the recent CI checks have been his curtesy.

@stale
Copy link

stale bot commented Apr 17, 2023

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A3-stale label Apr 17, 2023
@kianenigma
Copy link
Contributor

Updates here?

@stale stale bot removed the A3-stale label Apr 20, 2023
@kianenigma
Copy link
Contributor

Updates here?

Repeat 😁

@Ank4n Ank4n requested review from a team May 26, 2023 17:05
@Ank4n Ank4n added A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. labels May 26, 2023
/// - `pre-and-post`: Perform pre- and post-upgrade checks.
/// - `try-state`: Perform the try-state checks.
/// - `fast-try-state`: Perform fast running state checks.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be a better approach would be to add another option to the OnRuntimeUpgrade command, such as mode: [fast | normal].

@Ank4n Ank4n requested a review from kianenigma May 27, 2023 04:16
Copy link
Contributor

@liamaharon liamaharon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.

  • Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.
    • At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.
  • The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.
  • The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.

I'm very open to being shown why the cost is justified here, just want to make sure this has been considered

@paritytech-ci paritytech-ci requested a review from a team May 29, 2023 08:56
@Ank4n
Copy link
Contributor Author

Ank4n commented May 29, 2023

@liamaharon Thanks for raising those points. Its still all open to discussion but I will write how I was thinking about this.

This is a cool idea, but I'd like to raise the possibility that it may be better to completely turn off try-state checks in the CI than to have them run partially.

Currently the try-state checks are disabled on CI (it only runs pre- and post-upgrade checks). So in a way what you are suggesting is what is happening currently.

  • Run partially, the green CI check loses its meaning. It no longer would signify that try-state hooks are passing, rather just that they may be passing. To be sure they're actually passing, they would need to be run again somewhere else anyway, so there seems to be little value in running them in the CI.

Another way to think about it is, this would allow every pallet to write very thorough try state tests. Some of these tests may take really long time (probably hours) and it would be great to run all of them on CI but we also don't want CI to be stuck for hours. The pallet developer still believes other tests would ensure 99.9% scenarios are covered and this slow running test is something they run occasionally outside CI to ensure all storage items are consistent to the expectations.

To give a more concrete example (which is actually the reason why we wanted to introduce the fast mode) we can look at this try-state check in staking pallet. It iterates over all nominators (~ 44k) and then for each nominator, it fetches all active validators (~ 300), gets their exposure, iterates over all of its stakers (could be upto 512) and checks if the nominator (in the first loop) is only present once in a validator's exposure. This is cubic time complexity and takes more than 2 hours to run currently.

Also, most of the times, a failing try-state checks may not be introduced by the current PR but through a sneaky bug that we only found out about later. I think we might even want to keep these tests to allow_failure and these are just indications to investigate and not blockers to a PR.

To summarise, I think what this PR is trying to do is find a middle ground where we can run try state checks in the CI that would cover most of the inconsistent state scenario but still enable pallet developer write some more thorough but expensive tests that they can run outside CI if they want to. Its also important to emphasise try-state check should only be our 3rd or 4th line of defence.

  • At most, they could alert the dev to some failing hooks. But, the dev would still need to run the full set of try-state hooks somewhere else anyway.
  • The green check may provide a false sense of security to devs who're unaware that the full set of hooks are not run in the CI.
  • The feature comes with a cost: it adds extra configuration to the cli, and introduces cognitive overhead for developers working on the try-state hooks needing to decide whether a check should be 'fast mode' or not.

I agree we should look to improve this, make it more intuitive and better documented. By default every test should be marked as fast test unless we notice they are taking really long to run in the CI. I do think though that running 99% of the try state checks on CI and ignoring one is better alternative than not running anything on CI. If there are complex changes to a pallet logic, a developer should always run full suite (may be we can have a bot command that runs the full suite on demand).

@Ank4n Ank4n requested a review from gpestana June 5, 2023 13:31
Copy link
Contributor

@gpestana gpestana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

@@ -306,9 +308,12 @@ pub trait Hooks<BlockNumber> {
/// It should focus on certain checks to ensure that the state is sensible. This is never
/// executed in a consensus code-path, therefore it can consume as much weight as it needs.
///
/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Takes the block number and `TryStateSelect`as a parameter. The `TryStateSelect` is used to
/// Takes the block number and `TryStateSelect` as a parameter. The `TryStateSelect` is used to

@@ -550,7 +550,7 @@ impl ExtBuilder {
let mut ext = self.build();
ext.execute_with(test);
ext.execute_with(|| {
Staking::do_try_state(System::block_number()).unwrap();
Staking::do_try_state(System::block_number(), false).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it would be more clear for the reader to pass one of [TryStateSelect::Fast, TryStateSelect::All] here instead of bool.

@juangirini juangirini added the T1-runtime This PR/Issue is related to the topic “runtime”. label Jun 12, 2023
Copy link
Contributor

@kianenigma kianenigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly think usage of TryStateSelect is not correct here.

fn try_state(_n: BlockNumberFor<T>) -> Result<(), TryRuntimeError> {
fn try_state(
_n: BlockNumberFor<T>,
_s: frame_support::traits::TryStateSelect,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This API is unfortunately not correct.

You are right to pass in something into the hook that helps it understand if it is fast or not, but TryStateSelect is not the right type here.

TryStateSelect is meant to identify which pallets to execute, not how much time they should each consume. It is only interpreted by the Executive and should not be exposed to the end user here at all.

What we instead want is a TryStateSpeed which you can for now assume to bool or enum TryStateSpeed { Slow, Mid, Fast }.

Then, you are capable of selecting which pallets to run, and at what speed.

As it stands now, I see this flaw as welll:

A pallet could possibly see RoundRobin(7) as its try-state select. How should it interpret this? Answer: it cannot, because it is not the right audience for it.

@Ank4n
Copy link
Contributor Author

Ank4n commented Jun 12, 2023

bot rebase

@paritytech-processbot
Copy link

Rebased

@stale
Copy link

stale bot commented Jul 12, 2023

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the A3-stale label Jul 12, 2023
@Ank4n
Copy link
Contributor Author

Ank4n commented Jul 13, 2023

Hey, is anyone still working on this? Due to the inactivity this issue has been automatically marked as stale. It will be closed if no further activity occurs. Thank you for your contributions.

Yes, will be reworking on this.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A3-in_progress Pull request is in progress. No review needed at this stage. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D2-notlive 💤 PR contains changes in a runtime directory that is not deployed to a chain that requires an audit. T1-runtime This PR/Issue is related to the topic “runtime”.
Projects
Status: ✂️ In progress.
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Staking try_state times out
5 participants