Skip to content

Commit

Permalink
Add PVF execution priority (#4837)
Browse files Browse the repository at this point in the history
Resolves #4632

The new logic optimizes the distribution of execution jobs for disputes,
approvals, and backings. Testing shows improved finality lag and
candidate checking times, especially under heavy network load.

### Approach

This update adds prioritization to the PVF execution queue. The logic
partially implements the suggestions from
#4632 (comment).

We use thresholds to determine how much a current priority can "steal"
from lower ones:
-  Disputes: 70%
-  Approvals: 80%
-  Backing System Parachains: 100%
-  Backing: 100%

A threshold indicates the portion of the current priority that can be
allocated from lower priorities.

For example:
-  Disputes take 70%, leaving 30% for approvals and all backings.
- 80% of the remaining goes to approvals, which is 30% * 80% = 24% of
the original 100%.
- If we used parts of the original 100%, approvals couldn't take more
than 24%, even if there are no disputes.

Assuming a maximum of 12 executions per block, with a 6-second window, 2
CPU cores, and a 2-second run time, we get these distributions:

-  With disputes: 8 disputes, 3 approvals, 1 backing
-  Without disputes: 9 approvals, 3 backings

It's worth noting that when there are no disputes, if there's only one
backing job, we continue processing approvals regardless of their
fulfillment status.

### Versi Testing 40/20

Testing showed a slight difference in finality lag and candidate
checking time between this pull request and its base on the master
branch. The more loaded the network, the greater the observed
difference.

Testing Parameters:
-  40 validators (4 malicious)
-  20 gluttons with 2 seconds of PVF execution time
-  6 VRF modulo samples
-  12 required approvals

![Pasted Graphic
3](https://github.com/user-attachments/assets/8b6163a4-a1c9-44c2-bdba-ce1ef4b1eba7)
![Pasted Graphic
4](https://github.com/user-attachments/assets/9f016647-7727-42e8-afe9-04f303e6c862)

### Versi Testing 80/40

For this test, we compared the master branch with the branch from
#5616. The second branch
is based on the current one but removes backing jobs that have exceeded
their time limits. We excluded malicious nodes to reduce noise from
disputing and banning validators. The results show that, under the same
load, nodes experience less finality lag and reduced recovery and check
time. Even parachains are functioning with a shorter block time,
although it remains over 6 seconds.

Testing Parameters:
-  80 validators (0 malicious)
-  40 gluttons with 2 seconds of PVF execution time
-  6 VRF modulo samples
-  30 required approvals


![image](https://github.com/user-attachments/assets/42bcc845-9115-4ae3-9910-286b77a60bbf)

---------

Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
  • Loading branch information
AndreiEres and sandreim authored Oct 9, 2024
1 parent 90ff47d commit e294d62
Show file tree
Hide file tree
Showing 22 changed files with 539 additions and 72 deletions.
3 changes: 3 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 3 additions & 3 deletions polkadot/node/core/approval-voting/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@ use polkadot_node_subsystem::{
ApprovalVotingMessage, AssignmentCheckError, AssignmentCheckResult,
AvailabilityRecoveryMessage, BlockDescription, CandidateValidationMessage, ChainApiMessage,
ChainSelectionMessage, CheckedIndirectAssignment, CheckedIndirectSignedApprovalVote,
DisputeCoordinatorMessage, HighestApprovedAncestorBlock, RuntimeApiMessage,
DisputeCoordinatorMessage, HighestApprovedAncestorBlock, PvfExecKind, RuntimeApiMessage,
RuntimeApiRequest,
},
overseer, FromOrchestra, OverseerSignal, SpawnedSubsystem, SubsystemError, SubsystemResult,
Expand All @@ -53,8 +53,8 @@ use polkadot_node_subsystem_util::{
};
use polkadot_primitives::{
ApprovalVoteMultipleCandidates, ApprovalVotingParams, BlockNumber, CandidateHash,
CandidateIndex, CandidateReceipt, CoreIndex, ExecutorParams, GroupIndex, Hash, PvfExecKind,
SessionIndex, SessionInfo, ValidatorId, ValidatorIndex, ValidatorPair, ValidatorSignature,
CandidateIndex, CandidateReceipt, CoreIndex, ExecutorParams, GroupIndex, Hash, SessionIndex,
SessionInfo, ValidatorId, ValidatorIndex, ValidatorPair, ValidatorSignature,
};
use sc_keystore::LocalKeystore;
use sp_application_crypto::Pair;
Expand Down
1 change: 1 addition & 0 deletions polkadot/node/core/backing/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@ futures = { workspace = true }
sp-keystore = { workspace = true, default-features = true }
polkadot-primitives = { workspace = true, default-features = true }
polkadot-node-primitives = { workspace = true, default-features = true }
polkadot-parachain-primitives = { workspace = true, default-features = true }
polkadot-node-subsystem = { workspace = true, default-features = true }
polkadot-node-subsystem-util = { workspace = true, default-features = true }
polkadot-erasure-coding = { workspace = true, default-features = true }
Expand Down
17 changes: 12 additions & 5 deletions polkadot/node/core/backing/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -89,8 +89,9 @@ use polkadot_node_subsystem::{
AvailabilityDistributionMessage, AvailabilityStoreMessage, CanSecondRequest,
CandidateBackingMessage, CandidateValidationMessage, CollatorProtocolMessage,
HypotheticalCandidate, HypotheticalMembershipRequest, IntroduceSecondedCandidateRequest,
ProspectiveParachainsMessage, ProvisionableData, ProvisionerMessage, RuntimeApiMessage,
RuntimeApiRequest, StatementDistributionMessage, StoreAvailableDataError,
ProspectiveParachainsMessage, ProvisionableData, ProvisionerMessage, PvfExecKind,
RuntimeApiMessage, RuntimeApiRequest, StatementDistributionMessage,
StoreAvailableDataError,
},
overseer, ActiveLeavesUpdate, FromOrchestra, OverseerSignal, SpawnedSubsystem, SubsystemError,
};
Expand All @@ -105,12 +106,13 @@ use polkadot_node_subsystem_util::{
},
Validator,
};
use polkadot_parachain_primitives::primitives::IsSystem;
use polkadot_primitives::{
node_features::FeatureIndex, BackedCandidate, CandidateCommitments, CandidateHash,
CandidateReceipt, CommittedCandidateReceipt, CoreIndex, CoreState, ExecutorParams, GroupIndex,
GroupRotationInfo, Hash, Id as ParaId, IndexedVec, NodeFeatures, PersistedValidationData,
PvfExecKind, SessionIndex, SigningContext, ValidationCode, ValidatorId, ValidatorIndex,
ValidatorSignature, ValidityAttestation,
SessionIndex, SigningContext, ValidationCode, ValidatorId, ValidatorIndex, ValidatorSignature,
ValidityAttestation,
};
use polkadot_statement_table::{
generic::AttestedCandidate as TableAttestedCandidate,
Expand Down Expand Up @@ -625,6 +627,7 @@ async fn request_candidate_validation(
executor_params: ExecutorParams,
) -> Result<ValidationResult, Error> {
let (tx, rx) = oneshot::channel();
let is_system = candidate_receipt.descriptor.para_id.is_system();

sender
.send_message(CandidateValidationMessage::ValidateFromExhaustive {
Expand All @@ -633,7 +636,11 @@ async fn request_candidate_validation(
candidate_receipt,
pov,
executor_params,
exec_kind: PvfExecKind::Backing,
exec_kind: if is_system {
PvfExecKind::BackingSystemParas
} else {
PvfExecKind::Backing
},
response_sender: tx,
})
.await;
Expand Down
22 changes: 11 additions & 11 deletions polkadot/node/core/backing/src/tests/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@ use polkadot_node_subsystem::{
use polkadot_node_subsystem_test_helpers as test_helpers;
use polkadot_primitives::{
node_features, CandidateDescriptor, GroupRotationInfo, HeadData, PersistedValidationData,
PvfExecKind, ScheduledCore, SessionIndex, LEGACY_MIN_BACKING_VOTES,
ScheduledCore, SessionIndex, LEGACY_MIN_BACKING_VOTES,
};
use polkadot_primitives_test_helpers::{
dummy_candidate_receipt_bad_sig, dummy_collator, dummy_collator_signature,
Expand Down Expand Up @@ -434,7 +434,7 @@ async fn assert_validate_from_exhaustive(
) if validation_data == *assert_pvd &&
validation_code == *assert_validation_code &&
*pov == *assert_pov && &candidate_receipt.descriptor == assert_candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_receipt.commitments_hash == assert_candidate.commitments.hash() =>
{
response_sender.send(Ok(ValidationResult::Valid(
Expand Down Expand Up @@ -651,7 +651,7 @@ fn backing_works(#[case] elastic_scaling_mvp: bool) {
) if validation_data == pvd_ab &&
validation_code == validation_code_ab &&
*pov == pov_ab && &candidate_receipt.descriptor == candidate_a.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_receipt.commitments_hash == candidate_a_commitments_hash =>
{
response_sender.send(Ok(
Expand Down Expand Up @@ -1287,7 +1287,7 @@ fn backing_works_while_validation_ongoing() {
) if validation_data == pvd_abc &&
validation_code == validation_code_abc &&
*pov == pov_abc && &candidate_receipt.descriptor == candidate_a.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_a_commitments_hash == candidate_receipt.commitments_hash =>
{
// we never validate the candidate. our local node
Expand Down Expand Up @@ -1454,7 +1454,7 @@ fn backing_misbehavior_works() {
) if validation_data == pvd_a &&
validation_code == validation_code_a &&
*pov == pov_a && &candidate_receipt.descriptor == candidate_a.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_a_commitments_hash == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(
Expand Down Expand Up @@ -1621,7 +1621,7 @@ fn backing_dont_second_invalid() {
) if validation_data == pvd_a &&
validation_code == validation_code_a &&
*pov == pov_block_a && &candidate_receipt.descriptor == candidate_a.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_a.commitments.hash() == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(ValidationResult::Invalid(InvalidCandidate::BadReturn))).unwrap();
Expand Down Expand Up @@ -1661,7 +1661,7 @@ fn backing_dont_second_invalid() {
) if validation_data == pvd_b &&
validation_code == validation_code_b &&
*pov == pov_block_b && &candidate_receipt.descriptor == candidate_b.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_b.commitments.hash() == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(
Expand Down Expand Up @@ -1788,7 +1788,7 @@ fn backing_second_after_first_fails_works() {
) if validation_data == pvd_a &&
validation_code == validation_code_a &&
*pov == pov_a && &candidate_receipt.descriptor == candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate.commitments.hash() == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(ValidationResult::Invalid(InvalidCandidate::BadReturn))).unwrap();
Expand Down Expand Up @@ -1932,7 +1932,7 @@ fn backing_works_after_failed_validation() {
) if validation_data == pvd_a &&
validation_code == validation_code_a &&
*pov == pov_a && &candidate_receipt.descriptor == candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate.commitments.hash() == candidate_receipt.commitments_hash =>
{
response_sender.send(Err(ValidationFailed("Internal test error".into()))).unwrap();
Expand Down Expand Up @@ -2211,7 +2211,7 @@ fn retry_works() {
) if validation_data == pvd_a &&
validation_code == validation_code_a &&
*pov == pov_a && &candidate_receipt.descriptor == candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate.commitments.hash() == candidate_receipt.commitments_hash
);
virtual_overseer
Expand Down Expand Up @@ -2753,7 +2753,7 @@ fn validator_ignores_statements_from_disabled_validators() {
) if validation_data == pvd &&
validation_code == expected_validation_code &&
*pov == expected_pov && &candidate_receipt.descriptor == candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate_commitments_hash == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -276,7 +276,7 @@ async fn assert_validate_seconded_candidate(
&validation_code == assert_validation_code &&
&*pov == assert_pov &&
&candidate_receipt.descriptor == candidate.descriptor() &&
exec_kind == PvfExecKind::Backing &&
exec_kind == PvfExecKind::BackingSystemParas &&
candidate.commitments.hash() == candidate_receipt.commitments_hash =>
{
response_sender.send(Ok(ValidationResult::Valid(
Expand Down
41 changes: 27 additions & 14 deletions polkadot/node/core/candidate-validation/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -31,8 +31,8 @@ use polkadot_node_primitives::{InvalidCandidate, PoV, ValidationResult};
use polkadot_node_subsystem::{
errors::RuntimeApiError,
messages::{
CandidateValidationMessage, PreCheckOutcome, RuntimeApiMessage, RuntimeApiRequest,
ValidationFailed,
CandidateValidationMessage, PreCheckOutcome, PvfExecKind, RuntimeApiMessage,
RuntimeApiRequest, ValidationFailed,
},
overseer, FromOrchestra, OverseerSignal, SpawnedSubsystem, SubsystemError, SubsystemResult,
SubsystemSender,
Expand All @@ -46,8 +46,9 @@ use polkadot_primitives::{
DEFAULT_LENIENT_PREPARATION_TIMEOUT, DEFAULT_PRECHECK_PREPARATION_TIMEOUT,
},
AuthorityDiscoveryId, CandidateCommitments, CandidateDescriptor, CandidateEvent,
CandidateReceipt, ExecutorParams, Hash, PersistedValidationData, PvfExecKind, PvfPrepKind,
SessionIndex, ValidationCode, ValidationCodeHash, ValidatorId,
CandidateReceipt, ExecutorParams, Hash, PersistedValidationData,
PvfExecKind as RuntimePvfExecKind, PvfPrepKind, SessionIndex, ValidationCode,
ValidationCodeHash, ValidatorId,
};
use sp_application_crypto::{AppCrypto, ByteArray};
use sp_keystore::KeystorePtr;
Expand Down Expand Up @@ -667,9 +668,9 @@ async fn validate_candidate_exhaustive(
let result = match exec_kind {
// Retry is disabled to reduce the chance of nondeterministic blocks getting backed and
// honest backers getting slashed.
PvfExecKind::Backing => {
PvfExecKind::Backing | PvfExecKind::BackingSystemParas => {
let prep_timeout = pvf_prep_timeout(&executor_params, PvfPrepKind::Prepare);
let exec_timeout = pvf_exec_timeout(&executor_params, exec_kind);
let exec_timeout = pvf_exec_timeout(&executor_params, exec_kind.into());
let pvf = PvfPrepData::from_code(
validation_code.0,
executor_params,
Expand All @@ -683,20 +684,22 @@ async fn validate_candidate_exhaustive(
exec_timeout,
persisted_validation_data.clone(),
pov,
polkadot_node_core_pvf::Priority::Normal,
exec_kind.into(),
exec_kind,
)
.await
},
PvfExecKind::Approval =>
PvfExecKind::Approval | PvfExecKind::Dispute =>
validation_backend
.validate_candidate_with_retry(
validation_code.0,
pvf_exec_timeout(&executor_params, exec_kind),
pvf_exec_timeout(&executor_params, exec_kind.into()),
persisted_validation_data.clone(),
pov,
executor_params,
PVF_APPROVAL_EXECUTION_RETRY_DELAY,
polkadot_node_core_pvf::Priority::Critical,
exec_kind.into(),
exec_kind,
)
.await,
};
Expand Down Expand Up @@ -784,6 +787,8 @@ trait ValidationBackend {
pov: Arc<PoV>,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
// The kind for the execution job.
exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError>;

/// Tries executing a PVF. Will retry once if an error is encountered that may have
Expand All @@ -804,6 +809,8 @@ trait ValidationBackend {
retry_delay: Duration,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
// The kind for the execution job.
exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError> {
let prep_timeout = pvf_prep_timeout(&executor_params, PvfPrepKind::Prepare);
// Construct the PVF a single time, since it is an expensive operation. Cloning it is cheap.
Expand All @@ -825,6 +832,7 @@ trait ValidationBackend {
pvd.clone(),
pov.clone(),
prepare_priority,
exec_kind,
)
.await;
if validation_result.is_ok() {
Expand Down Expand Up @@ -905,6 +913,7 @@ trait ValidationBackend {
pvd.clone(),
pov.clone(),
prepare_priority,
exec_kind,
)
.await;
}
Expand All @@ -929,9 +938,13 @@ impl ValidationBackend for ValidationHost {
pov: Arc<PoV>,
// The priority for the preparation job.
prepare_priority: polkadot_node_core_pvf::Priority,
// The kind for the execution job.
exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError> {
let (tx, rx) = oneshot::channel();
if let Err(err) = self.execute_pvf(pvf, exec_timeout, pvd, pov, prepare_priority, tx).await
if let Err(err) = self
.execute_pvf(pvf, exec_timeout, pvd, pov, prepare_priority, exec_kind, tx)
.await
{
return Err(InternalValidationError::HostCommunication(format!(
"cannot send pvf to the validation host, it might have shut down: {:?}",
Expand Down Expand Up @@ -1023,12 +1036,12 @@ fn pvf_prep_timeout(executor_params: &ExecutorParams, kind: PvfPrepKind) -> Dura
/// This should be much longer than the backing execution timeout to ensure that in the
/// absence of extremely large disparities between hardware, blocks that pass backing are
/// considered executable by approval checkers or dispute participants.
fn pvf_exec_timeout(executor_params: &ExecutorParams, kind: PvfExecKind) -> Duration {
fn pvf_exec_timeout(executor_params: &ExecutorParams, kind: RuntimePvfExecKind) -> Duration {
if let Some(timeout) = executor_params.pvf_exec_timeout(kind) {
return timeout
}
match kind {
PvfExecKind::Backing => DEFAULT_BACKING_EXECUTION_TIMEOUT,
PvfExecKind::Approval => DEFAULT_APPROVAL_EXECUTION_TIMEOUT,
RuntimePvfExecKind::Backing => DEFAULT_BACKING_EXECUTION_TIMEOUT,
RuntimePvfExecKind::Approval => DEFAULT_APPROVAL_EXECUTION_TIMEOUT,
}
}
4 changes: 4 additions & 0 deletions polkadot/node/core/candidate-validation/src/tests.rs
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@
use std::sync::atomic::{AtomicUsize, Ordering};

use super::*;
use crate::PvfExecKind;
use assert_matches::assert_matches;
use futures::executor;
use polkadot_node_core_pvf::PrepareError;
Expand Down Expand Up @@ -441,6 +442,7 @@ impl ValidationBackend for MockValidateCandidateBackend {
_pvd: Arc<PersistedValidationData>,
_pov: Arc<PoV>,
_prepare_priority: polkadot_node_core_pvf::Priority,
_exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError> {
// This is expected to panic if called more times than expected, indicating an error in the
// test.
Expand Down Expand Up @@ -1023,6 +1025,7 @@ impl ValidationBackend for MockPreCheckBackend {
_pvd: Arc<PersistedValidationData>,
_pov: Arc<PoV>,
_prepare_priority: polkadot_node_core_pvf::Priority,
_exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError> {
unreachable!()
}
Expand Down Expand Up @@ -1177,6 +1180,7 @@ impl ValidationBackend for MockHeadsUp {
_pvd: Arc<PersistedValidationData>,
_pov: Arc<PoV>,
_prepare_priority: polkadot_node_core_pvf::Priority,
_exec_kind: PvfExecKind,
) -> Result<WasmValidationResult, ValidationError> {
unreachable!()
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,13 +27,11 @@ use futures_timer::Delay;

use polkadot_node_primitives::ValidationResult;
use polkadot_node_subsystem::{
messages::{AvailabilityRecoveryMessage, CandidateValidationMessage},
messages::{AvailabilityRecoveryMessage, CandidateValidationMessage, PvfExecKind},
overseer, ActiveLeavesUpdate, RecoveryError,
};
use polkadot_node_subsystem_util::runtime::get_validation_code_by_hash;
use polkadot_primitives::{
BlockNumber, CandidateHash, CandidateReceipt, Hash, PvfExecKind, SessionIndex,
};
use polkadot_primitives::{BlockNumber, CandidateHash, CandidateReceipt, Hash, SessionIndex};

use crate::LOG_TARGET;

Expand Down Expand Up @@ -387,7 +385,7 @@ async fn participate(
candidate_receipt: req.candidate_receipt().clone(),
pov: available_data.pov,
executor_params: req.executor_params(),
exec_kind: PvfExecKind::Approval,
exec_kind: PvfExecKind::Dispute,
response_sender: validation_tx,
})
.await;
Expand Down
Loading

0 comments on commit e294d62

Please sign in to comment.