
Add domain snap sync algorithm #3027

Closed · wants to merge 13 commits

Conversation

@shamil-gadelshin (Member) commented Sep 16, 2024

This PR introduces several important pieces of the domain snap sync implementation (#3026): the snap sync orchestrator and the domain snap sync algorithm. It continues the discussion of the algorithm implementation and highlights the current decisions. It lacks the final integration in the code on both the consensus and domain chain sides, proper configuration changes, and several security guarantees discussed previously.

The first commit introduces the SnapSyncOrchestrator, a synchronization manager that correctly orders processes in both the consensus and domain chains. Commits 2-4 modify the existing code and add a structure to pass to the domain snap sync algorithm introduced in commit 5. The last commit (6) contains an updated Cargo.lock, placed separately to simplify the review.
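As a rough illustration of what such an orchestrator does, here is a minimal blocking sketch. The type name is taken from the PR, but the fields, methods, and synchronous style are assumptions for the sake of a self-contained example (the real implementation is presumably async):

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical sketch of a cross-chain sync coordinator: the domain side
/// unblocks the consensus chain snap sync once it knows the target block.
#[derive(Default)]
struct SnapSyncOrchestrator {
    unblocked: Mutex<bool>,
    signal: Condvar,
}

impl SnapSyncOrchestrator {
    /// Called from the domain side once the target block is known.
    fn unblock_consensus_snap_sync(&self) {
        *self.unblocked.lock().unwrap() = true;
        self.signal.notify_all();
    }

    /// Called from the consensus side; blocks until allowed to proceed.
    fn wait_for_consensus_snap_sync(&self) {
        let mut unblocked = self.unblocked.lock().unwrap();
        while !*unblocked {
            unblocked = self.signal.wait(unblocked).unwrap();
        }
    }
}

fn main() {
    let orchestrator = Arc::new(SnapSyncOrchestrator::default());
    let waiter = Arc::clone(&orchestrator);
    // Consensus side waits; domain side releases it.
    let handle = std::thread::spawn(move || {
        waiter.wait_for_consensus_snap_sync();
        println!("consensus snap sync started");
    });
    orchestrator.unblock_consensus_snap_sync();
    handle.join().unwrap();
}
```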

Known future algorithm changes

  • change state block acquisition from downloading from remote peers to deriving it locally from the consensus block
  • the simple acquisition of the last confirmed domain block execution receipt must be replaced with a correct consensus protocol, similar to segment header acquisition
  • MMR data should be verified against the state (MMR roots)
  • introduce changes for the MMR sync algorithm to work with the "five segments lag" (see commit 3 and the FINALIZATION_DEPTH_IN_SEGMENTS constant): modify the MMR gadget to use archived blocks instead of finalized blocks

Code contributor checklist:

@teor2345 (Contributor) left a comment

Looks good, but I don’t understand the snap sync algorithm enough to fully review it.

Cargo.lock (review thread outdated, resolved)
domains/client/domain-operator/src/snap_sync.rs (review thread outdated, resolved)
Base automatically changed from modify-mmr-sync to main September 17, 2024 09:24
@teor2345 (Contributor) left a comment

Thanks for the updates. I don't think I'm familiar enough with the algorithm to approve, but it looks reasonable to me.

crates/subspace-service/src/sync_from_dsn/snap_sync.rs (review thread outdated, resolved)
```rust
}

/// Wait for the allowing signal for the consensus chain snap sync.
pub async fn consensus_snap_sync_unblocked(&self) {
```
A reviewer (Member) commented:

`SubspaceLink::block_importing_notification_stream()`

@shamil-gadelshin (Member, Author) replied:

AFAIR, we discussed offline the possibility of changing the current algorithm with the sync orchestrator from a blocking to a reactive approach by utilizing SubspaceLink::block_importing_notification_stream() and its ability to acknowledge blocks. I tried that approach and deleted several initial synchronization points instead.
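The reactive idea can be sketched with plain channels. BlockImportingNotification and its ack field here are assumptions modeled on the description of SubspaceLink::block_importing_notification_stream(), not the actual Subspace types:

```rust
use std::sync::mpsc;

/// Hypothetical notification: the importer announces a block and waits for an
/// acknowledgement before proceeding with the import.
struct BlockImportingNotification {
    block_number: u32,
    ack: mpsc::Sender<()>,
}

fn main() {
    let (notify_tx, notify_rx) = mpsc::channel::<BlockImportingNotification>();

    // Subscriber (e.g. the domain side): reacts to each import, then releases
    // the importer by acknowledging the notification.
    let subscriber = std::thread::spawn(move || {
        for notification in notify_rx {
            println!("reacting to block {}", notification.block_number);
            notification.ack.send(()).unwrap();
        }
    });

    // Importer (consensus side): announces the block and waits for the ack.
    for block_number in 1..=3u32 {
        let (ack_tx, ack_rx) = mpsc::channel();
        notify_tx
            .send(BlockImportingNotification { block_number, ack: ack_tx })
            .unwrap();
        ack_rx.recv().unwrap(); // import continues only after acknowledgement
    }
    drop(notify_tx); // closing the stream lets the subscriber finish
    subscriber.join().unwrap();
}
```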

A reviewer (Member) commented:

Why not delete this one, then? My point was that ideally we wouldn't need this orchestrator at all.

@shamil-gadelshin (Member, Author) replied:

Consensus chain snap sync is part of the more complex domain snap sync process: in this form, it must start only after we acquire the correct target block. I removed the other blocking orchestrator points after our conversation; for example, we no longer need to send signals from consensus snap sync. However, it's not clear to me how to remove the dependency completely. Let's wait until the full solution is merged and revisit this; I'm open to a change here.

A reviewer (Member) commented:

Can you just have a mutex or oneshot channel or something that is passed down to subspace-service and prevents sync as such from starting? I don't think you need to block/unblock it many times anyway; just pause until something happens on the domain side, and it is not necessarily specific to snap sync either.

What else is this orchestrator needed for?
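The reviewer's suggestion can be sketched like this; start_sync and the channel wiring are hypothetical, shown only to illustrate a one-shot pause in place of an orchestrator:

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sync entry point: before doing anything, it waits once on a
/// receiver handed down (e.g. through subspace-service configuration).
fn start_sync(pause_until: mpsc::Receiver<()>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // Block until the domain side signals readiness; a dropped sender
        // also unblocks the sync, so it can never dead-lock forever.
        let _ = pause_until.recv();
        println!("sync started");
    })
}

fn main() {
    let (ready_tx, ready_rx) = mpsc::channel();
    let sync = start_sync(ready_rx);
    // ... the domain side determines the target block, then:
    ready_tx.send(()).unwrap();
    sync.join().unwrap();
}
```

Note that because the pause happens exactly once, a oneshot channel (or this single-use mpsc channel) is sufficient and no block/unblock cycle is needed.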

@shamil-gadelshin (Member, Author) replied:

The current implementation will contain a target block provider that hides the orchestrator in the full version:

```rust
pub trait SnapSyncTargetBlockProvider: Send + Sync {
    async fn target_block(&self) -> Option<BlockNumber>;
}
```

The default non-blocking implementation returns None, which is close to what you proposed. I tried to limit the scope of this PR, but it seems the whole solution will provide more context and be easier to review despite its size.
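A minimal sketch of the default non-blocking provider described above. BlockNumber, DefaultTargetBlockProvider, and the hand-rolled block_on helper are assumptions added to make the example self-contained (async fn in traits is stable since Rust 1.75; the real code may use #[async_trait] instead):

```rust
use std::future::Future;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

type BlockNumber = u32;

// The trait shape from the comment above.
trait SnapSyncTargetBlockProvider: Send + Sync {
    async fn target_block(&self) -> Option<BlockNumber>;
}

/// Assumed default: no target block, i.e. snap sync is never blocked on one.
struct DefaultTargetBlockProvider;

impl SnapSyncTargetBlockProvider for DefaultTargetBlockProvider {
    async fn target_block(&self) -> Option<BlockNumber> {
        None
    }
}

// Tiny polling helper so the example runs without an async runtime; the
// future produced by `target_block` above is immediately ready.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker {
        noop_raw_waker()
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

fn block_on<F: Future>(mut fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    // Safety: `fut` is shadowed by the pinned reference and never moved again.
    let mut fut = unsafe { std::pin::Pin::new_unchecked(&mut fut) };
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    assert_eq!(block_on(DefaultTargetBlockProvider.target_block()), None);
    println!("default provider returns None");
}
```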

@NingLin-P (Member) left a comment

Makes sense in general; I will take another look.

```rust
    network_request,
)?;

let last_block_from_sync_result = sync_engine.download_state().await;
```
A reviewer (Member) commented:

It may be possible that the state is too large to fit into memory, causing OOM. Not an immediate issue to fix, but we need to resolve it in the long term.

A reviewer (Member) replied:

There is an upstream issue for this: paritytech/polkadot-sdk#4

domains/client/domain-operator/src/snap_sync.rs (review thread outdated, resolved)
Comment on lines +220 to +224
```rust
if last_confirmed_block_receipt.domain_block_number == 0u32.into() {
    return Err(sp_blockchain::Error::Application(
        "Can't snap sync from genesis.".into(),
    ));
}
```
A reviewer (Member) commented:

This can happen if the domain is instantiated but has produced fewer than 14_400 blocks. Especially when we enable the domain in mainnet phase 2, the operator node will have to use full sync and sync from genesis.

Is it possible to avoid this error, either by downloading the genesis state from other peers or by deriving the genesis state from the consensus chain (after consensus sync is finished), as the domain bootstrapper currently does?

@shamil-gadelshin (Member, Author) replied:

AFAIK, it's not possible to insert the state after initialization at the moment. However, we could investigate a hybrid "consensus-snap plus domain-full" mode for such a case. Further research is required.

@shamil-gadelshin shamil-gadelshin marked this pull request as ready for review October 3, 2024 17:07
Comment on lines +328 to +331
```rust
// Skip last `FINALIZATION_DEPTH_IN_SEGMENTS` archived segments
.and_then(|max_segment_index| {
    max_segment_index.checked_sub(FINALIZATION_DEPTH_IN_SEGMENTS)
})
```
A reviewer (Member) commented:

We discussed that it is not actually necessary to download the older segment at FINALIZATION_DEPTH_IN_SEGMENTS, because it is equivalent to downloading the latest segment if the responder can do a little bit of custom logic, composing the necessary data from blocks that are technically not yet "finalized" from Substrate's point of view. What happened to that?
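For illustration, the segment-skipping arithmetic in the quoted snippet behaves like this; the value 5 for FINALIZATION_DEPTH_IN_SEGMENTS is assumed from the "five segments lag" mentioned in the PR description:

```rust
// Assumed value, taken from the "five segments lag" wording in the PR.
const FINALIZATION_DEPTH_IN_SEGMENTS: u64 = 5;

/// Skip the newest `FINALIZATION_DEPTH_IN_SEGMENTS` archived segments,
/// using `checked_sub` so that a node with too little history yields
/// `None` instead of underflowing.
fn target_segment(max_segment_index: Option<u64>) -> Option<u64> {
    max_segment_index.and_then(|max_segment_index| {
        max_segment_index.checked_sub(FINALIZATION_DEPTH_IN_SEGMENTS)
    })
}

fn main() {
    assert_eq!(target_segment(Some(12)), Some(7)); // enough history: lag by 5
    assert_eq!(target_segment(Some(3)), None); // fewer than 5 segments archived
    assert_eq!(target_segment(None), None); // no archived segments at all
    println!("ok");
}
```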

@shamil-gadelshin (Member, Author) replied Oct 4, 2024:

This is my current task; I will issue a separate PR similar to the other MMR sync updates.

@shamil-gadelshin (Member, Author):

Superseded by #3115
