This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Add missing fields to the light sync state #7225

Merged
18 commits merged into master from ashley-improve-sync-state on Oct 15, 2020

Conversation

@expenses (Contributor) commented Sep 28, 2020

Part of #6804.

polkadot companion: paritytech/polkadot#1801

@github-actions github-actions bot added the A3-in_progress Pull request is in progress. No review needed at this stage. label Sep 28, 2020

Ok(sc_chain_spec::LightSyncState {
    header,
    finalized_block_header: finalized_header,
    babe_finalized_block1_slot_number: 0,
@expenses (Contributor Author): Need a way to get this as well.

@expenses expenses added B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Sep 28, 2020
@andresilva (Contributor) left a comment:

Not sure this is the way to go. I'll try to provide context on what the problem is and how we should go about fixing it.

Both BABE and GRANDPA maintain some data structures relating to the management of epochs (i.e. periods in which a given set of authorities is active). There's no need to go into the details of the logic behind those data structures, but the key point is that they are populated as part of processing blocks. The data structures are:

- EpochChanges (BABE)
- AuthoritySet (GRANDPA)

In the scope of the light client, where we skip an initial sync and start from a block defined in the chainspec, this means that these data structures will not be populated with any data. As such, both BabeBlockImport and GrandpaBlockImport will fail to validate subsequent blocks (e.g. the errors you were seeing regarding BABE not finding epoch data). The solution is that we need to persist the data that exists in EpochChanges and AuthoritySet when we are creating the "light sync state", which we can then use when restoring from that state.

IMO as an initial implementation we can start by just dumping the data structures completely and persisting the whole thing (they are SCALE-encodable, as we already persist them to the database). Once that is done we can prune these data structures to remove any unnecessary data (so that we don't persist everything). Both data structures are easily available as part of service initialization: https://github.com/paritytech/substrate/blob/master/bin/node/cli/src/service.rs#L120 and https://github.com/paritytech/substrate/blob/master/bin/node/cli/src/service.rs#L112.
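
As a rough illustration of "dumping the data structures completely", the sync-state builder could simply clone the shared BABE and GRANDPA handles into the chain-spec struct. This is only a sketch under the assumption that the service exposes such handles and that LightSyncState has matching fields; the import paths, generics and helper names below are illustrative rather than the exact Substrate API:

```rust
// Sketch only: clone the whole shared BABE/GRANDPA state into the light sync
// state. The paths, type parameters and field names are assumptions standing
// in for the real Substrate types, not a verified API.
use sc_chain_spec::LightSyncState;            // assumed to exist per this PR
use sc_consensus_babe::Epoch;                 // assumed path
use sc_consensus_epochs::SharedEpochChanges;  // assumed path
use sc_finality_grandpa::SharedAuthoritySet;  // assumed path
use sp_runtime::traits::{Block as BlockT, NumberFor};

fn build_light_sync_state<Block: BlockT>(
	finalized_header: Block::Header,
	epoch_changes: &SharedEpochChanges<Block, Epoch>,
	authority_set: &SharedAuthoritySet<Block::Hash, NumberFor<Block>>,
) -> LightSyncState<Block> {
	LightSyncState {
		finalized_block_header: finalized_header,
		// Persist the complete data structures for now; prune them later.
		babe_epoch_changes: epoch_changes.lock().clone(),
		// Assumed helper that clones the set held behind the shared lock.
		grandpa_authority_set: authority_set.clone_inner(),
	}
}
```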

@andresilva (Contributor) commented Sep 29, 2020

We could persist all the data that is needed in separate fields (as seems to be the intention in this PR so far), but we'd still need to create EpochChanges and AuthoritySet data structures from that data afterwards. Given that, I think it's easier to just persist EpochChanges and AuthoritySet to start with.

@expenses (Contributor Author):

@andresilva done. The sync state increases the size of the chain spec by about 700kb, but that's not too bad given that it's 3.8mb to begin with.

@expenses (Contributor Author):

@andresilva How should we be getting the babe slot number for the first block? I thought that you'd be able to get this from the epoch changes, but you can't because it gets pruned.

@andresilva (Contributor):

> @andresilva done. The sync state increases the size of the chain spec by about 700kb, but that's not too bad given that it's 3.8mb to begin with.

Yeah, we'll have at least two lists of authority sets with public keys in there (BABE and GRANDPA). We might be able to prune some of the data you're seeing though (but let's worry about this later).

> @andresilva How should we be getting the babe slot number for the first block? I thought that you'd be able to get this from the epoch changes, but you can't because it gets pruned.

We can get it from BABE's runtime module https://github.com/paritytech/substrate/blob/master/frame/babe/src/lib.rs#L169. But do we need it? I know that Pierre mentioned this in the data we need to persist, but if we are persisting the latest epoch data I'm not sure we need this. At least the client code in substrate does not use this in any way, I believe this would only be needed in the substrate-lite implementation.

@expenses (Contributor Author):

@andresilva we're trying to support substrate-lite here, but @tomaka if we don't use it in substrate, why is it needed in substrate-lite?

@expenses (Contributor Author):

Also @tomaka am I right in thinking that babe_finalized_block_epoch_information is information about the epoch that contains the finalized header? We won't be able to get this from EpochChanges because it gets pruned once finalized.

@tomaka (Contributor) commented Oct 1, 2020

> We can get it from BABE's runtime module https://github.com/paritytech/substrate/blob/master/frame/babe/src/lib.rs#L169. But do we need it? I know that Pierre mentioned this in the data we need to persist, but if we are persisting the latest epoch data I'm not sure we need this.

I use it in substrate-lite to determine at which slot an epoch starts and ends.
The slot number of block 1 is the slot at which epoch 0 starts. Then epoch 1 starts at slot_of_block_1 + slots_per_epoch, and so on. I can calculate when epoch N starts and ends just by knowing N and slot_of_block_1.
I guess that in Substrate the start slot number is stored in the epoch's information?
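
A hedged sketch of that calculation (the numbers below are made up purely for illustration; slots_per_epoch and the slot of block 1 would come from the chain):

```rust
/// Illustration only: with a constant number of slots per epoch, epoch `n`
/// starts at `slot_of_block_1 + n * slots_per_epoch` and ends one epoch later.
fn epoch_slot_range(slot_of_block_1: u64, slots_per_epoch: u64, n: u64) -> (u64, u64) {
	let start = slot_of_block_1 + n * slots_per_epoch;
	let end = start + slots_per_epoch - 1; // last slot that still belongs to epoch `n`
	(start, end)
}

fn main() {
	// e.g. block 1 at slot 1_000 and 2_400 slots per epoch: epoch 2 covers slots 5_800..=8_199
	assert_eq!(epoch_slot_range(1_000, 2_400, 2), (5_800, 8_199));
}
```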

> Also @tomaka am I right in thinking that babe_finalized_block_epoch_information is information about the epoch that contains the finalized header?

Indeed. If a block uses the last slot of an epoch, we know that the next block will always be in a different epoch. However, since not all slots necessarily have a block, it is possible for a block to not use the last slot of an epoch, yet the next block is in a different epoch anyway.

@expenses (Contributor Author) commented Oct 1, 2020

> I guess that in Substrate the start slot number is stored in the epoch's information?

Yep: https://substrate.dev/rustdocs/v2.0.0-rc6/sc_consensus_babe/struct.Epoch.html#structfield.start_slot

@tomaka (Contributor) commented Oct 1, 2020

This is becoming off-topic, but I would need to learn a bit more about the long term plan concerning Babe slots. I know we're supposed to have some small NTP-like protocol, or something like that, but I don't know the details.

Using the slot of block 1 also assumes that the number of slots per epoch never changes. This is true at the moment, but maybe we'll add this possibility in the future?

@tomaka (Contributor) commented Oct 1, 2020

Anyway, my point is that this PR should do the thing that makes sense, not the thing that brings compatibility with substrate-lite. It's substrate-lite which has to adjust. But it's unclear to me what makes sense.

@expenses (Contributor Author) commented Oct 1, 2020

Ok, I'm making progress getting the sync state to load in substrate (not lite). Now I'm getting:

Oct 01 15:10:15.402  WARN 💔 Error importing block 0x3bf33ab015a8ae4ff09c8bc6e4361c82347cc87fefbba3478ee3b29cb0b71da1: Err(Other(ClientImport("Parent block of 0x3bf3…1da1 has no associated weight")))

@andresilva Should we be saving BABE block weights into the sync state as well?

client/chain-spec/src/lib.rs (outdated review thread, resolved)
client/service/src/builder.rs (outdated review thread, resolved)
@expenses (Contributor Author):

@andresilva on ashley-improve-sync-state-WIP-loading I'm now getting a lot of this warning:

2020-10-12 14:25:18 💔 Error importing block 0x8ce0f445b3f00a3585b5571492cbda40bc04bb09f64179e0723628a6bf9448a7: Err(Other(ClientImport("Unexpected epoch change")))

@expenses (Contributor Author):

Ok, so I've updated ashley-improve-sync-state-WIP-loading to not generate CHTs in LightStorage::note_finalized and it now syncs the provided chain spec with a sync state for block #118347 to the head without panicking.

@expenses (Contributor Author) commented Oct 14, 2020

It is still panicking on a second chain spec on branch ashley-improve-sync-state-WIP-loading-panicking that starts from block #198161.

@expenses (Contributor Author):

This PR is ready and I'd like to get it merged soon. The two things to check in reviews are:

  1. Are we completely sure that the sync state contains everything we need?
  2. Is the way I've set up this RPC call correct?

@andresilva (Contributor) left a comment:

just a minor nit but overall lgtm.

@@ -53,7 +53,7 @@ impl<H, N> Clone for SharedAuthoritySet<H, N> {

 impl<H, N> SharedAuthoritySet<H, N> {
 	/// Acquire a reference to the inner read-write lock.
-	pub(crate) fn inner(&self) -> &RwLock<AuthoritySet<H, N>> {
+	pub fn inner(&self) -> &RwLock<AuthoritySet<H, N>> {
@andresilva (Contributor):

Do you think we can instead add a method clone_inner() or something like that, which clones the inner AuthoritySet and returns it? It is safer not to expose the lock outside this crate.
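
A minimal sketch of what such a clone_inner() might look like, assuming the shared set wraps an RwLock around the AuthoritySet (the field name and trait bounds here are assumptions for illustration, not the exact upstream code):

```rust
impl<H: Clone, N: Clone> SharedAuthoritySet<H, N> {
	/// Clone the inner `AuthoritySet` instead of exposing the lock itself,
	/// so the `RwLock` never leaks outside this crate.
	pub fn clone_inner(&self) -> AuthoritySet<H, N> {
		// `inner` is assumed to be the lock-protected set behind this handle.
		self.inner.read().clone()
	}
}
```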

@expenses (Contributor Author):

👍

@andresilva (Contributor) commented Oct 15, 2020

> I thought that there could only be one pending change at a time on the finalized chain?

There can be multiple pending changes on the unfinalized chain (if GRANDPA has stalled). On Kusama we will be rotating the session every hour, which potentially leads to an authority set change in GRANDPA. Since the changes are only enacted on finality, if GRANDPA stops working for 6h we can end up with 6 pending authority set changes, all on the same branch (which we then need to finalize in order afterwards). I probably didn't do a very good job of explaining how this works to you previously (which is why I need to write some docs).

@tomaka (Contributor) commented Oct 15, 2020

> on the unfinalized chain

But this PR is only about returning the state of the finalized chain, no? A data structure containing the state of the finalized chain doesn't have to contain more than one pending change.

@expenses (Contributor Author):

@tomaka Well no, at the moment we just encode the whole AuthoritySet at the time of the RPC call. If finality is lagging behind significantly, then it will contain unfinalized pending changes. In the future we could trim this down.

@expenses (Contributor Author):

Ok, I'd like to get another approval and then I'll merge this.

@andresilva (Contributor):

> @tomaka Well no, at the moment we just encode the whole AuthoritySet at the time of the RPC call. If finality is lagging behind significantly, then it will contain unfinalized pending changes. In the future we could trim this down.

To get something working I think it was easier to just persist BABE's EpochChanges and GRANDPA's AuthoritySet data structures, because we can then just blindly encode/decode and have something working. This has some disadvantages though: it includes unnecessary data and it's harder to integrate with other clients (e.g. substrate-lite doesn't use these data structures so cannot easily use these fields). I think the next step should be to trim them down to the bare essentials, in which case we will probably end up only persisting those pieces of data and not the complete data structures (e.g. for GRANDPA it would be the set id, the current authorities and any pending standard change that may exist). But IMO this can be done in a follow-up PR.

tldr: the actual sync state data will probably change after this PR, IMO this is fine as we can get something working in the meantime without having to deal with all the details of what needs to be persisted.

@expenses (Contributor Author) commented Oct 15, 2020

> tldr: the actual sync state data will probably change after this PR, IMO this is fine as we can get something working in the meantime without having to deal with all the details of what needs to be persisted.

As long as the types don't change, I'm okay with this.

Actually, I suppose that's fine.

@expenses (Contributor Author):

bot merge

@ghost commented Oct 15, 2020

Trying merge.

@ghost ghost merged commit 65cc9af into master Oct 15, 2020
@ghost ghost deleted the ashley-improve-sync-state branch October 15, 2020 10:10
@andresilva (Contributor):

> tldr: the actual sync state data will probably change after this PR, IMO this is fine as we can get something working in the meantime without having to deal with all the details of what needs to be persisted.
>
> As long as the types don't change, I'm okay with this.

If we want to make it easier for substrate-lite to consume, then we probably want to change the types (to "plain" data, as you had initially). TBH I wasn't thinking about interoperability when I suggested persisting EpochChanges and AuthoritySet, just about the easiest/fastest way to get the sync state to work. But IMO if we're going to make the Substrate light client work with the sync state first (and look into substrate-lite afterwards) then the current approach is OK, since there are some unknowns and we will probably only be sure once we have the whole thing working properly.

finalized_block_header: finalized_header,
babe_epoch_changes: self.shared_epoch_changes.lock().clone(),
babe_finalized_block_weight: finalized_block_weight,
grandpa_authority_set: self.shared_authority_set.clone_inner(),
Contributor:

Can these (the GRANDPA set or the BABE epochs) be pruned concurrently?
Btw, even if it's possible, the race seems highly improbable (probably when the code gets more generic there will be a trait method like synch_state_at(block_hash) -> Option<S> for Substrate components that require sync state, and then this will not be a possibility).

Contributor:

It should be fine since both BABE and GRANDPA will acquire a lock when writing to these data structures (e.g. if there's a block being imported).

Contributor:

So maybe we should lock both structures before getting the finalized block hash? (It would make things easier to read, at least.)
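
A toy sketch of that suggestion using plain std types, purely to illustrate the ordering "take both consensus locks first, then read the finalized hash from the same consistent view"; the real builder would use the shared BABE/GRANDPA handles from the snippet above instead:

```rust
use std::sync::Mutex;

// Toy stand-ins for the real shared consensus state.
struct EpochChanges(Vec<u64>);
struct AuthoritySet(Vec<u8>);

fn snapshot(
	epoch_changes: &Mutex<EpochChanges>,
	authority_set: &Mutex<AuthoritySet>,
	finalized_hash: &Mutex<[u8; 32]>,
) -> (Vec<u64>, Vec<u8>, [u8; 32]) {
	// Hold both consensus locks first, so a concurrent block import cannot
	// prune epochs or enact an authority-set change between the reads...
	let epochs = epoch_changes.lock().unwrap();
	let set = authority_set.lock().unwrap();
	// ...and only then read the finalized block hash.
	let hash = *finalized_hash.lock().unwrap();
	(epochs.0.clone(), set.0.clone(), hash)
}
```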

This pull request was closed.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.

5 participants