Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

do not merge - CI warp sync test #12751

Closed
wants to merge 74 commits into from
Closed

Conversation

michalkucharczyk
Copy link
Contributor

  • babe: allow skipping epochs in pallet
  • babe: detect and skip epochs on client
  • babe: cleaner epoch util functions
  • babe: add test for runtime handling of skipped epochs
  • babe: simpler implementation of client handling of skipped epochs
  • babe: test client-side handling of skipped epochs
  • babe: add comments on client-side skipped epochs
  • babe: remove emptyline
  • babe: make it resilient to forks
  • babe: typo
  • babe: overflow-safe math
  • babe: add test for skipping epochs across different forks
  • zombienet: warp-sync integration test added
  • spelling
  • Readme corrected
  • dir name updated
  • Check second phase of warp sync
  • zombienet pipeline enable + naive test network
  • zombienet stage added
  • paritypr/substrate-debug image added for zombienet testing
  • debugs added
  • debugs added
  • buildah problem fixed
  • rollback
  • runner tag
  • test name corrected
  • dir renamed (regex problem)
  • common code clean up
  • common code clean up
  • fix
  • warp sync test improvements
  • full sha used
  • disable tracing for nodes
  • COMMON_USER -> DOCKERIO_USER
  • refs reworked
  • paritypr/substrate image used
  • DOCKERIO -> DOCKER
  • generate-ws-db toml cleanup
  • improvements
  • fix
  • zombienet: warp sync test enabled

@michalkucharczyk
Copy link
Contributor Author

bot rebase

@paritytech-processbot
Copy link

Rebased

@davxy
Copy link
Member

davxy commented Nov 26, 2022

So, I investigated a bit why we are receiving the following error after a warp sync.

2022-11-26 10:44:22.488  WARN tokio-runtime-worker sync: 💔 Verification failed for block 0x2babfbf24066fa1b420934090c6472bf57f9b16a91fddf7ba3659ef7ad82449b received from peer: 12D3KooWPKzmmE2uYgF3z13xjpbFTp63g9dZFag8pG6MgnpSLF4S, "Could not fetch epoch at 0x5448eda4b7963ee85d28ea523149317dc39ce41173e198126b7ba67b7e123854"    

I'm going to recap the scenario of the test:

  • last block in the backup: hash: 0x5448, number: 12133
  • "NOW": ~ the slot corresponding to the time when we are running the tests

After warp sync the node starts recovering the old blocks and, when are found, epoch changes entries are stored in the GAP.

if let Some(gap) = &mut self.gap {
if let PersistedEpoch::Regular(e) = epoch {
epoch = match gap.import(slot, hash, number, e) {
Ok(()) => return Ok(()),
Err(e) => PersistedEpoch::Regular(e),
}
}
} else if epoch.is_genesis() &&
!self.epochs.is_empty() &&
!self.epochs.values().any(|e| e.is_genesis())
{
// There's a genesis epoch imported when we already have an active epoch.
// This happens after the warp sync as the ancient blocks download start.
// We need to start tracking gap epochs here.
self.gap = Some(GapEpochs { current: (hash, number, epoch), next: None });
return Ok(())
}


So far so good, until we hit the import of block number 12134 (one generated by the authorities "NOW").

The BABE's verification procedure tries to fetch the epoch changes entry for child of 0x5448 using the slot found in the pre-digest of the new imported block (i.e. slot corresponding to "NOW").

let epoch_descriptor = epoch_changes
.epoch_descriptor_for_child_of(
descendent_query(&*self.client),
&parent_hash,
parent_header_metadata.number,
pre_digest.slot(),
)

These are the values passed to epoch_descriptor_for_child_of():

hash = 0x5448eda4b7963ee85d28ea523149317dc39ce41173e198126b7ba67b7e123854
number = 12133
slot = 556485271

As a recap, when we lookup an entry for a (hash, slot) couple:

  • First we search in the GAP. Here we require that, for an epoch_entry in the gap, that epoch_entry.start_slot <= slot AND slot < epoch_entry.end_slot. See here
  • If the gap search fails then we search in the FORK-TREE: we take the deepest epoch_entry in the tree that has been signaled/announced by a block that is an ancestor of hash and that satisfies the predicate epoch_entry.start_slot <= slot. this strategy works even if we have skipped epochs

Now, because after a warp sync, the epoch entry usable by block 0x5448 is in the GAP AND we are using a slot computed using "NOW" while the entry in the gap has an old slot value, the lookup fails.


Notice that the slot is set to ~"NOW" (and NOT the time of creation of db, which is older).

It searches in the epoch changes tree GAP entry first.

This is the GAP at the moment of the search:

GapEpochs {
  current: (
    0x6dee8c1cfda34037d210521730d0cb39f4d04ad85680407453c1de88e55227b3,
    12121,
    Regular(Epoch {
      epoch_index: 1213,
      start_slot: Slot(556021772),
      duration: 10,
      ... not relevant
    })    
  ),
  next: Some((
    0xaea8b6dae1f008987ee4b6245fce4fa52fc7a3311a69d380514edc5598f9ba77,
    12131,
    Epoch {
      epoch_index: 1214,
      start_slot: Slot(556021782),
      duration: 10,
      ... not relevant
    }
  ))    
}    

At this point, the GAP entry to be used for 0x5448 is next, but unfortunately the GAP lookup method has a
very strict check over the slot value.

That is, the slot we are passing should be within the start and start+duration of an entry in the gap.
Since we are using some slot ~"NOW" this is not satisfied.

The search then proceeds in the FORK-TREE, that fails because the entry is not there.


Potential solution

(Not tested idea)

In the GAP lookup method:

  1. start search from the older entry first
  2. remove the endpoint constraint, and check only that entry.start_slot <= slot (similar to the strategy used by the tree search)
  3. together with the maybe found entry return a bool to signal if a "not exact" match has been found (i.e. if slot >= entry.end_slot)

Finally, if the gap match is not exact or None, try to check if there is a better match using the fork tree.

Return the best thing found.

@michalkucharczyk
Copy link
Contributor Author

I don't know exact reasons to have the dedicated GapEpochs. If I understand correctly it is simplified and optimized version of the ForkTree.

One solution might also be to try to get rid of GapEpochs and use ForkTree instead of this. This would allow to have exactly one implementation of the logic behind finding the right epochs changes announcement.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants