
Updating Nimbus SessionKey is ignored. Client continues proposing new blocks with the old key on runtime upgrade #75

Garandor opened this issue Aug 12, 2022 · 7 comments

Comments

@Garandor
Contributor

Garandor commented Aug 12, 2022

I'm seeing that after calling setKeys to update the node's Nimbus key, the node continues proposing with the old SessionKey.

Our implementation of AccountLookup uses pallet_session::key_owner to map NimbusId -> AccountId, so the Nimbus key must match what's set in pallet_session::queuedKeys for the node to produce valid blocks.
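
For context, a minimal sketch of such a lookup, with the trait and constant taken from nimbus_primitives; the wrapper struct is illustrative and our actual implementation may differ:

use nimbus_primitives::{AccountLookup, NimbusId, NIMBUS_KEY_ID};

pub struct SessionKeyLookup<T>(core::marker::PhantomData<T>);

impl<T: pallet_session::Config> AccountLookup<T::ValidatorId> for SessionKeyLookup<T> {
    fn lookup_account(author: &NimbusId) -> Option<T::ValidatorId> {
        // key_owner returns the account that registered this key via
        // setKeys; None means the key is not (or not yet) eligible.
        pallet_session::Pallet::<T>::key_owner(NIMBUS_KEY_ID, author.as_ref())
    }
}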

Because of this, our polkadot-launch --chain parachain-local nodes cannot start block production after a runtime upgrade that enables Nimbus on a chain that didn't use it before.

Expected Results

The node queries the current Nimbus key from the runtime and uses that key to author.

Analysis

The function the client uses to build the digest is nimbus_consensus::seal_header (nimbus-consensus/src/lib.rs), which calls SyncCryptoStore::sign_with.

The rustdocs of sign_with state:

Given a list of public keys, find the first supported key and sign the provided message with that key.

Now the problem is that the two ways to change your node's session keys, author_rotateKeys and author_insertKey, both internally call SessionKeys::generate_session_keys, which adds the new keys to the crypto store but does NOT remove the old ones.

So after running one of the above, there will be multiple keys in the store matching the nmbs session key type. Nimbus Consensus implicitly picks the first one to propose blocks, which then get rejected by the runtime as they don't match the key set in pallet_session::queuedKeys, preventing the node from producing blocks.
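
To make the ordering hazard concrete, here is a hedged sketch using the sp-keystore API (the helper itself is hypothetical): after a rotate or insert, keys() returns every stored nmbs key, old and new, and nothing about that list encodes which one is current.

use sp_core::crypto::KeyTypeId;
use sp_keystore::SyncCryptoStore;

const NIMBUS_KEY_ID: KeyTypeId = KeyTypeId(*b"nmbs");

// Hypothetical helper: true if the keystore holds more than one nmbs key,
// i.e. the client may sign with a stale key depending on iteration order.
fn has_stale_key_risk(keystore: &dyn SyncCryptoStore) -> bool {
    SyncCryptoStore::keys(keystore, NIMBUS_KEY_ID)
        .map(|keys| keys.len() > 1)
        .unwrap_or(false)
}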

Workaround

  1. Stop the node
  2. Find the keystore folder and manually delete all old Nimbus keys that don't match what you provided in the most recent call to setKeys. Nimbus key files start with 6e6d6273 (nmbs in hex) and have the hex-encoded public key after that; see the sketch after this list.
  3. Restart the node
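
Step 2 can be scripted. A minimal sketch using only the Rust standard library, assuming the default on-disk keystore layout where each key file is named hex(key type) followed by hex(public key); the helper name and arguments are illustrative:

use std::fs;

// Hypothetical helper: delete every Nimbus key file except the one whose
// hex-encoded public key was passed to the most recent setKeys call.
fn prune_stale_nimbus_keys(keystore_dir: &str, keep_pub_hex: &str) -> std::io::Result<()> {
    for entry in fs::read_dir(keystore_dir)? {
        let entry = entry?;
        let name = entry.file_name().into_string().unwrap_or_default();
        // "6e6d6273" is "nmbs" in hex; key files are that prefix plus the public key.
        if name.starts_with("6e6d6273") && !name.ends_with(keep_pub_hex) {
            fs::remove_file(entry.path())?;
        }
    }
    Ok(())
}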

This is extra problematic on polkadot-launch --chain something-local deployments with the --alice etc. nodes, as their well-known keys are not held in the keystore on the filesystem and thus cannot be deleted.

@JoshOrndorff
Contributor

The client will author with the first eligible session key it finds. If a single client has two eligible session keys in its keystore, there is no guarantee which one it will try first.

I've bolded the word eligible in the previous paragraph, because the client calls a runtime API to ask the runtime whether a particular session key is eligible before authoring with it. You could, and probably should, make your runtime logic such that when you insert a new session key, it removes the old one, so they are not both eligible. For example, Moonbeam's pallet author mapping has this extrinsic.
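
For reference, a hedged sketch of that pattern as a FRAME call, loosely modeled on Moonbeam's update_association and shown as a fragment of a pallet; the storage item, error variants, and weight are assumptions:

#[pallet::call]
impl<T: Config> Pallet<T> {
    // Rotate the caller's Nimbus key: remove the old mapping so the old
    // key stops being eligible, then register the new one in the same call.
    #[pallet::weight(10_000)]
    pub fn update_association(
        origin: OriginFor<T>,
        old_author_id: NimbusId,
        new_author_id: NimbusId,
    ) -> DispatchResult {
        let account = ensure_signed(origin)?;
        let info = MappingWithDeposit::<T>::get(&old_author_id)
            .ok_or(Error::<T>::AssociationNotFound)?;
        ensure!(info.account == account, Error::<T>::NotYourAssociation);
        MappingWithDeposit::<T>::remove(&old_author_id);
        MappingWithDeposit::<T>::insert(&new_author_id, &info);
        Ok(())
    }
}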

@JoshOrndorff
Contributor

Nimbus Consensus implicitly picks the first one to propose blocks, which then get rejected by the runtime

This should not happen in normal operation. If you are using --force-authoring, however, the eligibility check that I mentioned above will be skipped, and the literal first nmbs key found will be used regardless of whether it is eligible.

@Garandor
Contributor Author

Garandor commented Aug 12, 2022

Oh, I see the code you mentioned. It's:
maybe_key = if self.skip_prediction || runtime_upgraded { first_available_key(&*self.keystore) }

The runtime_upgraded part is from @librelois's commit b54f30e, three months ago.

Now the trouble we're running into is exactly the runtime_upgraded case: before the breaking runtime upgrade, our collators run insert_key to add a Nimbus key to the keystore and setKeys to make it eligible to produce blocks right after the upgrade.

During testing with a --alice node, restarting the chain after upgrading the client fails because the node uses the default //alice key to propose, even though only the key added with insert_key is eligible.

In the special case of well-known keys, there is no way to remove the alice key from the keystore (because it's not stored in the keystore folder) and thus no way to prevent the old key from being used in the above logic.

I expect the same problem to happen on the live chain if community members run insert_key two or more times.
Though they mark only one of these keys as eligible using pallet_session::setKeys, whether the node can author the first block after the runtime upgrade comes down to random chance: it depends on whether the second generated key sorts lexicographically before or after the first.

During normal operation (i.e. not right after a runtime upgrade) I can actually see it working correctly, trying AccountLookup with both keys:

[Parachain] RAD: nimkey: [186, 194, 53, 9, 188, 183, 98, 240, 238, 4, 168, 103, 132, 101, 0, 241, 154, 57, 226, 161, 228, 67, 190, 15, 2, 143, 187, 48, 221, 114, 14, 42]    
[Parachain] RAD: validatorId: None    
[Parachain] RAD: no validator Id for Nimbus Key    
[Parachain] RAD: nimkey: [10, 249, 215, 142, 179, 31, 211, 131, 120, 15, 64, 172, 250, 247, 110, 22, 30, 214, 229, 238, 175, 155, 6, 200, 107, 51, 91, 6, 123, 222, 254, 32]    
[Parachain] RAD: validatorId: Some()    
[Parachain] 🙌 Starting consensus session

but on the first block following the runtime upgrade it's just:

[Parachain] RAD: nimkey: [186, 194, 53, 9, 188, 183, 98, 240, 238, 4, 168, 103, 132, 101, 0, 241, 154, 57, 226, 161, 228, 67, 190, 15, 2, 143, 187, 48, 221, 114, 14, 42]    
[Parachain] RAD: validatorId: None    
[Parachain] RAD: no validator Id for Nimbus Key    

=> The runtime panics with No Account Mapped to this NimbusId, and the chain stays stalled.

@Garandor Garandor changed the title Updating Nimbus SessionKey is ignored. Client continues proposing new blocks with the old key Updating Nimbus SessionKey is ignored. Client continues proposing new blocks with the old key on runtime upgrade Aug 12, 2022
@librelois
Contributor

@Garandor the solution is to add a new configuration field skip_prediction_on_runtime_upgrade: bool, because some projects like Moonbeam need this behavior.
If you can work on that I'll review your PR :)
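
A hedged sketch of where that field could live, assuming the consensus params struct is nimbus-consensus's BuildNimbusConsensusParams; the existing fields and generics are elided:

// Sketch only: the real params struct is generic over several types;
// only the fields relevant to this issue are shown.
pub struct BuildNimbusConsensusParams {
    pub skip_prediction: bool,
    // Proposed addition: skip the eligibility prediction only on the first
    // block after a runtime upgrade, instead of always or never.
    pub skip_prediction_on_runtime_upgrade: bool,
}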

@Garandor
Contributor Author

Garandor commented Aug 12, 2022

Thanks, I'll do that. However, in the short term we'd like to release (our node with) a workaround.

Are there any drawbacks to consider if we just remove the runtime_upgraded special handling logic from our nimbus fork for the time being?

@librelois
Contributor

Are there any drawbacks to consider if we just remove the runtime_upgraded special handling logic from our nimbus fork for the time being?

I think it will solve your problem in the short term; I don't see any drawbacks, but you have to test it, of course.

@JoshOrndorff
Contributor

JoshOrndorff commented Aug 16, 2022

As I wrote in #4, I think a better approach would be:

pub enum SkipPrediction {
  Always,
  Never,
  AfterRuntimeUpgrade,
}
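
A hedged sketch of the selection logic this enum would enable; the key type and the two lookups stand in for the real nimbus-consensus items (first_available_key is the one quoted above):

fn choose_key<Key>(
    mode: &SkipPrediction,
    runtime_upgraded: bool,
    first_available_key: impl FnOnce() -> Option<Key>,
    first_eligible_key: impl FnOnce() -> Option<Key>,
) -> Option<Key> {
    let skip = match mode {
        SkipPrediction::Always => true,
        SkipPrediction::Never => false,
        SkipPrediction::AfterRuntimeUpgrade => runtime_upgraded,
    };
    if skip { first_available_key() } else { first_eligible_key() }
}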
