Conversation
Could you add a PR description summarizing changes, intent, and assumptions?
I don't have a strong opinion on this change (assuming it helps to some extent on Kusama), but I would like to think ahead to asynchronous backing somewhat. In the future, stuff like pre-connect will be less important because the buffering will happen at the validator/backing level. There'll be some changes that might affect things here, particularly related to how collations are generated; e.g. in #5056 we propose moving the logic of when to schedule a collation entirely to the Cumulus implementation (or other collator implementation). In that model, …

This changeset isn't very large, so we shouldn't really mind doing some rewriting of these changes if they get us some short-term benefits in real-world deployments.
@rphmeier I've updated the description
Wouldn't we assume that our core is currently unoccupied? And why is this assumption necessary?
The pipeline is as follows:
While I think it makes sense to have some preconnection logic for occupied cores, this does still maintain some connection logic for when the core is free, right?
In practice, how is this "the collator knows" logic implemented on the Cumulus side? Not all collators author parablocks all the time.
```rust
/// backed in the grandchild of relay parent, therefore based on the child.
/// The core is expected to be occupied.
///
/// **Important note**: parathread cores very well may happen to be free.
```
Tangentially related, but as you have studied this part of the code and thought about parathreads, what information do you think the runtime needs to provide about parathread cores in the future? Since we also want to experiment with other kinds of exotic scheduling (2 or 3 chains sharing 1 slot and taking turns, for instance), it would be nice to have a unified abstraction for scheduling information in the runtime API.
For the current case, when a parachain takes a core for the whole session duration, it's fine to rely on the scheduled-occupied-scheduled... behavior.
With parathreads a core can happen to be e.g. free-scheduled-occupied, so we need to know in advance which para id the core is going to be scheduled for.
On the other hand, with async backing this change will be unnecessary; we could probably just drop it later.
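The distinction above can be sketched as a small lookahead helper. This is only an illustration: the `CoreState` enum and `preconnect_para` function are made-up names loosely modelled on the core-state idea discussed here, not the actual runtime API.

```rust
// Illustrative sketch only: types and names are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
enum CoreState {
    /// Parathread cores may well be free; no group to discover in advance.
    Free,
    /// Scheduled for the given para id.
    Scheduled(u32),
    /// Occupied; the next para to be scheduled may or may not be known.
    Occupied { next_up: Option<u32> },
}

/// Which para id can we pre-connect for, if any?
/// A parachain holding a core for the whole session alternates
/// scheduled-occupied-scheduled, so the answer is always known; a
/// parathread core can be free, in which case nothing can be discovered.
fn preconnect_para(core: CoreState) -> Option<u32> {
    match core {
        CoreState::Scheduled(para) => Some(para),
        CoreState::Occupied { next_up } => next_up,
        CoreState::Free => None,
    }
}
```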
```rust
/// if the collator is aware of its assignment on the child of the given relay parent.
/// Connects to both current and the next group if the block is
/// [close](NEXT_GROUP_PRECONNECT_WINDOW) to the group rotation boundary.
async fn preconnect<Context>(
```
Unlike the `is_preconnect_allowed` counterpart, I think this is cheap enough without refactoring, as it only has to run once per relay-parent at most.
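For reference, the "close to the rotation boundary" check the doc comment describes could look roughly like this. The constant value, the session-start handling, and the function names are assumptions; the real code derives this from `GroupRotationInfo`-style runtime data.

```rust
// Hedged sketch: assumes groups rotate every `frequency` relay-chain
// blocks, counted from the session start.

const NEXT_GROUP_PRECONNECT_WINDOW: u32 = 2;

/// Relay-chain blocks remaining until the current group rotation ends.
fn blocks_until_rotation(block_number: u32, session_start: u32, frequency: u32) -> u32 {
    let since_start = block_number - session_start;
    frequency - (since_start % frequency)
}

/// Connect to the next group too when the boundary is close.
fn connect_to_next_group(block_number: u32, session_start: u32, frequency: u32) -> bool {
    blocks_until_rotation(block_number, session_start, frequency)
        <= NEXT_GROUP_PRECONNECT_WINDOW
}
```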
I gave a light review; not planning to dive too much deeper. Looks broadly sensible, as in this is behavior that is probably somewhat useful. Thanks!
However, since this is meant to close #5255, I don't really see this doing that. The original plan in #5255 was to query the collator before invoking `fn produce_collation` with something like a `fn will_produce_collation`, and then connect to the validators of the current group if `will_produce_collation` is true.
During the discussion with @eskimor we decided that implementing both … It's also mentioned in #5255.
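The `will_produce_collation` idea from #5255 could be sketched as below. The trait shape and type names here are hypothetical, not the actual Cumulus interface; `String` stands in for a real collation type.

```rust
// Hypothetical sketch of the #5255 proposal: query the collator cheaply
// before pre-connecting, and only connect if it intends to collate.

trait Collator {
    /// Cheap check: would `produce_collation` yield anything here?
    fn will_produce_collation(&self) -> bool;
    /// The (expensive) collation itself.
    fn produce_collation(&self) -> Option<String>;
}

struct AlwaysCollates;
impl Collator for AlwaysCollates {
    fn will_produce_collation(&self) -> bool { true }
    fn produce_collation(&self) -> Option<String> { Some("collation".to_owned()) }
}

struct NeverCollates;
impl Collator for NeverCollates {
    fn will_produce_collation(&self) -> bool { false }
    fn produce_collation(&self) -> Option<String> { None }
}

/// Pre-connect to the current group only if the collator will collate.
fn should_preconnect<C: Collator>(collator: &C) -> bool {
    collator.will_produce_collation()
}
```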
According to my code, preconnect will not work if the core is free. I don't really know how to discover our validator group in this case.
For AuRa we can try to claim …
cfbecaf to 0b85c86
It should be trivial if the core is scheduled. If the core is free, yeah, nothing to do.
Note that #4381 also includes …, which doesn't seem to be taken care of in this PR. To disconnect, we need to issue an empty …
In my opinion it makes sense to do it in a separate task, unless we can come up with a simple solution here. We have to figure out when to disconnect.
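As a rough illustration of the "empty request disconnects" idea: if the network layer only tracks the most recently requested validator set, requesting an empty set drops every existing connection. The `ConnectionManager` type and its method are made up for this sketch; in Polkadot this would go through the network bridge.

```rust
use std::collections::HashSet;

// Toy model: only the latest requested validator set is kept.
#[derive(Default)]
struct ConnectionManager {
    connected: HashSet<String>, // authority ids, simplified to strings
}

impl ConnectionManager {
    /// Each request *replaces* the previous desired set; an empty request
    /// therefore disconnects from everyone.
    fn connect_to_validators(&mut self, validators: Vec<String>) {
        self.connected = validators.into_iter().collect();
    }
}
```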
Nice work overall! Thanks!
```rust
fn poll(self: Pin<&mut Self>, _cx: &mut FuturesContext) -> Poll<Self::Output> {
    Poll::Ready(Some(CollationResult { collation: test_collation(), result_sender: None }))
```

```rust
#[async_trait::async_trait]
```
We should add a small test for the pre-connect functionality as well, I think.
```rust
)
.await?;
```

```rust
connect_to_validators(ctx, validators_group.validators.into_iter().collect()).await;
```
This stateless connection mechanism can't really work, but the issue was present before already and could be an explanation for short stalls:
If we have two forks with different heights, they might be in different backing groups, hence they would disconnect each other. I believe we had this discussion before ... @ordian, I believe you had some reasoning why this is not a problem?
This might be a problem at group rotation boundaries. In #4640 (which was reverted) we used a union of those groups. Maybe we should prefer the new group over the old one in this case?
Doesn't making `NEXT_GROUP_PRECONNECT_WINDOW` big enough cover this issue? How big can the spread of block numbers in our view be?
That would still lead to stalls if the shorter fork is actually the better one. We should just make sure to be connected to both - I will have a look at that.
> Doesn't making `NEXT_GROUP_PRECONNECT_WINDOW` big enough cover this issue? How big can the spread of block numbers in our view be?
It is not really limited; larger spreads just become increasingly unlikely. Indeed, for pre-connect it will likely be less of a problem, due to `NEXT_GROUP_PRECONNECT_WINDOW`.
Resolves #5255, #4381
In short: whenever possible, a collator should establish a connection to the validator group in advance (for a given active leaf, prepare for collating on its child).
Stuff to take into account: group rotations; if we're close to one, issue connections to both the current and the next group assigned to our core.
Assumptions:
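Putting the rotation handling together, a hedged sketch of which group(s) to connect to for a core. The rotation formula is modelled on the `(core + now / frequency) % n_groups` scheme of `GroupRotationInfo`; the window handling and all names are assumptions, not the PR's code.

```rust
/// Index of the group currently assigned to `core` (illustrative).
fn group_for_core(core: u32, block_number: u32, frequency: u32, n_groups: u32) -> u32 {
    (core + block_number / frequency) % n_groups
}

/// Groups to connect to: the current one, plus the next one when the
/// rotation boundary is within `window` blocks.
fn groups_to_connect(
    core: u32,
    block_number: u32,
    frequency: u32,
    n_groups: u32,
    window: u32,
) -> Vec<u32> {
    let current = group_for_core(core, block_number, frequency, n_groups);
    let until_rotation = frequency - (block_number % frequency);
    if until_rotation <= window {
        // Near the boundary: also connect to the group taking over next.
        vec![current, (current + 1) % n_groups]
    } else {
        vec![current]
    }
}
```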