-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use the new ucx clientId apis #4357
Conversation
A quick update on this PR.
|
Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
I have yet another race condition with this change unfortunately. Here's the scenario: 1a. Executor 1 starts an endpoint to executor 2 So this should mean we need to queue messages until the executor's endpoint is "accepted", so we are guaranteed it is not going away, or may need a retry mechanism. |
Closing this for now, as it is not a high priority. We may come back to this in the future. |
Signed-off-by: Alessandro Bellina abellina@nvidia.com
Closes #3106.
This is a change to enable use of the new
clientId
in UCX 1.12.0. This also moves us to the new version of UCX. In a nutshell, this functionally prevents us from having redundant connections between peers. As the connection establishment is a race condition, since peers are connecting to each other when they receive a heartbeat or when they intend to fetch shuffle blocks, we now have enough information in theUcpConnectionRequest
to reject it, if we already had initiated a connection to the peer that is trying to connect to us.This is in draft. UCX 1.12 is not released yet so, I am posting to get some reviews from @petro-rudenko and to give a heads up this will go into branch-22.02 when UCX 1.12 is officially released.