Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the new ucx clientId apis #4357

Closed
wants to merge 3 commits into from

Conversation

abellina
Copy link
Collaborator

@abellina abellina commented Dec 14, 2021

Signed-off-by: Alessandro Bellina abellina@nvidia.com

Closes #3106.

This is a change to enable use of the new clientId in UCX 1.12.0. This also moves us to the new version of UCX. In a nutshell, this functionally prevents us from having redundant connections between peers. As the connection establishment is a race condition, since peers are connecting to each other when they receive a heartbeat or when they intend to fetch shuffle blocks, we now have enough information in the UcpConnectionRequest to reject it, if we already had initiated a connection to the peer that is trying to connect to us.

This is in draft. UCX 1.12 is not released yet so, I am posting to get some reviews from @petro-rudenko and to give a heads up this will go into branch-22.02 when UCX 1.12 is officially released.

@abellina abellina added the shuffle things that impact the shuffle plugin label Dec 14, 2021
petro-rudenko
petro-rudenko previously approved these changes Dec 14, 2021
@abellina abellina changed the base branch from branch-22.02 to branch-22.04 January 19, 2022 18:59
@abellina abellina dismissed petro-rudenko’s stale review January 19, 2022 18:59

The base branch was changed.

@abellina
Copy link
Collaborator Author

A quick update on this PR.

  • There is a race condition in the code in this PR. This is the case where there is a pending connection request from executor A to B, and also from B to A. The way the code currently works, both connections will be rejected, because both executors A and B had an outbound connection to the peer. One solution that I am looking at is to only reject if the executorId for my peer is less than (or greater than) my executor.

  • We are seeing a regression in UCX 1.12 where the progress thread is not going to sleep per the wakeup mechanism we've been using so far. This leads to inconsistent timings that hid the issue but an nsight trace shows the progress thread doing this unfortunately. This means we are going to need to wait to get this in until UCX 1.12.1 very likely.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@abellina
Copy link
Collaborator Author

I have yet another race condition with this change unfortunately. Here's the scenario:

1a. Executor 1 starts an endpoint to executor 2
1b. Executor 1 sends a message using this endpoint.
1c. Executor 2 starts an endpoint to executor 1
2a. Executor 2 rejects the ConnectionRequest from executor 1 (1a) since it raced at creation time.
2b. Executor 1 sees an error, since the message sent in (1b) is rejected because the endpoint used ended up getting rejected.

So this should mean we need to queue messages until the executor's endpoint is "accepted", so we are guaranteed it is not going away, or may need a retry mechanism.

@abellina abellina changed the base branch from branch-22.04 to branch-22.06 March 28, 2022 16:00
@sameerz sameerz changed the base branch from branch-22.06 to branch-22.08 June 14, 2022 00:48
@abellina
Copy link
Collaborator Author

Closing this for now, as it is not a high priority. We may come back to this in the future.

@abellina abellina closed this Jul 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
shuffle things that impact the shuffle plugin
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Use clientId to reject connection requests for peers with existing UCX endpoints
2 participants