Question about send/recv scheduling #784
Deadlock avoidance in send/recv is solely based on the principle of rotating pairs: when you send to rank + X (or node + X), you also receive at the same time from rank - X (or node - X). Coupling each send to its symmetric receive splits any alltoall[v] operation into a series of rings which can't hang. In practice, we don't execute one ring at a time; we have many rings happening in parallel (p2p nChannels x 8, up to 256), so for relatively small sizes, everything is effectively happening in parallel.
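A minimal sketch of the rotating-pair scheduling described above (the helper is hypothetical, not NCCL's actual code):

```cpp
#include <cstdio>

// Hypothetical helper: in real NCCL the send and its symmetric receive
// are coupled into the same work unit; here we just record the pairing.
static void postSendRecvPair(int sendPeer, int recvPeer) {
  std::printf("send -> %d, recv <- %d\n", sendPeer, recvPeer);
}

// For each offset x, the rank sends to (rank + x) while simultaneously
// receiving from (rank - x). Every step therefore forms a ring over all
// ranks, and a ring of coupled send/recv pairs cannot hang.
void scheduleAllToAll(int rank, int nRanks) {
  for (int x = 0; x < nRanks; x++) {
    int sendPeer = (rank + x) % nRanks;
    int recvPeer = (rank - x + nRanks) % nRanks;
    postSendRecvPair(sendPeer, recvPeer);
  }
}
```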
Thanks a lot! Yes, I understand send/recv will be scheduled to a ring according to sendOrder and recvOrder. Parallel send/recv requires these sends and recvs to be scheduled into the same ncclWork. But it seems that, under extreme conditions, a paired recv and send on the ring can be scheduled to different ncclWorks (https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L659), as recv and send tasks are scheduled to a ncclWork via two separate calls of addP2pToPlan.
That should not be the case, although there is always a potential for bugs. I'd need to spend quite some time to get back into this algorithm and see whether there is a case where this could happen. Maybe @jbachan has a fresher view on this.
I think it might be a bug, since when …
Let's assume we modify the boundary check here (https://github.com/NVIDIA/nccl/blob/master/src/enqueue.cc#L209) from … to …. Then, in this case, if we modify the sendrecv test (https://github.com/NVIDIA/nccl-tests/blob/master/src/sendrecv.cu#L46) to send and recv from itself before the send/recv to the peer:
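A plausible sketch of such a modification, reusing the grouped ncclSend/ncclRecv pattern from sendrecv.cu (buffer and variable names assumed):

```cpp
// Sketch: each rank first sends to and receives from itself, then does
// the usual send/recv with its peer, all inside one NCCL group.
NCCLCHECK(ncclGroupStart());
NCCLCHECK(ncclSend(sendbuff, count, type, rank, comm, stream)); // self-send
NCCLCHECK(ncclRecv(recvbuff, count, type, rank, comm, stream)); // self-recv
NCCLCHECK(ncclSend(sendbuff, count, type, peer, comm, stream));
NCCLCHECK(ncclRecv(recvbuff, count, type, peer, comm, stream));
NCCLCHECK(ncclGroupEnd());
```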
Then running with the following command on 2 GPUs will hang: …
Yes, that's why we have 8 sends and 8 receives per p2pWorkElem, and we can't replace a send by a receive or vice versa (receives always use even slots and sends use odd slots IIRC). So we always have a slot for a pair of send/receive. |
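A tiny sketch of that slot convention as described (assumed from the comment above, not verified against the source):

```cpp
// Within a ncclWork's element array, receive elements occupy even slots
// and send elements occupy odd slots, so pair i of a grouped send/recv
// always lands in the same ncclWork (up to 8 pairs per work).
int recvSlot(int pair) { return 2 * pair; }     // even slot: receive
int sendSlot(int pair) { return 2 * pair + 1; } // odd slot: send
```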
That's possible indeed when we have sends without a receive and then send/receive pairs. I'd let @jbachan review that possibility and opine. |
I haven't been able to convince myself that we are bug free, so I'll just blather about how I wish it worked: deadlock freedom can be achieved even if we processed at most one send OR recv at a time. To envision how, consider that each matching pair of (send(), recv()) is a global collective involving all ranks, except that all but 2 of them (1 sender + 1 receiver) have nothing to do, like an extremely sparse alltoallv. Let's introduce a new collective call with these semantics:
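A minimal sketch of what such a call could look like (hypothetical signature; not part of the real NCCL API):

```cpp
#include <nccl.h>

// Hypothetical collective: every rank in the communicator issues the
// call, but only `sender` and `receiver` move data; for all other
// ranks it is semantically a no-op.
ncclResult_t ncclSendRecv(const void* sendbuff, void* recvbuff,
                          size_t count, ncclDataType_t datatype,
                          int sender, int receiver,
                          ncclComm_t comm, cudaStream_t stream);
```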
Since the implementation of ncclSendRecv just enqueues send and recv tasks to the device, and only if the rank is participating in this send/recv, it's silly to require all ranks to issue all ncclSendRecv's, so we relax that: only ranks which are either the sender or the receiver need to issue the call. Since not all ranks issue every ncclSendRecv, we can no longer demand they all issue the same ncclSendRecv's in the same order. Instead, the new constraint is that there exists some global order of ncclSendRecv's such that the order submitted by any chosen rank is never in violation of it. We can now simplify the code to:
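A sketch of that simplification, assuming hypothetical internal helpers enqueueSendTask/enqueueRecvTask in place of NCCL's real task queues:

```cpp
#include <nccl.h>

// Hypothetical internal queues standing in for NCCL's real task lists.
void enqueueSendTask(ncclComm_t comm, const void* buf, size_t count,
                     ncclDataType_t dtype, int peer, cudaStream_t stream);
void enqueueRecvTask(ncclComm_t comm, void* buf, size_t count,
                     ncclDataType_t dtype, int peer, cudaStream_t stream);

// Sketch: only a participating rank enqueues any work, so ranks that
// are neither sender nor receiver need not issue the call at all.
ncclResult_t ncclSendRecv(const void* sendbuff, void* recvbuff,
                          size_t count, ncclDataType_t datatype,
                          int sender, int receiver,
                          ncclComm_t comm, cudaStream_t stream) {
  int rank;
  ncclCommUserRank(comm, &rank);
  if (rank == sender)   enqueueSendTask(comm, sendbuff, count, datatype, receiver, stream);
  if (rank == receiver) enqueueRecvTask(comm, recvbuff, count, datatype, sender, stream);
  return ncclSuccess;
}
```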
With that we can just define ncclSend and ncclRecv in terms of ncclSendRecv:
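Under the same assumptions, a sketch of ncclSend/ncclRecv as one-sided wrappers:

```cpp
// Sketch: the existing point-to-point calls become one-sided wrappers
// around the hypothetical ncclSendRecv (rank lookup via ncclCommUserRank).
ncclResult_t ncclSend(const void* sendbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream) {
  int rank;
  ncclCommUserRank(comm, &rank);
  return ncclSendRecv(sendbuff, /*recvbuff=*/nullptr, count, datatype,
                      /*sender=*/rank, /*receiver=*/peer, comm, stream);
}

ncclResult_t ncclRecv(void* recvbuff, size_t count, ncclDataType_t datatype,
                      int peer, ncclComm_t comm, cudaStream_t stream) {
  int rank;
  ncclCommUserRank(comm, &rank);
  return ncclSendRecv(/*sendbuff=*/nullptr, recvbuff, count, datatype,
                      /*sender=*/peer, /*receiver=*/rank, comm, stream);
}
```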
And this hints at how I think NCCL should work: it should have one global SendRecv order against which all local ncclSend's and ncclRecv's are ordered, such that even if the GPU were to process this list serially we would still be deadlock free. Adding parallelism to that is just for performance. We could store this order locally as an array of … Unfortunately, what we actually have are two separate orders, sendOrder and recvOrder.
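A sketch of what that single local order could look like (hypothetical types, not NCCL's actual representation):

```cpp
#include <vector>

// Sketch of the single local order: one list of p2p operations per
// rank, ordered consistently with the global SendRecv order.
// Processing it serially is already deadlock free; spreading entries
// across channels/ncclWork slots only adds speed.
struct P2pOp {
  int  peer;
  bool isSend; // true: send to peer; false: receive from peer
};

std::vector<P2pOp> p2pOrder; // instead of separate sendOrder / recvOrder
```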
Hi, I have a question about how P2P send/recv tasks are scheduled into kernel plans.
It seems that in scheduleP2pTasksToPlan, NCCL schedules send/recv tasks in a group according to a sendOrder and recvOrder that all peers have consensus on, i.e., at the i-th loop iteration, if rank r2's recvOrder[i] == r1, then we must have sendOrder[i] == r2 on rank r1. Hence, for a specific i, we may have the following send/recv pattern (i=2 for intra-node):
1->3->5->1
rank 1: sendPeer=3, recvPeer=5
rank 3: sendPeer=5, recvPeer=1
rank 5: sendPeer=1, recvPeer=3
In this case, I wonder how the send/recv ncclWorkElemP2p's are scheduled in a way that prevents deadlock. It seems that when addP2pToPlan is called, we may schedule a recv ncclWorkElemP2p to one ncclWork, and the paired send ncclWorkElemP2p to a following ncclWork, hence the send and recv will not execute in parallel. This could happen, e.g., if we only have 1 channel for P2P communication, and that channel's workQueue tail ncclWork already has 8 p2pSend work elems. We may then have rank 1 waiting to recv from rank 5, rank 3 waiting for rank 1, and rank 5 waiting for rank 3. The three ranks' send tasks reside in another ncclWork, hence they will be blocked until the recvs finish, which does not seem to be the case.
Thanks!