Alltoallv hangs with NCCL 2.12 and newer #788
Can you try again with PXN disabled? Given the node topology, I'm not sure why it would make a difference though; there is no PCI switch, and PXN is only used for GPU-NIC communication through NVLink and PCI switches. If it doesn't make a difference, I'd want to confirm we are actually launching the alltoall operation on all ranks (and it's not that some ranks are stuck outside of NCCL). To do that, I'd enable NCCL's per-operation debug logging.
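As an illustration only (not necessarily the exact setting suggested above), per-rank, per-collective NCCL logging can be enabled through the standard NCCL_DEBUG environment variables before the communicator is created:

```cpp
// Hypothetical helper: turn on NCCL's collective-level logging before the
// communicator is created, so every rank prints one line per collective
// (including the opCount values referenced later in this thread).
// NCCL_DEBUG / NCCL_DEBUG_SUBSYS / NCCL_DEBUG_FILE are standard NCCL
// environment variables; using this combination here is an assumption.
#include <cstdlib>

void enable_nccl_coll_logging() {
    setenv("NCCL_DEBUG", "INFO", 1);                  // verbose NCCL logging
    setenv("NCCL_DEBUG_SUBSYS", "COLL", 1);           // limit output to collective calls
    setenv("NCCL_DEBUG_FILE", "nccl.%h.%p.log", 1);   // one log file per host/PID
}
```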
Unfortunately, with NCCL 2.16.5 I still observe these hangs even with PXN disabled.
We do have NVLink bridges on pairs of A40 GPUs here if that makes a difference. Also, I think we never installed drivers for GPUDirect RDMA or experimented with that class of features (because apparently we saturate 100 Gb InfiniBand quite well as it is).
Sure, I can collect these logs tomorrow or next week, happy to provide information that could help clear this up. (Though I had Horovod trace logging enabled before and that indicated that all ranks submitted the alltoall, but only ranks 0-7 and 16-23 then submitted the allreduce that would follow.)
Ok, thanks for the confirmation. So indeed PXN is not the reason for the hang, given we don't use it (no PCI switch between GPU and NIC). Did you run with the debug logging enabled? One more thing I'd like to confirm: are you launching one process per GPU, or do you manage multiple GPUs per process?
Narrowing this down to not being caused by PXN is some progress. 🙂
The run was with PXN disabled, but without the debug logging yet.
This is launched via MPI with one GPU per process.
I've also reproduced the hang with NCCL 2.16.5 another time, having set the debug logging. This time ranks 0-7 and 16-23 stall:
I extracted a tail of the debug log for each of those ranks.
Thanks, that's very interesting. So this is not an alltoall but an alltoallv, with very different sizes between ranks. I'm a bit afraid it could be a bug in NCCL (in the way we split chunks on channels and such). It would be awesome if you could gather the size matrix for the alltoall that hangs. Basically, for each rank, get the logs to select these columns from the last 64 lines (last alltoall):
That way, we could try to reproduce the issue internally and fix it. Thanks!
That's right, I forgot to mention that the alltoall operations aren't symmetric; they are indeed alltoallv-style with very different sizes between ranks. I've gathered and slightly cleaned up the Send/Recv log lines tagged with 'opCount 43e8' for each rank. Those should correspond to the last and hanging alltoallv. It's less than 64 lines for most ranks because some sizes may be zero and Horovod skips the corresponding send/recv calls.
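For reference, a minimal sketch (not Horovod's actual code) of the pattern being described: an alltoallv expressed as grouped ncclSend/ncclRecv calls, with zero-sized transfers skipped. Buffer and parameter names are made up for the illustration.

```cpp
// Sketch of an alltoallv built from point-to-point NCCL calls, with
// zero-size transfers skipped (as Horovod does). Skipping peers is what
// later turns out to trigger the hang discussed in this issue.
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstddef>

// sendcounts[r] / recvcounts[r]: elements exchanged with rank r
// senddispls[r] / recvdispls[r]: element offsets into sendbuf / recvbuf
ncclResult_t alltoallv_skip_zero(const float* sendbuf, const size_t* sendcounts,
                                 const size_t* senddispls, float* recvbuf,
                                 const size_t* recvcounts, const size_t* recvdispls,
                                 int nranks, ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();
    for (int r = 0; r < nranks; ++r) {
        if (sendcounts[r] > 0)   // zero-size sends are skipped...
            ncclSend(sendbuf + senddispls[r], sendcounts[r], ncclFloat, r, comm, stream);
        if (recvcounts[r] > 0)   // ...and so are zero-size receives
            ncclRecv(recvbuf + recvdispls[r], recvcounts[r], ncclFloat, r, comm, stream);
    }
    return ncclGroupEnd();
}
```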
Thanks a lot! That should be enough to reproduce the alltoallv scenario.
Hi, I've looked at the trace and found that, for example, rank 04 was not sending anything to rank 17, but rank 17 is receiving 1597440 bytes from rank 04. I've not found any other mismatch, only cases where one side was not sending or receiving (size = 0?) while the other side did send or receive. The full diff:
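A small sketch of the kind of cross-check applied here, assuming the per-rank logs have already been parsed into two matrices: for a consistent alltoallv, the bytes rank i reports sending to rank j must equal the bytes rank j reports receiving from rank i; an entry present on only one side corresponds to a skipped zero-size transfer.

```cpp
// Sketch: compare a send matrix against a receive matrix.
// send[i][j] = bytes rank i says it sends to rank j
// recv[j][i] = bytes rank j says it receives from rank i
// Matrix names and layout are assumptions for the illustration.
#include <cstdio>
#include <vector>

void check_alltoallv_consistency(const std::vector<std::vector<long long>>& send,
                                 const std::vector<std::vector<long long>>& recv) {
    const size_t n = send.size();
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            if (send[i][j] != recv[j][i]) {
                std::printf("mismatch %zu -> %zu: send %lld bytes, recv %lld bytes\n",
                            i, j, send[i][j], recv[j][i]);
            }
        }
    }
}
```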
Sorry for that. Some lines were missing in the logs. I fixed that and parsed them myself this time. Here, recv_mat.csv.txt |
Hi, I finally found time to try it again, and I believe I managed to reproduce it. Can you check whether forcing more channels works around the issue on your side?
Ok, I could confirm the issue is the one reported here: #784 (comment). We don't send to some ranks, so the workElems start to mix peers from different nodes, and we end up with a deadlock when we don't have enough channels for concurrent progress. Sorry for the delay. We'll try to fix this ASAP.
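Not necessarily the workaround that was suggested in this thread — just a hedged illustration of the kind of knob that relates to "enough channels for concurrent progress": NCCL's documented NCCL_MIN_NCHANNELS variable forces a larger channel count.

```cpp
// Illustration only: force NCCL to create at least 32 channels, on the
// assumption that more channels give concurrent send/recv work more
// independent progress slots. Whether this matches the workaround discussed
// above is an assumption; the actual fix landed in NCCL 2.18.
#include <cstdlib>

void force_more_nccl_channels() {
    setenv("NCCL_MIN_NCHANNELS", "32", 1);  // must be set before communicator creation
}
```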
Hi @sjeaugey,
I found some time today to look into this again: indeed, the hangs disappear with NCCL 2.16.5 when I apply the workaround you suggested.
Awesome, thanks a lot!
- Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
- Add NVLS+Tree algorithm.
- Add support for memory management using cuMem* functions.
- Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804).
- Add ncclCommSplit primitive, with resource sharing option in config.
- Fix alltoallv hang (#788).
- Increase number of channels on H100 when we're not limited by NVLink.
- Improve error reporting in case of IB failure, printing local and remote ID (#779).
- Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802).
- Fix context creation for progress thread (#803).
- NET/IB: add option to use multiple QPs in round-robin mode.
- Fix tree performance issue when NVB is disabled on HCM topologies.
@maxhgerlach alltoallv bugs should be solved with NCCL 2.18, which just got posted as a preview branch.
Thanks a lot for confirming. Closing as I merged 2.18.1 to master.
I face an issue with NCCL releases newer than v2.11.4. I could reproduce the problem with v2.12.7, v2.15.5, and v2.16.5. NCCL is built with CUDA toolkit 11.2 in each case.
nvidia-smi reports: NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4
The issue surfaces in a Horovod training run (current master) that performs several NCCL alltoall operations in its forward and backward passes. After ~90 training steps it hangs reproducibly. The training uses 32 workers on 4 hosts with 8 GPUs each (Nvidia A40), connected via InfiniBand, with two AMD EPYC Zen3 CPUs on each node. Although the alltoall operations involve all of these processes, Horovod reports stalls only for a subset of them (all other ranks have reported being ready for the next collective operation). For example, ranks 8...15 and 24...31 (corresponding to nodes 1 and 3 of 4) would stall, so there is a pattern related to network topology.
I attached lldb to one of the hanging processes (rank 8). Note that this is with v2.12.7 (the last release that I tried), but the NCCL and Horovod thread backtraces looked very similar with v2.15.5. Here are some excerpts from bt all in the debugger:

--> three threads running NCCL functions.

--> This is the Horovod thread finalizing the GPU queue after an NCCL Alltoall op (launched via https://github.com/horovod/horovod/blob/7a20abeffd12c857c8f392e60fdcd1f648bffe5d/horovod/common/ops/nccl_operations.cc#L1206). It blocks at cudaEventSynchronize. Note that on rank 0 this is not running (any more).

I looked more closely at thread 149 in frame 11 and printed the Horovod tensor table entry: rank8_thread149_frame11.txt -> That basically shows that it's an alltoall operation incorporating all 32 processes (all entries of splits are non-zero).

nccl-tests with NCCL_TOPO_DUMP_FILE=system.txt produced this topology file: system.txt

NCCL 2.12 came with alltoall-related optimizations, which might explain why the problem doesn't occur in earlier versions: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-12-7.html#rel_2-12-7
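For context, a minimal sketch (simplified, not Horovod's actual code) of the launch-then-synchronize pattern described above: the NCCL operations are enqueued on a stream, an event is recorded behind them, and a background thread then blocks in cudaEventSynchronize until the GPU work completes.

```cpp
// Sketch of the pattern the stuck Horovod thread is in: NCCL work is enqueued
// on a CUDA stream, an event is recorded after it, and the background thread
// blocks until the event completes. If the NCCL kernels never finish (as in
// this hang), cudaEventSynchronize never returns. Error handling omitted.
#include <nccl.h>
#include <cuda_runtime.h>
#include <cstddef>

void run_alltoall_and_wait(ncclComm_t comm, cudaStream_t stream,
                           const float* sendbuf, float* recvbuf,
                           size_t count_per_rank, int nranks) {
    ncclGroupStart();
    for (int r = 0; r < nranks; ++r) {
        ncclSend(sendbuf + r * count_per_rank, count_per_rank, ncclFloat, r, comm, stream);
        ncclRecv(recvbuf + r * count_per_rank, count_per_rank, ncclFloat, r, comm, stream);
    }
    ncclGroupEnd();

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
    cudaEventRecord(done, stream);   // marks completion of the enqueued NCCL work
    cudaEventSynchronize(done);      // <- where rank 8's thread is blocked
    cudaEventDestroy(done);
}
```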