Alltoallv hangs with NCCL 2.12 and newer #788

Closed
maxhgerlach opened this issue Feb 22, 2023 · 16 comments

@maxhgerlach

maxhgerlach commented Feb 22, 2023

I face an issue with NCCL releases newer than v2.11.4. I could reproduce the problem with v2.12.7, v2.15.5, and v2.16.5. NCCL is built with CUDA toolkit 11.2 in each case. nvidia-smi reports: NVIDIA-SMI 470.161.03 Driver Version: 470.161.03 CUDA Version: 11.4

The issue surfaces in a Horovod training run (current master) that performs several NCCL alltoall operations in its forward and backward passes. After ~90 training steps it hangs reproducibly. The training runs 32 workers on 4 hosts with 8 GPUs each (Nvidia A40), connected via InfiniBand, with two AMD EPYC Zen3 CPUs per node. Although the alltoall operations involve all of these processes, Horovod reports stalls only for a subset of them (all other ranks have reported that they are ready for the next collective operation). For example, ranks 8...15 and 24...31 (corresponding to nodes 1 and 3 of 4) would stall, so there seems to be a pattern related to network topology.

I attached lldb to one of the hanging processes (rank 8). Note that this is with v2.12.7 (the last release that I tried), but the NCCL and Horovod thread backtraces looked very similar with v2.15.5.

Here are some excerpts from bt all in the debugger:

  thread #150, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007ff6634df00c libc.so.6`__read + 76
    frame #1: 0x00007fe6de8abef4 libibverbs.so`__ibv_get_async_event_1_1 [inlined] read(__nbytes=<unavailable>, __buf=<unavailable>, __fd=<unavailable>) at unistd.h:44:10
    frame #2: 0x00007fe6de8abeec libibverbs.so`__ibv_get_async_event_1_1(context=0x00007fe6e0369df0, event=0x00007fe6ddf13ec0) at device.c:454
    frame #3: 0x00007ff446982ae6 mpi_lib.cpython-39-x86_64-linux-gnu.so`wrap_ibv_get_async_event(context=<unavailable>, event=<unavailable>) at ibvwrap.cc:230:3
    frame #4: 0x00007ff44694a8fb mpi_lib.cpython-39-x86_64-linux-gnu.so`::ncclIbAsyncThreadMain(args=0x00007fe6e0369df0) at net_ib.cc:90:48
    frame #5: 0x00007ff6633b6609 libpthread.so.0`start_thread + 217
    frame #6: 0x00007ff6634f0133 libc.so.6`__clone + 67
  thread #151, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007ff6634e399f libc.so.6`__poll + 79
    frame #1: 0x00007ff44693476e mpi_lib.cpython-39-x86_64-linux-gnu.so`ncclProxyService(void*) [inlined] poll(__timeout=<unavailable>, __nfds=<unavailable>, __fds=<unavailable>) at poll2.h:46:23
    frame #2: 0x00007ff446934769 mpi_lib.cpython-39-x86_64-linux-gnu.so`ncclProxyService(_args=0x00007fe6e03acd20) at proxy.cc:980
    frame #3: 0x00007ff6633b6609 libpthread.so.0`start_thread + 217
    frame #4: 0x00007ff6634f0133 libc.so.6`__clone + 67
  thread #152, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007ff6634d371b libc.so.6`__sched_yield + 11
    frame #1: 0x00007ff446933edd mpi_lib.cpython-39-x86_64-linux-gnu.so`ncclProxyProgress(comm_=0x00007fe6e03acd20) at proxy.cc:629:20
    frame #2: 0x00007ff6633b6609 libpthread.so.0`start_thread + 217
    frame #3: 0x00007ff6634f0133 libc.so.6`__clone + 67

--> Three threads running NCCL functions.

  thread #149, name = 'python', stop reason = signal SIGSTOP
    frame #0: 0x00007ff4c42944df libcuda.so.1`___lldb_unnamed_symbol5986$$libcuda.so.1 + 175
    frame #1: 0x00007ff4c40d89e3 libcuda.so.1`___lldb_unnamed_symbol1686$$libcuda.so.1 + 403
    frame #2: 0x00007ff4c412cd15 libcuda.so.1`___lldb_unnamed_symbol2199$$libcuda.so.1 + 101
    frame #3: 0x00007ff4c4363a26 libcuda.so.1`___lldb_unnamed_symbol8309$$libcuda.so.1 + 22
    frame #4: 0x00007ff4c4364591 libcuda.so.1`___lldb_unnamed_symbol8321$$libcuda.so.1 + 225
    frame #5: 0x00007ff4c41abaf7 libcuda.so.1`___lldb_unnamed_symbol3578$$libcuda.so.1 + 199
    frame #6: 0x00007ff4c4123315 libcuda.so.1`___lldb_unnamed_symbol2168$$libcuda.so.1 + 421
    frame #7: 0x00007ff4c409019b libcuda.so.1`___lldb_unnamed_symbol923$$libcuda.so.1 + 155
    frame #8: 0x00007ff4c4239865 libcuda.so.1`___lldb_unnamed_symbol4967$$libcuda.so.1 + 69
    frame #9: 0x00007ff58df262be libcudart.so.11.0`___lldb_unnamed_symbol80$$libcudart.so.11.0 + 14
    frame #10: 0x00007ff58df57708 libcudart.so.11.0`cudaEventSynchronize + 344
    frame #11: 0x00007ff4468211e1 mpi_lib.cpython-39-x86_64-linux-gnu.so`horovod::common::GPUContext::impl::WaitForEvents(this=0x0000000005d58400, event_queue=0x00007fe6e095c8a0, entries=size=1, timeline=0x00007ff4528cfe18, error_check_callback=0x00007fe6e095c900)> const&) at cuda_operations.cc:124:53
    frame #12: 0x00007ff446820006 mpi_lib.cpython-39-x86_64-linux-gnu.so`horovod::common::GPUContext::WaitForEvents(this=0x00007ff45c0d0a20, event_queue=0x00007fe6e095c8a0, entries=size=1, timeline=0x00007ff4528cfe18, error_check_callback=0x00007fe6e095c900)> const&) at gpu_context_impl.cc:28:23
    frame #13: 0x00007ff44682be60 mpi_lib.cpython-39-x86_64-linux-gnu.so`operator(__closure=0x00007fe6e095c790) at gpu_operations.cc:108:39
    frame #14: 0x00007ff446830c7a mpi_lib.cpython-39-x86_64-linux-gnu.so`std::_Function_handler<void(), horovod::common::GPUOpContext::FinalizeGPUQueue(std::vector<horovod::common::TensorTableEntry>&, bool, const std::function<void()>&)::<lambda()> >::_M_invoke(__functor=0x00007fe6df3fbdb0) at std_function.h:300:37
    frame #15: 0x00007ff4466f2c58 mpi_lib.cpython-39-x86_64-linux-gnu.so`std::function<void ()>::operator(this=0x00007fe6df3fbdb0)() const at std_function.h:688:14
    frame #16: 0x00007ff44676cde9 mpi_lib.cpython-39-x86_64-linux-gnu.so`horovod::common::ThreadPool::loop(this=0x00007ff45c0d0a38) at thread_pool.cc:62:6
    frame #17: 0x00007ff44676feb9 mpi_lib.cpython-39-x86_64-linux-gnu.so`void std::__invoke_impl<void, void (horovod::common::ThreadPool::*)(), horovod::common::ThreadPool*>((null)=__invoke_memfun_deref @ 0x00007fe6df3fbe20, __f=0x00007fe6e034e780, __t=0x00007fe6e034e778)(), horovod::common::ThreadPool*&&) at invoke.h:73:46
    frame #18: 0x00007ff44676fdd3 mpi_lib.cpython-39-x86_64-linux-gnu.so`std::__invoke_result<void (horovod::common::ThreadPool::*)(), horovod::common::ThreadPool*>::type std::__invoke<void (__fn=0x00007fe6e034e780, (null)=0x00007fe6e034e778)(), horovod::common::ThreadPool*>(void (horovod::common::ThreadPool::*&&)(), horovod::common::ThreadPool*&&) at invoke.h:95:40
    frame #19: 0x00007ff44676fd23 mpi_lib.cpython-39-x86_64-linux-gnu.so`void std::thread::_Invoker<std::tuple<void (horovod::common::ThreadPool::*)(), horovod::common::ThreadPool*> >::_M_invoke<0ul, 1ul>(this=0x00007fe6e034e778, (null)=_Index_tuple<0, 1> @ 0x00007fe6df3fbe90) at thread:244:26
    frame #20: 0x00007ff44676fcc5 mpi_lib.cpython-39-x86_64-linux-gnu.so`std::thread::_Invoker<std::tuple<void (horovod::common::ThreadPool::*)(), horovod::common::ThreadPool*> >::operator(this=0x00007fe6e034e778)() at thread:251:31
    frame #21: 0x00007ff44676fc96 mpi_lib.cpython-39-x86_64-linux-gnu.so`std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (horovod::common::ThreadPool::*)(), horovod::common::ThreadPool*> > >::_M_run(this=0x00007fe6e034e770) at thread:195:13
    frame #22: 0x00007ff58fbf7490 libtensorflow_framework.so.2`execute_native_thread_routine at thread.cc:80:18
    frame #23: 0x00007ff6633b6609 libpthread.so.0`start_thread + 217
    frame #24: 0x00007ff6634f0133 libc.so.6`__clone + 67

--> This is the Horovod thread finalizing the GPU queue after an NCCL Alltoall op (launched via https://github.com/horovod/horovod/blob/7a20abeffd12c857c8f392e60fdcd1f648bffe5d/horovod/common/ops/nccl_operations.cc#L1206). It blocks at cudaEventSynchronize. Note that on rank 0 this is not running (any more).
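
To illustrate the spot where it blocks: the finalize step essentially records a CUDA event on the stream the NCCL work was enqueued on and then waits for it from a background thread. A minimal sketch of that pattern (illustrative only, not Horovod's actual code; names and structure are assumptions):

  #include <cuda_runtime.h>
  #include <cstdio>
  #include <thread>

  // Sketch: record an event after the enqueued GPU work (e.g. the NCCL
  // kernels) and block a worker thread on it. If the NCCL kernel never
  // completes, as in this hang, cudaEventSynchronize never returns.
  int main() {
    cudaStream_t stream;
    cudaEvent_t done;
    cudaStreamCreate(&stream);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    // ... enqueue collective work on `stream` here ...

    cudaEventRecord(done, stream);  // marks completion of everything queued so far

    std::thread waiter([&] {
      cudaError_t err = cudaEventSynchronize(done);  // where thread #149 is stuck
      std::printf("event finished: %s\n", cudaGetErrorString(err));
    });
    waiter.join();

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
    return 0;
  }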

I looked more closely at thread 149 in frame 11 and printed the Horovod tensor table entry: rank8_thread149_frame11.txt. It basically shows that this is an alltoall operation involving all 32 processes (all entries of splits are non-zero).

nccl-tests with NCCL_TOPO_DUMP_FILE=system.txt produced this topology file: system.txt

NCCL 2.12 came with alltoall-related optimizations, which might explain why the problem doesn't occur in earlier versions:

Improved alltoall latency by aggregating messages within a node to a given destination.

(https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-12-7.html#rel_2-12-7)

@sjeaugey
Member

Can you try again with NCCL_PXN_DISABLE=1 to disable the 2.12 optimizations?

Given the node topology, I'm not sure why it would make a difference though; there is no PCI switch, and PXN is only used for GPU-NIC communication through NVLink and PCI switches.

If it doesn't make a difference, I'd want to confirm we are actually launching the alltoall operation on all ranks (and it's not that some ranks are stuck outside of NCCL). To do that, I'd set NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG=INFO, but the output can be quite verbose, so I'd advise creating multiple files (one per rank) using a wrapper script or by setting NCCL_DEBUG_FILE=nccl_log.%p.

@maxhgerlach
Author

Can you try again with NCCL_PXN_DISABLE=1 to disable the 2.12 optimizations?

Unfortunately, with NCCL 2.16.5 I still observe these hangs even if I set NCCL_PXN_DISABLE=1. (I get log messages like [4] NCCL INFO NCCL_PXN_DISABLE set by environment to 1. from each rank, so the setting seems to be effective.) My test training just made it to step 100, which is a few steps further than before, but that could just be random chance, of course.

Given the node topology, I'm not sure why it would make a difference though; there is no PCI switch, and PXN is only used for GPU-NIC communication through NVLink and PCI switches.

We do have NVLink bridges on pairs of A40 GPUs here if that makes a difference. Also, I think we never installed drivers for GPUDirect RDMA or experimented with that class of features (because apparently we saturate 100 Gb InfiniBand quite well as it is).

If it doesn't make a difference, I'd want to confirm we are actually launching the alltoall operation on all ranks (and it's not that some ranks are stuck outside of NCCL). To do that, I'd set NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG=INFO, but the output can be quite verbose, so I'd advise creating multiple files (one per rank) using a wrapper script or by setting NCCL_DEBUG_FILE=nccl_log.%p.

Sure, I can collect these logs tomorrow or next week; I'm happy to provide any information that could help clear this up. (Though I had Horovod trace logging enabled before, and that indicated that all ranks submitted the alltoall, but only ranks 0-7 and 16-23 then submitted the allreduce that would follow.)

@sjeaugey
Member

OK, thanks for the confirmation. So indeed PXN is not the reason for the hang, given that we don't use it (no PCI switch between GPU and NIC).

Did you run with NCCL_DEBUG=WARN or NCCL_DEBUG=INFO (no NCCL_DEBUG_SUBSYS set), and if so, is there any WARN line in the log? It could be that there was an error; I'm not sure whether Horovod properly traps asynchronous NCCL communication errors and reports them.

One more thing I'd like to confirm: are you launching one process per GPU or do you manage multiple GPUs per process?

@maxhgerlach
Author

maxhgerlach commented Feb 24, 2023

Narrowing this down to not being caused by PXN is some progress. 🙂

Did you run with NCCL_DEBUG=WARN or NCCL_DEBUG=INFO (no NCCL_DEBUG_SUBSYS set), and if so, is there any WARN line in the log? It could be that there was an error; I'm not sure whether Horovod properly traps asynchronous NCCL communication errors and reports them.

The run was with NCCL_DEBUG=INFO, but there are no NCCL warning or error messages. I grepped for NCCL in the collected output of all 32 processes; here are the messages (all are INFO): stderr_NCCL.txt

One more thing I'd like to confirm: are you launching one process per GPU or do you manage multiple GPUs per process?

This is launched via MPI with one GPU per process.

@maxhgerlach
Author

If it doesn't make a difference, I'd want to confirm we are actually launching the alltoall operation on all ranks (and it's not that some ranks are stuck outside of NCCL). To do that, I'd set NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG=INFO,

I've reproduced the hang with NCCL 2.16.5 once more, this time with NCCL_PXN_DISABLE=1 set and NCCL_DEBUG_SUBSYS=COLL logging enabled.

This time ranks 0-7 and 16-23 stall:

Fri Feb 24 11:52:56 2023[1,0]<stderr>:[2023-02-24 11:52:56.451445: W horovod/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Fri Feb 24 11:52:56 2023[1,0]<stderr>:Missing ranks:
Fri Feb 24 11:52:56 2023[1,0]<stderr>:0: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:1: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:2: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:3: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:4: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:5: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:6: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:7: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:16: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:17: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:18: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:19: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:20: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:21: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:22: [some input tensors for Horovod]
Fri Feb 24 11:52:56 2023[1,0]<stderr>:23: [some input tensors for Horovod]

I extracted the tail of the NCCL INFO messages for rank 0 (stalling) and rank 8 (not stalling). For both ranks these start with 16x Allreduce, followed by a large number of Recv and Send messages.

@sjeaugey
Member

Thanks, that's very interesting. So this is not an alltoall but an alltoallv, with very different sizes between ranks. I'm a bit afraid it could be a bug in NCCL (in the way we split chunks on channels and such).

It would be awesome if you could gather the size matrix for the alltoall that hangs. Basically, for each rank, take the last 64 lines of the log (the last alltoall) and extract these columns:

[8] Recv: count 16384 root 0
[8] Send: count 933888 root 0
...
[8] Recv: count 561152 root 31
[8] Send: count 1216512 root 31

That way, we could try to reproduce the issue internally and fix it.
Alternatively, just provide the last 64 lines of each rank's log and we'll extract those values.

Thanks!

@maxhgerlach
Author

That's right; I forgot to mention that the alltoall operations aren't symmetric: they are indeed like MPI_Alltoallv. Thanks for looking into this apparent bug!

I've gathered and slightly cleaned up the Send/Recv log lines tagged with 'opCount 43e8' for each rank. Those should correspond to the last, hanging alltoallv. There are fewer than 64 lines for most ranks because some sizes may be zero, and Horovod skips the ncclSend/ncclRecv in that case (see the sketch below).

sendrecv_opCount_43e8.zip
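
To make the pattern concrete: the alltoallv here is composed of grouped ncclSend/ncclRecv calls, one pair per peer, with zero-sized transfers skipped as mentioned above. A minimal sketch (illustrative only, not Horovod's actual code; device buffers, counts, and displacements are assumed to be set up already):

  #include <cuda_runtime.h>
  #include <nccl.h>

  // Alltoallv over an existing communicator: one ncclSend and one ncclRecv
  // per peer inside a group, skipping peers with a zero count.
  ncclResult_t alltoallv(const float* sendbuf, const size_t* sendcounts, const size_t* sdispls,
                         float* recvbuf, const size_t* recvcounts, const size_t* rdispls,
                         int nranks, ncclComm_t comm, cudaStream_t stream) {
    ncclGroupStart();
    for (int peer = 0; peer < nranks; ++peer) {
      if (sendcounts[peer] > 0)
        ncclSend(sendbuf + sdispls[peer], sendcounts[peer], ncclFloat, peer, comm, stream);
      if (recvcounts[peer] > 0)
        ncclRecv(recvbuf + rdispls[peer], recvcounts[peer], ncclFloat, peer, comm, stream);
    }
    return ncclGroupEnd();
  }

With many zero counts, different ranks end up posting different subsets of sends and receives, which is the situation captured in the attached logs.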

@sjeaugey
Member

Thanks a lot! That should be enough to reproduce the alltoallv scenario.

@sjeaugey
Member

Hi, I've looked at the trace and found, for example, that 04 was not sending anything to 17, but 17 was receiving 1597440 bytes from 04.

I've not found any size mismatch, only cases where one side was not sending or receiving at all (size = 0?) while the other side did send or receive.

The full diff:

04 -> 17 send  recv 1597440
11 -> 03 send 36864 recv 
11 -> 05 send 24576 recv 
11 -> 06 send 86016 recv 
11 -> 07 send 40960 recv 
11 -> 09 send 32768 recv 
11 -> 11 send 69632 recv 
11 -> 13 send 16384 recv 
11 -> 14 send 32768 recv 
11 -> 18 send 69632 recv 
11 -> 23 send 20480 recv 
11 -> 24 send 28672 recv 
11 -> 26 send 61440 recv 
11 -> 27 send 20480 recv 
11 -> 28 send 65536 recv 
12 -> 12 send 4096 recv 
16 -> 15 send 4096 recv 
16 -> 22 send 4096 recv 
19 -> 11 send  recv 192512
19 -> 29 send 4096 recv 
22 -> 03 send 12288 recv 
22 -> 05 send 40960 recv 
22 -> 09 send 16384 recv 
22 -> 16 send 32768 recv 
22 -> 24 send 36864 recv 
25 -> 11 send  recv 184320
28 -> 22 send  recv 1630208
29 -> 04 send 888832 recv 
29 -> 21 send 794624 recv 
30 -> 20 send  recv 704512
30 -> 31 send  recv 622592

@maxhgerlach
Author

maxhgerlach commented Mar 1, 2023

Sorry for that. Some lines were missing in the logs.

I fixed that and parsed them myself this time. Here, send_mat is the transpose of recv_mat, so I think the data is consistent now.

recv_mat.csv.txt
send_mat.csv.txt
fixed_logs_and_jupyter_notebook.zip
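
The consistency check amounts to requiring that, for every pair of ranks (i, j), the count rank i sends to rank j equals the count rank j receives from rank i, i.e. send_mat is the transpose of recv_mat. A minimal illustration of that check (assuming the matrices are already loaded into memory; illustrative only, not the notebook's code):

  #include <cstdio>
  #include <vector>

  // Returns true if send_mat[i][j] == recv_mat[j][i] for all ranks i, j.
  // An alltoallv can only complete if this holds for every pair of ranks.
  bool is_consistent(const std::vector<std::vector<long>>& send_mat,
                     const std::vector<std::vector<long>>& recv_mat) {
    const size_t n = send_mat.size();
    for (size_t i = 0; i < n; ++i)
      for (size_t j = 0; j < n; ++j)
        if (send_mat[i][j] != recv_mat[j][i]) {
          std::printf("mismatch: %zu -> %zu send %ld recv %ld\n", i, j,
                      send_mat[i][j], recv_mat[j][i]);
          return false;
        }
    return true;
  }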

@maxhgerlach changed the title from "Alltoall hangs with NCCL 2.12 and newer" to "Alltoallv hangs with NCCL 2.12 and newer" on Mar 9, 2023
@sjeaugey
Member

Hi, I finally found time to try it again, and I believe I managed to reproduce it.

Can you check whether setting NCCL_MIN_NCHANNELS=16 works around the issue? Just to confirm we're seeing the same thing.

@sjeaugey
Member

Ok I could confirm the issue is the one reported here: #784 (comment).

We don't send to some ranks, so the workElems start to mix peers from different nodes, and we end up with a deadlock when we don't have enough channels for concurrent progress.

Sorry for the delay. We'll try to fix this ASAP.

@maxhgerlach
Author

Hi @sjeaugey,

Hi, I finally found time to try it again, and I believe I managed to reproduce it.

Can you check whether setting NCCL_MIN_NCHANNELS=16 works around the issue? Just to confirm we're seeing the same thing.

I found some time today to look into this again: Indeed, the hangs disappear with NCCL 2.16.5 when I set NCCL_MIN_NCHANNELS=16. It's great to have this workaround and to confirm we see the same issue!

We'll try to fix this ASAP.

Awesome, thanks a lot!

sjeaugey added a commit that referenced this issue Apr 19, 2023
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and remote ID (#779).
Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
@sjeaugey
Member

sjeaugey commented Apr 19, 2023

@maxhgerlach alltoallv bugs should be solved with NCCL 2.18, which was just posted as a preview on the v2.18 branch. Feel free to give it a try. Thanks!

@maxhgerlach
Author

Sorry for the late response, @sjeaugey.

With the NCCL 2.18.1-1 preview I don't see any hangs in 10k+ training steps (without having to adapt NCCL_MIN_NCHANNELS; according to the logs NCCL uses just two channels). So it really looks like it fixes our bug. Thank you!

@sjeaugey
Member

sjeaugey commented May 5, 2023

Thanks a lot for confirming. Closing, as I merged 2.18.1 into master.

@sjeaugey closed this as completed on May 5, 2023