Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print remote and local GID in error trace #779

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

YulunW
Copy link

@YulunW YulunW commented Feb 2, 2023

The remote and local GID information would be very useful in finding the root cause of a network issue. It could help us quickly pinpoint the two faulty GIDs without inspecting all the possible paths. Therefore, we would like to include this information in the error trace to speed up error tracing.

Tested the performance with all_reduce_perf -b 8 -e 256M -f 2 -g 2 before and after the change, there is no impact to the performance:
Before:

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
   134217728      33554432     float     sum      -1    751.6  178.57  178.57      0    750.3  178.88  178.88      0
   268435456      67108864     float     sum      -1   1423.2  188.62  188.62      0   1424.1  188.49  188.49      0

After:

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
   134217728      33554432     float     sum      -1    751.6  178.58  178.58      0    749.8  179.01  179.01      0
   268435456      67108864     float     sum      -1   1422.7  188.68  188.68      0   1422.5  188.71  188.71      0

Add local and remote GID info in error trace.
Modified ncclIbSendComm and ncclIbReceiveComm to include remote and local
GID information.
@YulunW YulunW marked this pull request as ready for review February 2, 2023 21:35
@sjeaugey
Copy link
Member

sjeaugey commented Feb 3, 2023

Thanks for the request. It makes sense. I'm not sure I'd have implemented it that way, but this would indeed help debugging. We're pretty busy at the moment so not sure we'll have time to work on that soon, but please ping us again if we don't make any progress.

src/transport/net_ib.cc Outdated Show resolved Hide resolved
src/transport/net_ib.cc Show resolved Hide resolved
src/transport/net_ib.cc Outdated Show resolved Hide resolved
src/transport/net_ib.cc Outdated Show resolved Hide resolved
ncclSocketToString(&addr, line), wc->status, wc->opcode, wc->byte_len, wc->vendor_err);
if (r->sendComm != NULL) {
inet_ntop(AF_INET6, &(r->sendComm->localGid), localGid, sizeof(localGid));
inet_ntop(AF_INET6, &(r->sendComm->remoteGid), remoteGid, sizeof(remoteGid));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These inet_ntop() calls may fail by returning NULL (see https://man7.org/linux/man-pages/man3/inet_ntop.3.html). For completeness, I think we should check inet_ntop() failures, in which case we can either fall back to the old WARN message or just print "unknown" for whichever GID that failed?

Also I think you can refactor this error handle out to a helper method if you feel it is getting complex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the logic to handle inet_ntop failures.

Yulun Wang added 3 commits February 9, 2023 20:45
Add comment for remoteGid and localGid varialbe in recv/send comm to
explain what they are and why we need them in the struct.
Since it is possible for inet_ntop to fail, add logic to handle the
case where inet_ntop failed to parse GID as an INET6 address. Also
combined some common logic to make the code a little cleaner.
Fix a bug that will cause segfault due to reading sendComm in the path
where it is supposed to read readComm.
@AddyLaddy
Copy link
Collaborator

There are compilation warnings from your patch. I believe the WARN messages are not being passed the correct string parameters:

transport/net_ib.cc:1289:16: warning: format ‘%s’ expects argument of type ‘char*’, but argument 12 has type ‘ibv_gid*’ [-Wformat=]
 1289 |           WARN("NET/IB : Got completion from peer %s with error %d, opcode %d, len %d, vendor err %d (%s) localGid %s remoteGid %s",
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1290 |               ncclSocketToString(&addr, line), wc->status, wc->opcode, wc->byte_len, wc->vendor_err,commType, localGid, remoteGid);
      |                                                                                                               ~~~~~~~~
      |                                                                                                               |
      |                                                                                                               ibv_gid*

Previeously the WARN message was using the ibv_gid as string by mistake.
Change it to use the correct string version of localGid and remoteGid.
@YulunW
Copy link
Author

YulunW commented Feb 21, 2023

Ah thanks for the heads up! Yes I am supposed to use the string version. Just changed it.

sjeaugey added a commit that referenced this pull request Apr 19, 2023
Add support for IB SHARP to NVLS (NVLink SHARP algorithm).
Add NVLS+Tree algorithm.
Add support for memory management using cuMem* functions.
Use all NICs for Send/Receive operations on systems with more than
one NIC per GPU (#804).
Add ncclCommSplit primitive, with resource sharing option in config.
Fix alltoallv hang (#788)
Increase number of channels on H100 when we're not limited by NVLink.
Improve error reporting in case of IB failure, printing local and
remote ID (#779).
Add build option to allow compilation against RDMA includes instead
of dynamically loading IB verbs symbols (#802).
Fix context creation for progress thread (#803).
NET/IB: add option to use multiple QPs in round-robin mode.
Fix tree performance issue when NVB is disabled on HCM topologies.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants