-
Notifications
You must be signed in to change notification settings - Fork 22.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NCCL] warn when barrier guesses device to use #54991
Conversation
Actual proposed fix is in #53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 532b673 (more details on the Dr. CI page):
This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Actual proposed fix is in #53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/) ghstack-source-id: 125304647 Pull Request resolved: #54991
" using best-guess GPU ", | ||
deviceIdx, | ||
" to perform barrier as devices used by this process are currently unknown. ", | ||
"Specify device_ids in barrier() to force use of a particular device." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to say "To avoid any potential hang due to a wrong guess, please specify ...". This can improve the debuggability by mentioning the potential outcome if the warning here is ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for improving the debuggability!
torch/lib/c10d/ProcessGroupNCCL.cpp
Outdated
" to perform barrier as devices used by this process are currently unknown. ", | ||
"Specify device_ids in barrier() to force use of a particular device." | ||
); | ||
LOG(WARNING) << bestGuessWarning; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Since bestGuessWarning
wont' be used anywhere else, can just log everything directly after LOG(WARNING)
and no need to construct a separate string.
Actual proposed fix is in #53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/) [ghstack-poisoned]
Actual proposed fix is in #53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/) [ghstack-poisoned]
Pull Request resolved: #54991 Actual proposed fix is in #53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). ghstack-source-id: 126351889 Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)
This pull request has been merged in 657b66e. |
Summary: Pull Request resolved: pytorch#54991 Actual proposed fix is in pytorch#53934, in the meantime, would be useful to include this LOG when barrier does not know what devices to use, and suggest the workaround of passing in device_ids into barrier(). ghstack-source-id: 126351889 Test Plan: CI Reviewed By: SciPioneer Differential Revision: D27444917 fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
Stack from ghstack:
Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().
Differential Revision: D27444917