Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NCCL] warn when barrier guesses device to use #54991

Closed
wants to merge 3 commits into from

Conversation

rohan-varma
Copy link
Member

@rohan-varma rohan-varma commented Mar 30, 2021

Stack from ghstack:

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: D27444917

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 30, 2021

💊 CI failures summary and remediations

As of commit 532b673 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-scanned failure(s)

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@facebook-github-bot facebook-github-bot added oncall: distributed Add this issue/PR to distributed oncall triage queue cla signed labels Mar 30, 2021
rohan-varma added a commit that referenced this pull request Mar 30, 2021
Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

ghstack-source-id: 125304647
Pull Request resolved: #54991
@rohan-varma rohan-varma changed the title [NCCL] Log when barrier guesses device to use [NCCL] warn when barrier guesses device to use Mar 30, 2021
" using best-guess GPU ",
deviceIdx,
" to perform barrier as devices used by this process are currently unknown. ",
"Specify device_ids in barrier() to force use of a particular device."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to say "To avoid any potential hang due to a wrong guess, please specify ...". This can improve the debuggability by mentioning the potential outcome if the warning here is ignored.

Copy link
Contributor

@wayi1 wayi1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for improving the debuggability!

" to perform barrier as devices used by this process are currently unknown. ",
"Specify device_ids in barrier() to force use of a particular device."
);
LOG(WARNING) << bestGuessWarning;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Since bestGuessWarning wont' be used anywhere else, can just log everything directly after LOG(WARNING) and no need to construct a separate string.

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Apr 12, 2021
Pull Request resolved: #54991

Actual proposed fix is in
#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().
ghstack-source-id: 126351889

Differential Revision: [D27444917](https://our.internmc.facebook.com/intern/diff/D27444917/)
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 657b66e.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/275/head branch April 17, 2021 14:17
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#54991

Actual proposed fix is in
pytorch#53934, in the meantime, would be useful
to include this LOG when barrier does not know what devices to use, and suggest
the workaround of passing in device_ids into barrier().
ghstack-source-id: 126351889

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D27444917

fbshipit-source-id: 0f269c5a7732e5be6e51adfca7ef70d04ffd71d3
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants