Update DDP for torch.distributed.run with gloo backend #3680

Merged · Jun 19, 2021 · 35 commits

Commits (changes shown below are from one commit, e8493c6)
All 35 commits are by glenn-jocher (the env-var pattern these converge on is sketched after the list):

007902e  Jun 18, 2021  Update DDP for `torch.distributed.run`
9bcb4ad  Jun 18, 2021  Add LOCAL_RANK
b32bae0  Jun 18, 2021  remove opt.local_rank
b467501  Jun 18, 2021  backend="gloo|nccl"
c886538  Jun 18, 2021  print
5d847dc  Jun 18, 2021  print
26d0ecf  Jun 18, 2021  debug
832ba4c  Jun 18, 2021  debug
9a1bb01  Jun 18, 2021  os.getenv
0e912df  Jun 18, 2021  gloo
5f5e428  Jun 18, 2021  gloo
e8493c6  Jun 18, 2021  gloo
fb342fc  Jun 18, 2021  cleanup
382ce4f  Jun 18, 2021  fix getenv
b09b415  Jun 18, 2021  cleanup
9c4ac05  Jun 18, 2021  cleanup destroy
8ae9ea1  Jun 18, 2021  try nccl
a18f933  Jun 19, 2021  merge master
2435775  Jun 19, 2021  return opt
56a4ab4  Jun 19, 2021  add --local_rank
c4d839b  Jun 19, 2021  add timeout
0584e7e  Jun 19, 2021  add init_method
d917341  Jun 19, 2021  gloo
6a1cc64  Jun 19, 2021  move destroy
3581c76  Jun 19, 2021  move destroy
5f5d122  Jun 19, 2021  move print(opt) under if RANK
5451fc2  Jun 19, 2021  destroy only RANK 0
9aa229e  Jun 19, 2021  move destroy inside train()
94363ce  Jun 19, 2021  restore destroy outside train()
9647379  Jun 19, 2021  update print(opt)
cb8395d  Jun 19, 2021  merge master
96686fd  Jun 19, 2021  cleanup
446c610  Jun 19, 2021  nccl
49bb0b7  Jun 19, 2021  gloo with 60 second timeout
b5decde  Jun 19, 2021  update namespace printing
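
A thread running through these commits (Add LOCAL_RANK, remove opt.local_rank, os.getenv, fix getenv) is replacing the argparse --local_rank flag with the environment variables that torch.distributed.run exports to each worker. A minimal sketch of that pattern, with single-process fallback defaults assumed here rather than taken from the merged code:

    import os

    # torch.distributed.run sets these for every worker it spawns; the
    # defaults assumed here let the same script also run as a plain
    # single-process job.
    LOCAL_RANK = int(os.getenv('LOCAL_RANK', -1))  # GPU index on this node
    RANK = int(os.getenv('RANK', -1))              # global process index
    WORLD_SIZE = int(os.getenv('WORLD_SIZE', 1))   # total number of processes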
Commit e8493c6065c27b6dd36a521aa372a9e9d115fd0b ("gloo")
glenn-jocher committed Jun 18, 2021
2 changes: 1 addition & 1 deletion train.py
@@ -538,7 +538,7 @@ def train(hyp,  # path/to/hyp.yaml or hyp dictionary
     assert torch.cuda.device_count() > LOCAL_RANK
     torch.cuda.set_device(LOCAL_RANK)
     device = torch.device('cuda', LOCAL_RANK)
-    dist.init_process_group(backend="nccl")  # distributed backend
+    dist.init_process_group(backend="gloo")  # distributed backend
Review comment from a Contributor:
nccl should be the faster backend for DDP. I recall that Windows only supports gloo, however.

     assert opt.batch_size % WORLD_SIZE == 0, '--batch-size must be multiple of CUDA device count'
     assert not opt.image_weights, '--image-weights argument is not compatible with DDP training'
     opt.batch_size = opt.total_batch_size // WORLD_SIZE
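
The later commits add a timeout and an explicit init_method, and the PR's final state uses gloo with a 60-second timeout. A minimal sketch of that initialization pattern, assuming a launch via torch.distributed.run; the backend selection, the rank-0 destroy guard, and the exact values are illustrative rather than the merged code:

    import os
    from datetime import timedelta

    import torch.distributed as dist

    # Assumes launch via torch.distributed.run, which sets RANK and WORLD_SIZE.
    RANK = int(os.getenv('RANK', 0))

    # nccl is generally the faster backend for CUDA DDP, but Windows builds of
    # PyTorch ship only gloo, which motivates the experiments in this PR.
    backend = 'nccl' if dist.is_nccl_available() else 'gloo'
    dist.init_process_group(backend=backend,
                            init_method='env://',           # MASTER_ADDR/MASTER_PORT from env
                            timeout=timedelta(seconds=60),  # per "gloo with 60 second timeout"
                            rank=RANK,
                            world_size=int(os.getenv('WORLD_SIZE', 1)))

    # ... training ...

    # Per the "destroy only RANK 0" commit; the commit history shows this call
    # was moved several times before merge, so treat the placement as illustrative.
    if RANK == 0:
        dist.destroy_process_group()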