
subprocess.CalledProcessError with Multi-GPU Training #501

Closed
yiwangde opened this issue Jul 24, 2020 · 4 comments · Fixed by #504
Labels
question Further information is requested

Comments

@yiwangde

yiwangde commented Jul 24, 2020

❔Question

I want to train on 2 machines; each machine has 4 GPUs. I changed the code to:
dist.init_process_group(backend='nccl', init_method='tcp://' + opt.ip, world_size=2, rank=opt.rank)
and then ran:
$ python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=0
$ python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=1

But unfortunately:
Traceback (most recent call last):
File "train.py", line 472, in <module>
train(hyp, tb_writer, opt, device)
File "train.py", line 118, in train
with torch_distributed_zero_first(rank):
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/home/luguangjian/yolov5-master/utils/utils.py", line 41, in torch_distributed_zero_first
torch.distributed.barrier()
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8

Traceback (most recent call last):
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/luguangjian/.conda/envs/yolov5/bin/python', '-u', 'train.py', '--local_rank=3', '--ip=125.217.41.155:8878', '--rank=1']' returned non-zero exit status 1.

Has anyone had the same problem? What should I do?
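For context on the mismatch here: torch.distributed.launch spawns nproc_per_node processes on each node, so the world_size passed to init_process_group must equal nnodes * nproc_per_node, not the number of machines. The hard-coded world_size=2 above therefore conflicts with the 8 processes being launched per node. A minimal sketch of the arithmetic (these helper names are illustrative, not from train.py):

```python
# world_size must count every process in the job, not every machine.
def expected_world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of distributed processes across all nodes."""
    return nnodes * nproc_per_node

def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Unique rank of one process: node offset plus local GPU index."""
    return node_rank * nproc_per_node + local_rank

# Two nodes with 4 GPUs each -> 8 processes in total.
print(expected_world_size(2, 4))  # 8, not 2
# Third GPU (local_rank=2) on the second node (node_rank=1):
print(global_rank(1, 4, 2))       # 6
```

With world_size=2 but 8 processes calling torch.distributed.barrier(), the process group never sees the members it expects, which is consistent with the barrier failing first.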

Additional context
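The NCCL "unhandled system error" above hides the real cause; turning on NCCL's own logging usually reveals it (a blocked port, or NCCL picking a network interface that cannot reach the other node). These are standard NCCL environment variables, not anything from this repo; the interface name below is only an example:

```shell
# Print NCCL's internal log so the underlying socket/interface error is visible.
export NCCL_DEBUG=INFO
# If the nodes have several interfaces, pin NCCL to the one that can reach the
# other node ("eth0" here is only an example -- check `ip addr` on your machines).
export NCCL_SOCKET_IFNAME=eth0
```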

@yiwangde yiwangde added the question Further information is requested label Jul 24, 2020
@NanoCode012
Contributor

NanoCode012 commented Jul 24, 2020

Hi @yiwangde, when writing the code, we expected it to run on only one node/machine. It wouldn't be too hard to update, but I do not have the machines to test it on.

Edit: I am working on a PR for this.

@NanoCode012
Contributor

NanoCode012 commented Jul 24, 2020

Hi @yiwangde, I've made the PR. Could you please tell me if it works for you? I could only test on my own machine from two terminals.
Repo: https://github.com/NanoCode012/yolov5/tree/muti_node

Command for you:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="125.217.41.155" --master_port=8888 train.py ...
# node 2
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="125.217.41.155" --master_port=8888 train.py ...
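With those flags, the launcher exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every worker, so the script can use init_method='env://' and read everything from the environment instead of hard-coding world_size and rank. A rough sketch of the generic PyTorch pattern (this is not the exact code in the PR; the helper below just shows which variable carries which value):

```python
import os

def dist_config_from_env(env=os.environ):
    """Collect the settings torch.distributed.launch exports for each worker.

    With init_method='env://', torch.distributed.init_process_group reads
    MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE itself; this helper only makes
    the mapping explicit.
    """
    return {
        'master_addr': env['MASTER_ADDR'],
        'master_port': int(env['MASTER_PORT']),
        'rank': int(env['RANK']),
        'world_size': int(env['WORLD_SIZE']),
    }

# Simulate what the launcher would export for rank 5 of the 8-process job above:
fake_env = {'MASTER_ADDR': '125.217.41.155', 'MASTER_PORT': '8888',
            'RANK': '5', 'WORLD_SIZE': '8'}
cfg = dist_config_from_env(fake_env)
print(cfg['world_size'])  # 8
```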

@yiwangde
Author

yiwangde commented Jul 27, 2020 via email

@yiwangde
Author

Thanks for your help.
