subprocess.CalledProcessError with Multi-GPU Training #501
Hi @yiwangde , when writing the code, we expected it to run on only one machine.

Edit: I am working on a PR for this.
Hi @yiwangde , I've made the PR. Could you please tell me if it works for you? I could only test on my own machine from two terminals.

Command for you:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="125.217.41.155" --master_port=8888 train.py ...

# node 2
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="125.217.41.155" --master_port=8888 train.py ...
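For readers following along, here is a minimal sketch (not part of the PR; the names `global_rank`, `ranks_node0`, etc. are illustrative) of how `torch.distributed.launch` turns these flags into per-process ranks: each node spawns `nproc_per_node` workers, and each worker's global rank is derived from `node_rank` and its local rank.

```python
# Illustrative sketch of the launcher's rank bookkeeping for the two
# commands above (4 processes per node, 2 nodes). Not yolov5 code.
nnodes = 2
nproc_per_node = 4

def global_rank(node_rank, local_rank, nproc_per_node=nproc_per_node):
    """Global rank of one worker, as torch.distributed.launch computes it."""
    return node_rank * nproc_per_node + local_rank

# The world size seen by every process is the TOTAL process count.
world_size = nnodes * nproc_per_node  # 8

# node 0 hosts global ranks 0-3; node 1 hosts global ranks 4-7.
ranks_node0 = [global_rank(0, lr) for lr in range(nproc_per_node)]
ranks_node1 = [global_rank(1, lr) for lr in range(nproc_per_node)]
```

All processes rendezvous at the same `--master_addr`/`--master_port`, which is why both commands point at node 0's address.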
Thanks a lot. I've got it.

------------------ Original Message ------------------
From: "NanoCode012" <notifications@github.com>
Date: Friday, July 24, 2020, 10:15 PM
To: "ultralytics/yolov5" <yolov5@noreply.github.com>
Cc: "1375774056" <1375774056@qq.com>; "Mention" <mention@noreply.github.com>
Subject: Re: [ultralytics/yolov5] subprocess.CalledProcessError with Multi-GPU Training (#501)
Hi @yiwangde , I've made the PR. If you want to test, you can do so.
Repo: https://github.com/NanoCode012/yolov5/tree/muti_node
Command for you:
# node 1
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 train.py ...

# node 2
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 train.py ...
Thanks for your help.
❔Question
I want to train on my 2 machines. Each machine has 4 GPUs, and I changed the code:
dist.init_process_group(backend='nccl', init_method='tcp://' + opt.ip, world_size=2, rank=opt.rank)
then :
$python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=0
$python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=1
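(For context, a hedged sketch of the process counts these two commands actually create, assuming one launcher run per machine; the variable names below are illustrative, not from train.py. Note that `world_size` in `init_process_group` counts processes, not machines.)

```python
# Illustrative arithmetic only: torch.distributed.launch spawns
# nproc_per_node worker processes on each machine it is run on.
nnodes = 2           # two machines, one launch command each
nproc_per_node = 8   # --nproc_per_node 8 as in the commands above

# Total processes that will call init_process_group:
total_processes = nnodes * nproc_per_node  # 16, while world_size=2 was passed
```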
But unfortunately:
Traceback (most recent call last):
  File "train.py", line 472, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 118, in train
    with torch_distributed_zero_first(rank):
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/luguangjian/yolov5-master/utils/utils.py", line 41, in torch_distributed_zero_first
    torch.distributed.barrier()
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8
Traceback (most recent call last):
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/luguangjian/.conda/envs/yolov5/bin/python', '-u', 'train.py', '--local_rank=3', '--ip=125.217.41.155:8878', '--rank=1']' returned non-zero exit status 1.
Has anyone had the same problem? What should I do?