
subprocess.CalledProcessError with Multi-GPU Training #501

Closed
yiwangde opened this issue Jul 24, 2020 · 4 comments · Fixed by #504
Labels
question Further information is requested

Comments

@yiwangde

yiwangde commented Jul 24, 2020

❔Question

I want to train on 2 machines; each machine has 4 GPUs. I changed the code to:
dist.init_process_group(backend='nccl', init_method='tcp://' + opt.ip, world_size=2, rank=opt.rank)
and then ran:
$ python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=0
$ python -m torch.distributed.launch --nproc_per_node 8 train.py --ip='125.217.41.155:8888' --rank=1

But unfortunately:
Traceback (most recent call last):
File "train.py", line 472, in <module>
train(hyp, tb_writer, opt, device)
File "train.py", line 118, in train
with torch_distributed_zero_first(rank):
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/contextlib.py", line 112, in __enter__
return next(self.gen)
File "/home/luguangjian/yolov5-master/utils/utils.py", line 41, in torch_distributed_zero_first
torch.distributed.barrier()
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:514, unhandled system error, NCCL version 2.4.8

Traceback (most recent call last):
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/luguangjian/.conda/envs/yolov5/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/luguangjian/.conda/envs/yolov5/bin/python', '-u', 'train.py', '--local_rank=3', '--ip=125.217.41.155:8878', '--rank=1']' returned non-zero exit status 1.

Has anyone had the same problem? What should I do?
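For context on the mismatch here: torch.distributed.launch spawns nproc_per_node processes on each node, so the world_size passed to init_process_group must equal nnodes * nproc_per_node, not the number of machines. The hard-coded world_size=2 above therefore conflicts with the 8 processes being launched per node. A minimal sketch of the arithmetic (these helper names are illustrative, not from train.py):

```python
# world_size must count every process in the job, not every machine.
def expected_world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of distributed processes across all nodes."""
    return nnodes * nproc_per_node

def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Unique rank of one process: node offset plus local GPU index."""
    return node_rank * nproc_per_node + local_rank

# Two nodes with 4 GPUs each -> 8 processes in total.
print(expected_world_size(2, 4))  # 8, not 2
# Third GPU (local_rank=2) on the second node (node_rank=1):
print(global_rank(1, 4, 2))       # 6
```

With world_size=2 but 8 processes calling torch.distributed.barrier(), the process group never sees the members it expects, which is consistent with the barrier failing first.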

Additional context
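The NCCL "unhandled system error" above hides the real cause; turning on NCCL's own logging usually reveals it (a blocked port, or NCCL picking a network interface that cannot reach the other node). These are standard NCCL environment variables, not anything from this repo; the interface name below is only an example:

```shell
# Print NCCL's internal log so the underlying socket/interface error is visible.
export NCCL_DEBUG=INFO
# If the nodes have several interfaces, pin NCCL to the one that can reach the
# other node ("eth0" here is only an example -- check `ip addr` on your machines).
export NCCL_SOCKET_IFNAME=eth0
```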

@yiwangde yiwangde added the question Further information is requested label Jul 24, 2020
@NanoCode012
Contributor

NanoCode012 commented Jul 24, 2020

Hi @yiwangde, when writing the code, we expected it to run on only one node/machine. It wouldn't be too hard to update, but I do not have the machines to test it on.

Edit: I am working on a PR for this.

@NanoCode012
Contributor

NanoCode012 commented Jul 24, 2020

Hi @yiwangde, I've made the PR. Could you please tell me if it works for you? I could only test on my own machine from two terminals.
Repo: https://github.com/NanoCode012/yolov5/tree/muti_node

Command for you:

# node 1
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr="125.217.41.155" --master_port=8888 train.py ...
# node 2
$ python -m torch.distributed.launch --nproc_per_node=4 --nnodes=2 --node_rank=1 --master_addr="125.217.41.155" --master_port=8888 train.py ...
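With those flags, the launcher exports MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every worker, so the script can use init_method='env://' and read everything from the environment instead of hard-coding world_size and rank. A rough sketch of the generic PyTorch pattern (this is not the exact code in the PR; the helper below just shows which variable carries which value):

```python
import os

def dist_config_from_env(env=os.environ):
    """Collect the settings torch.distributed.launch exports for each worker.

    With init_method='env://', torch.distributed.init_process_group reads
    MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE itself; this helper only makes
    the mapping explicit.
    """
    return {
        'master_addr': env['MASTER_ADDR'],
        'master_port': int(env['MASTER_PORT']),
        'rank': int(env['RANK']),
        'world_size': int(env['WORLD_SIZE']),
    }

# Simulate what the launcher would export for rank 5 of the 8-process job above:
fake_env = {'MASTER_ADDR': '125.217.41.155', 'MASTER_PORT': '8888',
            'RANK': '5', 'WORLD_SIZE': '8'}
cfg = dist_config_from_env(fake_env)
print(cfg['world_size'])  # 8
```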

@yiwangde
Author

yiwangde commented Jul 27, 2020 via email

@yiwangde
Author

Thanks for your help.
