CUDA multi-GPU --device bug #1695
Hello @huangfeng95, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.
If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.
If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.
For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.
Requirements
Python 3.8 or later with all requirements.txt dependencies installed, including
$ pip install -r requirements.txt
Environments
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
Status
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.
@huangfeng95 yes, there seems to be a bug in our --device selection method with more recent versions of torch. The current workaround is to define the visible devices before running a command:
$ CUDA_VISIBLE_DEVICES=2,3 python train.py ...
or
$ export CUDA_VISIBLE_DEVICES=2,3
$ python train.py ...
TODO: CUDA multi-gpu --device bug.
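For anyone launching training from a Python script rather than a shell, the same workaround can be sketched as below. This is a minimal sketch, not part of YOLOv5: `env_with_visible_devices` is a hypothetical helper name, and it assumes the variable has to be in place before the training process starts, which is what the shell commands above achieve.

```python
import os


def env_with_visible_devices(device_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    Hypothetical helper for illustration; not a YOLOv5 or PyTorch function.
    """
    # Copy so the current process environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # CUDA expects a comma-separated list of device indices, e.g. "2,3".
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in device_ids)
    return env


if __name__ == "__main__":
    env = env_with_visible_devices([2, 3])
    print(env["CUDA_VISIBLE_DEVICES"])
    # The environment can then be passed to a child process, for example:
    # subprocess.run([sys.executable, "train.py", ...], env=env)
```

Passing the environment to a fresh child process avoids depending on when the parent process first touches CUDA.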
@NanoCode012 do you have any insight into this bug? The only thing I can think of is that the torch functionality changed in a recent release, and our current method of setting CUDA_VISIBLE_DEVICES right before checking devices no longer works. The relevant function is here; I'm out of ideas for a fix.
Lines 46 to 69 in e92245a
>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1,2'
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1'
>>> torch.cuda.device_count()
3
>>> del(torch)
>>> import torch
>>> torch.cuda.device_count()
3
>>> torch.cuda.init()
>>> torch.cuda.device_count()
3
@glenn-jocher I'm testing my environments right now. Both show the same problem. I think something might have happened to the code base, because I have not updated any of my environments for quite a while.
Edit: Hm, running the above from the code terminal produced confusing results. Will go test some more.
@glenn-jocher, good news (?)! I tracked down the commit which caused this error (by manually going through commits daily). Luckily, it was only a few days ago. It was the Profile PR ada90e3. To reproduce:
# Working on 11th Dec
git checkout -b 11a 94a7f55c4e5cca3dfe4de0bd0793173d5b152ec5
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1
# Error on 12th Dec
git checkout -b 12 ada90e3901da9d24f88e4d1378be96265770a932
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1
I am not sure what exactly in this PR caused this change, because it seems so isolated.
After some tests, it was this: you assigned the device in the function declaration. I'll submit a PR for this.
Line 78 in 69ea70c
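To illustrate why assigning something in a function declaration can matter even when the function is never called, here is a minimal sketch of Python's default-argument behaviour. It does not use torch: `query_device`, `profile_eager` and `profile_lazy` are hypothetical stand-ins, and the assumption is that the real default expression touched CUDA at import time, before CUDA_VISIBLE_DEVICES was set.

```python
calls = []


def query_device():
    """Hypothetical stand-in for an expression that initializes CUDA."""
    calls.append("queried")
    return "cuda:0"


# The default value is evaluated once, when the `def` statement runs
# (that is, at import time), even if the function is never called.
def profile_eager(device=query_device()):
    return device


# Deferring the lookup to call time avoids the import-time side effect.
def profile_lazy(device=None):
    if device is None:
        device = query_device()
    return device


# Snapshot taken before either function has been called.
calls_after_definition = list(calls)

if __name__ == "__main__":
    print(calls_after_definition)
```

With the lazy form, nothing is queried until the function is actually called, so an environment variable set in between is still honoured.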
@huangfeng95 @glenn-jocher, could you please test the PR and tell me whether it works for you? I tested that this change worked in both of my environments above. What a nasty bug!
git clone https://github.com/NanoCode012/yolov5.git -b gpu-fix && cd yolov5
python train.py --epochs 3 --device 0 # Single
python train.py --epochs 3 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --device 0,1 # DDP
@NanoCode012 wow!! We should give you a medal for finding this bug! That's very bizarre. The profile() function is not actually used anywhere during normal repo operation; it's only there for manual use when comparing two modules, for example when developing new architectures. I used it here to evaluate a Focus() module alternative #1274 (comment). It does all the dirty work for you and shows you the time impact of the forward and backward passes. I'll test the PR.
@huangfeng95 this bug should be resolved now thanks to PR #1701 by @NanoCode012. Please git pull to receive this update and let us know if you have any other problems!
❔Question
I specified the --device parameter, but it did not work.
Can you please tell me what to troubleshoot?
Additional context