
CUDA multi-GPU --device bug #1695

Closed
huangfeng95 opened this issue Dec 15, 2020 · 9 comments · Fixed by #1701
Assignees: glenn-jocher
Labels: bug (Something isn't working), question (Further information is requested)

Comments

huangfeng95 commented Dec 15, 2020

❔Question

I specified the --device parameter but it does not work.
Could you please tell me how to troubleshoot this?

Additional context

[screenshot attached]

huangfeng95 added the question (Further information is requested) label Dec 15, 2020
Contributor

github-actions bot commented Dec 15, 2020

Hello @huangfeng95, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Member

glenn-jocher commented Dec 15, 2020

@huangfeng95 yes, it seems there is a bug in our --device selection method with more recent versions of torch. The current workaround is to define the visible devices before running a command:

$ CUDA_VISIBLE_DEVICES=2,3 python train.py ...

or 

$ export CUDA_VISIBLE_DEVICES=2,3
$ python train.py ...

TODO: CUDA multi-gpu --device bug.

glenn-jocher changed the title from "It seems that i could not choose the specific GPU" to "CUDA multi-gpu --device bug" Dec 15, 2020
glenn-jocher added the bug (Something isn't working) and TODO labels Dec 15, 2020
glenn-jocher self-assigned this Dec 15, 2020
glenn-jocher changed the title from "CUDA multi-gpu --device bug" to "CUDA multi-GPU --device bug" Dec 15, 2020
@glenn-jocher
Member

@NanoCode012 do you have any insight into this bug? The only thing I can think of is that the torch functionality changed in a recent release, and our current method of setting CUDA_VISIBLE_DEVICES right before checking devices no longer works. The relevant function is below; I'm short on ideas for a fix.

def select_device(device='', batch_size=None):
    # device = 'cpu' or '0' or '0,1,2,3'
    cpu_request = device.lower() == 'cpu'
    if device and not cpu_request:  # if device requested other than 'cpu'
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # set environment variable
        assert torch.cuda.is_available(), 'CUDA unavailable, invalid device %s requested' % device  # check availability

    cuda = False if cpu_request else torch.cuda.is_available()
    if cuda:
        c = 1024 ** 2  # bytes to MB
        ng = torch.cuda.device_count()
        if ng > 1 and batch_size:  # check that batch_size is compatible with device_count
            assert batch_size % ng == 0, 'batch-size %g not multiple of GPU count %g' % (batch_size, ng)
        x = [torch.cuda.get_device_properties(i) for i in range(ng)]
        s = f'Using torch {torch.__version__} '
        for i in range(0, ng):
            if i == 1:
                s = ' ' * len(s)
            logger.info("%sCUDA:%g (%s, %dMB)" % (s, i, x[i].name, x[i].total_memory / c))
    else:
        logger.info(f'Using torch {torch.__version__} CPU')

    logger.info('')  # skip a line
    return torch.device('cuda:0' if cuda else 'cpu')

Member

glenn-jocher commented Dec 15, 2020

>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1,2'
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1'
>>> torch.cuda.device_count()
3
>>> del(torch)
>>> import torch
>>> torch.cuda.device_count()
3
>>> torch.cuda.init()
>>> torch.cuda.device_count()
3
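
The count stays at 3 because CUDA_VISIBLE_DEVICES is read only once, when the CUDA runtime is first initialized; later changes to os.environ (and even re-importing torch) have no effect. As a minimal sketch of the pattern that does work, assuming a fresh interpreter on a machine where three GPUs are visible:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # must be set before anything initializes CUDA

import torch
print(torch.cuda.device_count())  # 2 -- the mask applies because no CUDA call ran earlier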

Contributor

NanoCode012 commented Dec 16, 2020

@glenn-jocher I'm testing my environments right now. Both show the same problem. I think something must have happened in the code base, because I have not updated either of my environments in quite a while:

  • py 3.7 and torch 1.6
  • docker py 3.6 and torch 1.7

Edit: Hm, running the above from the terminal produced confusing results. Will go test some more.

@NanoCode012
Contributor

@glenn-jocher , good news(?)! I tracked down the commit that caused this error (by manually stepping through the commits day by day). Luckily, it was only a few days ago: the Profile PR ada90e3.

To reproduce:

# Working on 11th Dec
git checkout -b 11a 94a7f55c4e5cca3dfe4de0bd0793173d5b152ec5
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1

# Error on 12th Dec
git checkout -b 12 ada90e3901da9d24f88e4d1378be96265770a932
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1

I am not sure what exactly in this PR caused this change, because it seems so isolated.

After some tests, it was this: the device is assigned in the function declaration, so the default argument is evaluated once at import time, which initializes CUDA before CUDA_VISIBLE_DEVICES is set. I'll submit a PR for this.

def profile(x, ops, n=100, device=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')):
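
A sketch of the kind of fix this implies (the exact code in the PR may differ): make the default None and resolve the device lazily inside the body, so that merely importing the module no longer initializes CUDA:

def profile(x, ops, n=100, device=None):
    # Resolving the device at call time, rather than in the default argument
    # (which Python evaluates once at import), avoids initializing CUDA before
    # select_device() has had a chance to set CUDA_VISIBLE_DEVICES.
    device = device or torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    ...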

Contributor

NanoCode012 commented Dec 16, 2020

@huangfeng95 @glenn-jocher , could you please test the PR and tell me whether it works for you?

I tested that this change works in both of my environments above. What a nasty bug!

git clone https://github.com/NanoCode012/yolov5.git -b gpu-fix && cd yolov5
python train.py --epochs 3 --device 0 # Single
python train.py --epochs 3 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --device 0,1 # DDP

@glenn-jocher
Member

@NanoCode012 wow!! We should give you a medal for finding this bug!

That's very bizarre. The profile() function is not actually used anywhere during normal repo operation; it's only there for manual use when comparing two modules, for example when developing new architectures. I used it here to evaluate a Focus() module alternative #1274 (comment). It does all the dirty work for you and shows you the timing impact of the forward and backward passes.
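
For context, a hedged sketch of that kind of manual comparison (the modules, shapes, and import paths here are illustrative, matching my reading of the repo layout at the time, not the exact code from #1274):

import torch
from models.common import Conv, Focus
from utils.torch_utils import profile

x = torch.randn(16, 3, 640, 640)  # dummy training-size batch
profile(x, [Focus(3, 64, 3), Conv(3, 64, 3)], n=100)  # times forward/backward for each op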

I'll test the PR.

@glenn-jocher
Member

@huangfeng95 this bug should be resolved now thanks to PR #1701 by @NanoCode012. Please git pull to receive this update and let us know if you have any other problems!

glenn-jocher removed the TODO label Dec 19, 2020