
CUDA multi-GPU --device bug #1695

Closed
huangfeng95 opened this issue Dec 15, 2020 · 9 comments · Fixed by #1701
Assignees: glenn-jocher
Labels: bug (Something isn't working), question (Further information is requested)

Comments

huangfeng95 commented Dec 15, 2020

❔Question

I specified the --device parameter but it does not work.
Could you please tell me how to troubleshoot this?

Additional context

[screenshot attached]

huangfeng95 added the question (Further information is requested) label Dec 15, 2020
Contributor

github-actions bot commented Dec 15, 2020

Hello @huangfeng95, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

Member

glenn-jocher commented Dec 15, 2020

@huangfeng95 yes, it seems there is a bug in our --device selection method with more recent versions of torch. The current workaround is to define the visible devices before running a command:

$ CUDA_VISIBLE_DEVICES=2,3 python train.py ...

or 

$ export CUDA_VISIBLE_DEVICES=2,3
$ python train.py ...

TODO: CUDA multi-gpu --device bug.

glenn-jocher changed the title from "It seems that i could not choose the specific GPU" to "CUDA multi-gpu --device bug" Dec 15, 2020
glenn-jocher added the bug (Something isn't working) and TODO labels Dec 15, 2020
glenn-jocher self-assigned this Dec 15, 2020
glenn-jocher changed the title from "CUDA multi-gpu --device bug" to "CUDA multi-GPU --device bug" Dec 15, 2020
@glenn-jocher
Member

@NanoCode012 do you have any insight into this bug? The only thing I can think of is that the torch functionality changed in a recent release, and our current method of setting CUDA_VISIBLE_DEVICES right before checking devices no longer works. The relevant function is below; I'm short on ideas for a fix.

def select_device(device='', batch_size=None):
    # device = 'cpu' or '0' or '0,1,2,3'
    cpu_request = device.lower() == 'cpu'
    if device and not cpu_request:  # if device requested other than 'cpu'
        os.environ['CUDA_VISIBLE_DEVICES'] = device  # set environment variable
        assert torch.cuda.is_available(), 'CUDA unavailable, invalid device %s requested' % device  # check availability

    cuda = False if cpu_request else torch.cuda.is_available()
    if cuda:
        c = 1024 ** 2  # bytes to MB
        ng = torch.cuda.device_count()
        if ng > 1 and batch_size:  # check that batch_size is compatible with device_count
            assert batch_size % ng == 0, 'batch-size %g not multiple of GPU count %g' % (batch_size, ng)
        x = [torch.cuda.get_device_properties(i) for i in range(ng)]
        s = f'Using torch {torch.__version__} '
        for i in range(0, ng):
            if i == 1:
                s = ' ' * len(s)
            logger.info("%sCUDA:%g (%s, %dMB)" % (s, i, x[i].name, x[i].total_memory / c))
    else:
        logger.info(f'Using torch {torch.__version__} CPU')

    logger.info('')  # skip a line
    return torch.device('cuda:0' if cuda else 'cpu')

Member

glenn-jocher commented Dec 15, 2020

>>> import os
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1,2'
>>> os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'
>>> os.environ['CUDA_VISIBLE_DEVICES']
'0,1'
>>> torch.cuda.device_count()
3
>>> del(torch)
>>> import torch
>>> torch.cuda.device_count()
3
>>> torch.cuda.init()
>>> torch.cuda.device_count()
3
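
The count stays at 3 because CUDA_VISIBLE_DEVICES is read only once, when the CUDA runtime is first initialized; later changes to os.environ (and even re-importing torch) have no effect. As a minimal sketch of the pattern that does work, assuming a fresh interpreter on a machine where three GPUs are visible:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # must be set before anything initializes CUDA

import torch
print(torch.cuda.device_count())  # 2 -- the mask applies because no CUDA call ran earlier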

Contributor

NanoCode012 commented Dec 16, 2020

@glenn-jocher I'm testing my environments right now. Both show the same problem. I think something must have happened in the code base, because I have not updated either of my environments in quite a while:

  • py 3.7 and torch 1.6
  • docker py 3.6 and torch 1.7

Edit: Hm, running the above from the terminal produced confusing results. Will go test some more.

@NanoCode012
Contributor

@glenn-jocher , good news(?)! I tracked down the commit that caused this error (by manually stepping through the commits day by day). Luckily, it was only a few days ago: the Profile PR ada90e3.

To reproduce:

# Working on 11th Dec
git checkout -b 11a 94a7f55c4e5cca3dfe4de0bd0793173d5b152ec5
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1

# Error on 12th Dec
git checkout -b 12 ada90e3901da9d24f88e4d1378be96265770a932
python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt --nosave --cache --device 0,1

I am not sure what exactly in this PR caused this change, because it seems so isolated.

After some tests, it was this: the device is assigned in the function declaration, so the default argument is evaluated once at import time, which initializes CUDA before CUDA_VISIBLE_DEVICES is set. I'll submit a PR for this.

def profile(x, ops, n=100, device=torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')):
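
A sketch of the kind of fix this implies (the exact code in the PR may differ): make the default None and resolve the device lazily inside the body, so that merely importing the module no longer initializes CUDA:

def profile(x, ops, n=100, device=None):
    # Resolving the device at call time, rather than in the default argument
    # (which Python evaluates once at import), avoids initializing CUDA before
    # select_device() has had a chance to set CUDA_VISIBLE_DEVICES.
    device = device or torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    ...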

Contributor

NanoCode012 commented Dec 16, 2020

@huangfeng95 @glenn-jocher , could you please test the PR and tell me whether it works for you?

I tested that this change works in both of my environments above. What a nasty bug!

git clone https://github.com/NanoCode012/yolov5.git -b gpu-fix && cd yolov5
python train.py --epochs 3 --device 0 # Single
python train.py --epochs 3 --device 0,1 # DP
python -m torch.distributed.launch --nproc_per_node 2 train.py --device 0,1 # DDP

@glenn-jocher
Member

@NanoCode012 wow!! We should give you a medal for finding this bug!

That's very bizarre. The profile() function is not actually used anywhere during normal repo operation; it's only there for manual use when comparing two modules, for example when developing new architectures. I used it here to evaluate a Focus() module alternative #1274 (comment). It does all the dirty work for you and shows you the timing impact of the forward and backward passes.
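
For context, a hedged sketch of that kind of manual comparison (the modules, shapes, and import paths here are illustrative, matching my reading of the repo layout at the time, not the exact code from #1274):

import torch
from models.common import Conv, Focus
from utils.torch_utils import profile

x = torch.randn(16, 3, 640, 640)  # dummy training-size batch
profile(x, [Focus(3, 64, 3), Conv(3, 64, 3)], n=100)  # times forward/backward for each op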

I'll test the PR.

@glenn-jocher
Member

@huangfeng95 this bug should be resolved now thanks to PR #1701 by @NanoCode012. Please git pull to receive this update and let us know if you have any other problems!

glenn-jocher removed the TODO label Dec 19, 2020