Focus.forward() restructure #4807

Closed · wants to merge 1 commit
Conversation

glenn-jocher (Member) commented Sep 15, 2021


🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Refinement of the Focus module in YOLOv5's neural network architecture.

📊 Key Changes

  • Updated the comment for the Focus class to clarify its function.
  • Simplified the slicing and concatenation operation within the Focus class forward method.

🎯 Purpose & Impact

  • 🎨 Clarification: The updated comment makes the purpose of the Focus module clearer, emphasizing its role in condensing spatial (width/height) information into the channel dimension.
  • 🧠 Code Optimization: Modifying the forward method streamlines the slicing of the input tensor, potentially improving readability and efficiency.
  • 🚀 Potential Impact: Users may experience minor performance improvements during training and inference due to the more efficient code. The update could also enhance maintainability and understanding of the code.

Focus layer restructure to reduce CUDA memory usage and eliminate ops in ONNX and CoreML exports (TFLite unchanged).
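
As an aside (not part of this PR), one rough way to check the "fewer ops in the ONNX export" claim is to load each exported model and count graph nodes; the .onnx file names below are placeholders, not files produced here.

```python
import onnx

# Illustrative only: compare node counts of two exported variants.
# "focus_before.onnx" / "focus_after.onnx" are placeholder paths.
for path in ("focus_before.onnx", "focus_after.onnx"):
    graph = onnx.load(path).graph
    print(f"{path}: {len(graph.node)} nodes")
```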

Before/after profiling code:
```python
import torch
import torch.nn as nn

from models.common import *  # provides Conv
from utils.torch_utils import profile


class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))

class FocusAlternate(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        a, b = x[:, :, ::2], x[:, :, 1::2]
        return self.conv(torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1))


m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)

results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2], n=10, device=0)  # profile both 10 times at batch-size 16
```

Results:
```
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
        7040       23.07         2.280          16.2         50.71       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.21          48.6       (16, 3, 640, 640)      (16, 64, 320, 320)
```
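
For completeness (not in the original PR), a quick check that the two forward() slicing orders produce identical tensors before comparing speed and memory, using plain PyTorch:

```python
import torch

# Both slicing orders gather the same even/odd pixel grids in the same channel order.
x = torch.randn(2, 3, 64, 64)
ref = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1)
a, b = x[:, :, ::2], x[:, :, 1::2]
alt = torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1)
assert torch.equal(ref, alt)
```
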
@glenn-jocher glenn-jocher self-assigned this Sep 15, 2021
glenn-jocher (Member, Author) commented Sep 15, 2021

EDIT: On closer inspection, the apparent improvements seem to be due to a first-instance profiling bug: the first module profiled always uses more resources, and subsequent testing shows no difference between the two options. Setting this aside for now, as it needs more investigation.

```python
m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2, m1, m2], n=10, device=0)  # profile each module twice, 10 iterations each, at batch size 16
```

```
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
        7040       23.07         2.259          16.5         52.84       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.19         47.94       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.25         46.27       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.32         47.99       (16, 3, 640, 640)      (16, 64, 320, 320)
```
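
A possible workaround for the first-instance bias (an untested sketch, not something applied here, reusing Focus, FocusAlternate, and profile from the snippets above): profile a throwaway warm-up instance first and ignore its row, so the startup cost does not land on either candidate.

```python
m0 = Focus(3, 64, 3)   # throwaway warm-up instance; discard its row in the results
m1 = Focus(3, 64, 3)   # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m0, m1, m2], n=10, device=0)
```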

iceisfun commented Sep 27, 2021

I have noticed containers failing to start when we are close to the CUDA memory limit, so we try to keep a little extra headroom.

These are 4 different models across 4 RTX Titan cards, and each process ends up really close to 1450 MiB once everything has started and is processing frames.

Reduced memory would be interesting, especially with MIG carving up A100 memory. I have even considered just buying the 80 GB version of the A100 for future machines so that memory will not be an issue.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19814      C   python                           1441MiB |
|    0   N/A  N/A     19939      C   python                           1441MiB |
|    0   N/A  N/A     20099      C   python                           1441MiB |
|    0   N/A  N/A     20272      C   python                           1441MiB |
|    0   N/A  N/A     22363      C   python                           1443MiB |
|    0   N/A  N/A     22524      C   python                           1443MiB |
|    0   N/A  N/A     22682      C   python                           1443MiB |
|    0   N/A  N/A     22840      C   python                           1443MiB |
|    0   N/A  N/A     24895      C   python                           1441MiB |
|    0   N/A  N/A     25046      C   python                           1441MiB |
|    0   N/A  N/A     25205      C   python                           1441MiB |
|    0   N/A  N/A     25355      C   python                           1441MiB |
|    0   N/A  N/A     27459      C   python                           1445MiB |
|    0   N/A  N/A     27613      C   python                           1441MiB |
|    0   N/A  N/A     27770      C   python                           1441MiB |
|    0   N/A  N/A     27931      C   python                           1441MiB |
|    1   N/A  N/A     20429      C   python                           1441MiB |
|    1   N/A  N/A     20589      C   python                           1441MiB |
|    1   N/A  N/A     20741      C   python                           1441MiB |
|    1   N/A  N/A     20902      C   python                           1445MiB |
|    1   N/A  N/A     22998      C   python                           1447MiB |
|    1   N/A  N/A     23159      C   python                           1447MiB |
|    1   N/A  N/A     23317      C   python                           1447MiB |
|    1   N/A  N/A     23477      C   python                           1447MiB |
|    1   N/A  N/A     25560      C   python                           1441MiB |
|    1   N/A  N/A     25736      C   python                           1441MiB |
|    1   N/A  N/A     25893      C   python                           1441MiB |
|    1   N/A  N/A     26057      C   python                           1445MiB |
|    1   N/A  N/A     28092      C   python                           1441MiB |
|    1   N/A  N/A     28257      C   python                           1441MiB |
|    1   N/A  N/A     28414      C   python                           1441MiB |
|    1   N/A  N/A     28562      C   python                           1441MiB |
|    2   N/A  N/A     21062      C   python                           1445MiB |
|    2   N/A  N/A     21223      C   python                           1445MiB |
|    2   N/A  N/A     21385      C   python                           1445MiB |
|    2   N/A  N/A     21539      C   python                           1445MiB |
|    2   N/A  N/A     23630      C   python                           1447MiB |
|    2   N/A  N/A     23794      C   python                           1447MiB |
|    2   N/A  N/A     23950      C   python                           1443MiB |
|    2   N/A  N/A     24106      C   python                           1443MiB |
|    2   N/A  N/A     26203      C   python                           1445MiB |
|    2   N/A  N/A     26359      C   python                           1445MiB |
|    2   N/A  N/A     26511      C   python                           1445MiB |
|    2   N/A  N/A     26664      C   python                           1445MiB |
|    2   N/A  N/A     28730      C   python                           1441MiB |
|    2   N/A  N/A     28874      C   python                           1441MiB |
|    2   N/A  N/A     29016      C   python                           1441MiB |
|    2   N/A  N/A     29180      C   python                           1441MiB |
|    3   N/A  N/A     21702      C   python                           1445MiB |
|    3   N/A  N/A     21862      C   python                           1441MiB |
|    3   N/A  N/A     22046      C   python                           1441MiB |
|    3   N/A  N/A     22200      C   python                           1441MiB |
|    3   N/A  N/A     24256      C   python                           1443MiB |
|    3   N/A  N/A     24419      C   python                           1443MiB |
|    3   N/A  N/A     24572      C   python                           1443MiB |
|    3   N/A  N/A     24729      C   python                           1443MiB |
|    3   N/A  N/A     26832      C   python                           1445MiB |
|    3   N/A  N/A     26982      C   python                           1441MiB |
|    3   N/A  N/A     27144      C   python                           1441MiB |
|    3   N/A  N/A     29318      C   python                           1445MiB |
|    3   N/A  N/A     29482      C   python                           1445MiB |
|    3   N/A  N/A     29627      C   python                           1445MiB |
|    3   N/A  N/A     29790      C   python                           1445MiB |
+-----------------------------------------------------------------------------+
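
On the headroom point, a minimal startup check is sketched below (an assumption-laden example, not from this thread): it uses torch.cuda.mem_get_info, available in recent PyTorch releases, and an arbitrary 1.5 GiB threshold to fail fast instead of hitting the CUDA limit mid-stream.

```python
import torch

REQUIRED_BYTES = int(1.5 * 1024 ** 3)  # example threshold; tune per model

def has_headroom(device: int = 0, required: int = REQUIRED_BYTES) -> bool:
    # mem_get_info wraps cudaMemGetInfo and returns (free, total) in bytes
    free, _total = torch.cuda.mem_get_info(device)
    return free >= required

if not torch.cuda.is_available() or not has_headroom():
    raise SystemExit("Refusing to start: insufficient free CUDA memory")
```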

glenn-jocher (Member, Author) commented:

@iceisfun unfortunately the above memory savings are due to a bug in our profile() function where the first run always uses slightly more resources.

But I've got some good news on the memory front: we are working on some small architecture updates for our upcoming v6.0 release, scheduled for October 12th. We're still running tests, but so far one of these changes is showing slightly reduced CUDA memory usage.

@glenn-jocher glenn-jocher deleted the update/Focus branch October 7, 2021 01:20