Focus.forward() restructure #4807

Closed · wants to merge 1 commit
Conversation

glenn-jocher (Member) commented Sep 15, 2021


🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Refinement of the Focus module in YOLOv5's neural network architecture.

📊 Key Changes

  • Updated the comment for the Focus class to clarify its function.
  • Simplified the slicing and concatenation operation within the Focus class forward method.

🎯 Purpose & Impact

  • 🎨 Clarification: The updated comment makes the purpose of the Focus module clearer, emphasizing its role in condensing spatial (width/height) information into the channel dimension.
  • 🧠 Code Optimization: Modifying the forward method streamlines the slicing of the input tensor, potentially improving readability and efficiency.
  • 🚀 Potential Impact: Users may experience minor performance improvements during training and inference due to the more efficient code. The update could also enhance maintainability and understanding of the code.

Focus layer restructure to reduce CUDA memory usage and eliminate ops in ONNX and CoreML exports (TFLite unchanged).
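
As an aside (not part of this PR), one rough way to check the "fewer ops in the ONNX export" claim is to load each exported model and count graph nodes; the .onnx file names below are placeholders, not files produced here.

```python
import onnx

# Illustrative only: compare node counts of two exported variants.
# "focus_before.onnx" / "focus_after.onnx" are placeholder paths.
for path in ("focus_before.onnx", "focus_after.onnx"):
    graph = onnx.load(path).graph
    print(f"{path}: {len(graph.node)} nodes")
```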

Before/after profiling code:
```python
import torch
import torch.nn as nn

from models.common import *  # provides Conv
from utils.torch_utils import profile


class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))

class FocusAlternate(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        a, b = x[:, :, ::2], x[:, :, 1::2]
        return self.conv(torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1))


m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)

results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2], n=10, device=0)  # profile both 10 times at batch-size 16
```

Results:
```
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
        7040       23.07         2.280          16.2         50.71       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.21          48.6       (16, 3, 640, 640)      (16, 64, 320, 320)
```
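
For completeness (not in the original PR), a quick check that the two forward() slicing orders produce identical tensors before comparing speed and memory, using plain PyTorch:

```python
import torch

# Both slicing orders gather the same even/odd pixel grids in the same channel order.
x = torch.randn(2, 3, 64, 64)
ref = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1)
a, b = x[:, :, ::2], x[:, :, 1::2]
alt = torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1)
assert torch.equal(ref, alt)
```
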
@glenn-jocher glenn-jocher self-assigned this Sep 15, 2021
glenn-jocher (Member, Author) commented Sep 15, 2021

EDIT: On closer inspection, the apparent improvements seem to be due to a first-instance profiling bug: the first module profiled always uses more resources, and subsequent testing shows no difference between the two options. Setting this aside for now, as it needs more investigation.

```python
m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2, m1, m2], n=10, device=0)  # profile each module twice, 10 iterations each, at batch size 16
```

```
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
        7040       23.07         2.259          16.5         52.84       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.19         47.94       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.25         46.27       (16, 3, 640, 640)      (16, 64, 320, 320)
        7040       23.07         1.919         15.32         47.99       (16, 3, 640, 640)      (16, 64, 320, 320)
```
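
A possible workaround for the first-instance bias (an untested sketch, not something applied here, reusing Focus, FocusAlternate, and profile from the snippets above): profile a throwaway warm-up instance first and ignore its row, so the startup cost does not land on either candidate.

```python
m0 = Focus(3, 64, 3)   # throwaway warm-up instance; discard its row in the results
m1 = Focus(3, 64, 3)   # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m0, m1, m2], n=10, device=0)
```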

iceisfun commented Sep 27, 2021

I have noticed containers failing to start when we are close to the CUDA memory limit, so we try to keep a little extra headroom.

These are 4 different models across 4 RTX Titan cards, and each process ends up really close to 1450 MiB once everything has started and is processing frames.

Reduced memory would be interesting, especially with MIG carving up A100 memory. I have even considered just buying the 80 GB version of the A100 for future machines so that memory will not be an issue.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     19814      C   python                           1441MiB |
|    0   N/A  N/A     19939      C   python                           1441MiB |
|    0   N/A  N/A     20099      C   python                           1441MiB |
|    0   N/A  N/A     20272      C   python                           1441MiB |
|    0   N/A  N/A     22363      C   python                           1443MiB |
|    0   N/A  N/A     22524      C   python                           1443MiB |
|    0   N/A  N/A     22682      C   python                           1443MiB |
|    0   N/A  N/A     22840      C   python                           1443MiB |
|    0   N/A  N/A     24895      C   python                           1441MiB |
|    0   N/A  N/A     25046      C   python                           1441MiB |
|    0   N/A  N/A     25205      C   python                           1441MiB |
|    0   N/A  N/A     25355      C   python                           1441MiB |
|    0   N/A  N/A     27459      C   python                           1445MiB |
|    0   N/A  N/A     27613      C   python                           1441MiB |
|    0   N/A  N/A     27770      C   python                           1441MiB |
|    0   N/A  N/A     27931      C   python                           1441MiB |
|    1   N/A  N/A     20429      C   python                           1441MiB |
|    1   N/A  N/A     20589      C   python                           1441MiB |
|    1   N/A  N/A     20741      C   python                           1441MiB |
|    1   N/A  N/A     20902      C   python                           1445MiB |
|    1   N/A  N/A     22998      C   python                           1447MiB |
|    1   N/A  N/A     23159      C   python                           1447MiB |
|    1   N/A  N/A     23317      C   python                           1447MiB |
|    1   N/A  N/A     23477      C   python                           1447MiB |
|    1   N/A  N/A     25560      C   python                           1441MiB |
|    1   N/A  N/A     25736      C   python                           1441MiB |
|    1   N/A  N/A     25893      C   python                           1441MiB |
|    1   N/A  N/A     26057      C   python                           1445MiB |
|    1   N/A  N/A     28092      C   python                           1441MiB |
|    1   N/A  N/A     28257      C   python                           1441MiB |
|    1   N/A  N/A     28414      C   python                           1441MiB |
|    1   N/A  N/A     28562      C   python                           1441MiB |
|    2   N/A  N/A     21062      C   python                           1445MiB |
|    2   N/A  N/A     21223      C   python                           1445MiB |
|    2   N/A  N/A     21385      C   python                           1445MiB |
|    2   N/A  N/A     21539      C   python                           1445MiB |
|    2   N/A  N/A     23630      C   python                           1447MiB |
|    2   N/A  N/A     23794      C   python                           1447MiB |
|    2   N/A  N/A     23950      C   python                           1443MiB |
|    2   N/A  N/A     24106      C   python                           1443MiB |
|    2   N/A  N/A     26203      C   python                           1445MiB |
|    2   N/A  N/A     26359      C   python                           1445MiB |
|    2   N/A  N/A     26511      C   python                           1445MiB |
|    2   N/A  N/A     26664      C   python                           1445MiB |
|    2   N/A  N/A     28730      C   python                           1441MiB |
|    2   N/A  N/A     28874      C   python                           1441MiB |
|    2   N/A  N/A     29016      C   python                           1441MiB |
|    2   N/A  N/A     29180      C   python                           1441MiB |
|    3   N/A  N/A     21702      C   python                           1445MiB |
|    3   N/A  N/A     21862      C   python                           1441MiB |
|    3   N/A  N/A     22046      C   python                           1441MiB |
|    3   N/A  N/A     22200      C   python                           1441MiB |
|    3   N/A  N/A     24256      C   python                           1443MiB |
|    3   N/A  N/A     24419      C   python                           1443MiB |
|    3   N/A  N/A     24572      C   python                           1443MiB |
|    3   N/A  N/A     24729      C   python                           1443MiB |
|    3   N/A  N/A     26832      C   python                           1445MiB |
|    3   N/A  N/A     26982      C   python                           1441MiB |
|    3   N/A  N/A     27144      C   python                           1441MiB |
|    3   N/A  N/A     29318      C   python                           1445MiB |
|    3   N/A  N/A     29482      C   python                           1445MiB |
|    3   N/A  N/A     29627      C   python                           1445MiB |
|    3   N/A  N/A     29790      C   python                           1445MiB |
+-----------------------------------------------------------------------------+
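
On the headroom point, a minimal startup check is sketched below (an assumption-laden example, not from this thread): it uses torch.cuda.mem_get_info, available in recent PyTorch releases, and an arbitrary 1.5 GiB threshold to fail fast instead of hitting the CUDA limit mid-stream.

```python
import torch

REQUIRED_BYTES = int(1.5 * 1024 ** 3)  # example threshold; tune per model

def has_headroom(device: int = 0, required: int = REQUIRED_BYTES) -> bool:
    # mem_get_info wraps cudaMemGetInfo and returns (free, total) in bytes
    free, _total = torch.cuda.mem_get_info(device)
    return free >= required

if not torch.cuda.is_available() or not has_headroom():
    raise SystemExit("Refusing to start: insufficient free CUDA memory")
```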

glenn-jocher (Member, Author) commented:

@iceisfun unfortunately the above memory savings are due to a bug in our profile() function where the first run always uses slightly more resources.

But I've got some good news on the memory front: we are working on some small architecture updates for our upcoming v6.0 release, scheduled for October 12th. We're still running tests, but so far one of these changes is showing slightly reduced CUDA memory usage.

@glenn-jocher glenn-jocher deleted the update/Focus branch October 7, 2021 01:20