Focus.forward() restructure #4807
Focus layer restructure to reduce CUDA memory usage and eliminate ops in ONNX and CoreML exports (TFLite unchanged).

Before/after profiling code:
```python
class Focus(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))


class FocusAlternate(nn.Module):
    # Focus wh information into c-space
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super().__init__()
        self.conv = Conv(c1 * 4, c2, k, s, p, g, act)

    def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
        a, b = x[:, :, ::2], x[:, :, 1::2]
        return self.conv(torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1))


m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2], n=10, device=0)  # profile both 10 times at batch-size 16
```

Results:
```python
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms)  backward (ms)                  input                 output
        7040       23.07         2.280          16.2          50.71      (16, 3, 640, 640)     (16, 64, 320, 320)
        7040       23.07         1.919         15.21           48.6      (16, 3, 640, 640)     (16, 64, 320, 320)
```
EDIT: On closer inspection, the improvements appear to be due to a first-instance profiling bug: the first module profiled always seems to use more resources, and subsequent testing shows no difference between the two options. Setting this aside for now, as it needs more investigation.
```python
m1 = Focus(3, 64, 3)  # YOLOv5 Focus layer
m2 = FocusAlternate(3, 64, 3)
results = profile(input=torch.randn(16, 3, 640, 640), ops=[m1, m2, m1, m2], n=10, device=0)  # profile both 10 times at batch-size 16
```
```python
YOLOv5 🚀 v5.0-433-g621b6d5 torch 1.9.0+cu102 CUDA:0 (Tesla T4, 15109.75MB)

      Params      GFLOPs  GPU_mem (GB)  forward (ms)  backward (ms)                  input                 output
        7040       23.07         2.259          16.5          52.84      (16, 3, 640, 640)     (16, 64, 320, 320)
        7040       23.07         1.919         15.19          47.94      (16, 3, 640, 640)     (16, 64, 320, 320)
        7040       23.07         1.919         15.25          46.27      (16, 3, 640, 640)     (16, 64, 320, 320)
        7040       23.07         1.919         15.32          47.99      (16, 3, 640, 640)     (16, 64, 320, 320)
```
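One common way to neutralize the first-instance bias described above is to run warmup iterations and reset CUDA memory stats before measuring. A hedged sketch (not the actual YOLOv5 `profile()` implementation):

```python
import torch

def timed_forward(m, x, n=10, warmup=3):
    # Sketch only: warm up to absorb one-time CUDA costs (context init,
    # cuDNN autotune, allocator growth), then time n forward passes.
    m, x = m.cuda(), x.cuda()
    with torch.no_grad():
        for _ in range(warmup):
            m(x)
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(n):
            m(x)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n, torch.cuda.max_memory_allocated() / 1e9  # ms/iter, peak GB
```

With a warmup like this, both variants should report the same latency and memory, consistent with the repeated-ops table above.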
I have noticed containers failing to start when we are close to the CUDA memory limit, so we try to keep a little extra headroom. These are 4 different models across 4 RTX Titan cards, and we end up really close to 1450 after everything is started and processing frames. Reduced memory would be interesting, especially with MIG cutting up A100 memory; I have even considered just buying the 80GB version of the A100 for future machines so memory will not be an issue.
@iceisfun unfortunately the above memory savings are due to a bug in our profile() function where the first run always uses slightly more resources. But I've got some good news on the memory. We are working on some small updates to the architecture in our upcoming v6.0 release scheduled for October 12th. We're still running tests, but so far one of these changes is showing slightly reduced CUDA memory usage.
🛠️ PR Summary
Made with ❤️ by Ultralytics Actions
🌟 Summary
Refinement of the Focus module in YOLOv5's neural network architecture.
📊 Key Changes
- Restructured tensor slicing in the `forward` method.

🎯 Purpose & Impact
- The updated `forward` method streamlines the slicing of the input tensor, potentially improving readability and efficiency.
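For reference, the restructured `forward` from the PR body above; slicing the rows once into two views lets the four sub-sampled patches reuse them:

```python
def forward(self, x):  # x(b,c,w,h) -> y(b,4c,w/2,h/2)
    # Slice even/odd rows once, then take even/odd columns from each view,
    # yielding the same four patches as the original four 2D slices
    a, b = x[:, :, ::2], x[:, :, 1::2]
    return self.conv(torch.cat([a[..., ::2], b[..., ::2], a[..., 1::2], b[..., 1::2]], 1))
```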