
A new activation function ACON that is very simple and effective !! #2891

Closed
nmaac opened this issue Apr 22, 2021 · 26 comments · Fixed by #2893 or #2901
Labels
enhancement New feature or request Stale

Comments

@nmaac

nmaac commented Apr 22, 2021

🚀 Feature

There is a new activation function ACON (CVPR 2021) that unifies ReLU and Swish.
ACON is simple but very effective, code is here: https://github.com/nmaac/acon/blob/main/acon.py#L19

[image]
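(For reference, the ACON-C form from the paper, which also appears in the code later in this thread, is (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x; with p1 = 1 and p2 = 0 it reduces to Swish x*sigmoid(beta*x), and as beta grows large it approaches ReLU.)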

The improvements are very significant:
[image]

Motivation

Pitch

Since SiLU (Swish) is used in your project, I would like to suggest replacing it with ACON directly; as a more general and effective form of Swish, ACON may also show improvements here.

Alternatives

It also has an enhanced version, meta-ACON, that uses a small network to learn beta explicitly, which may affect speed slightly.

Additional context

Code and paper.

@nmaac nmaac added the enhancement New feature or request label Apr 22, 2021
@github-actions
Contributor

github-actions bot commented Apr 22, 2021

👋 Hello @nmaac, thank you for your interest in 🚀 YOLOv5! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we can not help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://www.ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.7. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

@nmaac thanks for the idea, looks promising! Any object detection results so far?

@nmaac
Author

nmaac commented Apr 22, 2021

There are some detection results:

[image]

I did not test it on YOLOv5, but it seems to have the potential to make nearly cost-free improvements by simply replacing SiLU.

@glenn-jocher
Member

glenn-jocher commented Apr 22, 2021

@nmaac ah great, thank you! Yes, this is quite a significant improvement in your Table 9. Which ACON version would you recommend we try, and what values for p1, p2, beta?

  • meta-ACON
  • ACON-A
  • ACON-B
  • ACON-C

The right place to include a new activation would be utils/activations, and then the place to swap out nn.SiLU() for a new activation is here on L39 of models/common.py

yolov5/models/common.py

Lines 33 to 43 in d48a34d

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
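For example, assuming the new activation takes the output channel count c2 as its only constructor argument (as the reference AconC implementation does), the swapped line might look like:

self.act = AconC(c2) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())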

@nmaac
Author

nmaac commented Apr 22, 2021

I would like to suggest ACON-C, which improves accuracy with negligible overhead.

You can use the code directly:

https://github.com/nmaac/acon/blob/8782b65f5d7b3523f656beceb586b54d04019705/acon.py#L4-19

@glenn-jocher glenn-jocher linked a pull request Apr 22, 2021 that will close this issue
@glenn-jocher glenn-jocher reopened this Apr 22, 2021
@glenn-jocher
Member

glenn-jocher commented Apr 22, 2021

@nmaac @ilem777 I've added AconC to our activations study here:
https://wandb.ai/glenn-jocher/activations

I just started runs with AconC(), MetaAconC() and FReLU(); you can track their progress live at the link above. Training time will be about 3 days. I tried MetaAconC but ran into issues: the nn.BatchNorm2d(16) layers produced errors on inputs of size (1, 16, 1, 1), so perhaps I implemented the function incorrectly.

@glenn-jocher
Member

@AyushExel I spotted something concerning that I was hoping you could look at. When runs are public, like the activation study above, the 'stop run' button appears to work even when the visitor is incognito / not signed in.

@AyushExel
Contributor

@glenn-jocher thanks for reporting this. I'll check whether the button actually stops the runs for non-authorized users. If it does, it's a very bad bug; otherwise it's just a minor frontend bug. I'll file a ticket to get this fixed.

@glenn-jocher glenn-jocher linked a pull request Apr 22, 2021 that will close this issue
@WongKinYiu

It's because nn.BatchNorm2d needs batch size > 1 when training.
The simplest way to solve the problem is to change this line to:

            m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(2, ch, s, s))])  # forward
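For context, a quick standalone check (purely illustrative, using a 16-channel layer as in the MetaAconC bottleneck) of why a batch-size-1, 1x1-spatial forward pass fails in training mode but works with batch size 2 or in eval mode:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)

bn.train()
try:
    bn(torch.zeros(1, 16, 1, 1))  # only one value per channel -> fails in training mode
except ValueError as e:
    print(e)  # "Expected more than 1 value per channel when training ..."

bn(torch.zeros(2, 16, 1, 1))  # batch size 2 works in training mode

bn.eval()
bn(torch.zeros(1, 16, 1, 1))  # batch size 1 is fine in eval mode (uses running stats)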

@nmaac
Author

nmaac commented Apr 23, 2021

@glenn-jocher you can simply remove the two BN layers in MetaAconC, which does not affect the accuracy much.

@glenn-jocher
Member

@nmaac oh, I think I misunderstood before. I think you mean to remove self.bn1 and self.bn2 completely from the MetaAconC() module for all batch-sizes?

@glenn-jocher
Member

@WongKinYiu yes, this is a good solution too, though it will make model creation a bit slower for all other models. Are the nn.BatchNorm2d() layers OK for batch-size 1 inference?

@glenn-jocher
Member

@WongKinYiu @nmaac I'm curious: looking at the ACON implementation, have you guys tried simply training SiLU with a learnable beta? I've never done this before. nn.SiLU() does not allow this, but I think I might try testing it with a custom SiLU to see how this affects the results.
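A minimal sketch of what such a custom SiLU with a learnable beta might look like (the SiLUBeta name and the per-channel beta shape are just illustrative, mirroring the beta shape used in AconC):

import torch
import torch.nn as nn

class SiLUBeta(nn.Module):
    # Swish with a learnable per-channel beta: x * sigmoid(beta * x)
    # beta is initialized to 1 so the module starts out identical to nn.SiLU()
    def __init__(self, c1):
        super().__init__()
        self.beta = nn.Parameter(torch.ones(1, c1, 1, 1))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)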

@WongKinYiu

nn.BatchNorm2d() layers can do batch-size 1 inference.
Or InstanceNorm is another choice.

@nmaac
Author

nmaac commented Apr 25, 2021

@glenn-jocher SiLU with beta does not show benefits; in the paper, Swish and Swish-1 (beta fixed at 1) show comparable results. Specifically:
Swish: x*sigmoid(beta*x)
Swish-1 (SiLU): x*sigmoid(x)

Therefore meta-ACON learns beta explicitly via a small network, which is what shows the improvements.

@glenn-jocher
Member

@nmaac understood, thanks for the explanation. I had to completely remove the BN layers from MetaAconC, otherwise instabilities appeared in training (two 'STOPPED' runs below). Results should be done in about a day, but based on the current trends it doesn't initially seem like I was able to produce better results with either AconC or MetaAconC. The best-performing activation in the study by far was FReLU, though this should be taken with a grain of salt, as FReLU really blurs the line between an activation and a convolution layer. Due to the added parameters and FLOPs, I would also assume FReLU disproportionately improves smaller models like YOLOv5s, with unclear benefit for larger models like YOLOv5x6, which may necessitate a second study in the future.
https://wandb.ai/glenn-jocher/activations

@nmaac
Author

nmaac commented Apr 26, 2021

@glenn-jocher Yes the curves show comparable results. Which activations did you change? Did you pre-train the backbone? In my experiments I usually change the activations in the backbone and pre-train the backbone on ImageNet first. I can help with the pre-training if needed :)

@WongKinYiu

I think the main reason is that @glenn-jocher forgot to add p1, p2, and beta to the no-decay optimizer group:
https://github.com/nmaac/acon/blob/8782b65f5d7b3523f656beceb586b54d04019705/ACON/ResNet_ACON/utils.py#L82

@glenn-jocher
Member

glenn-jocher commented Apr 26, 2021

@nmaac well that's a good question: should the activation function parameters be exempt from weight decay? We use the following parameter groups to exempt .bias parameters and BatchNorm layers from weight decay, so at the moment only the fc1 and fc2 biases are exempt from decay.

yolov5/train.py

Lines 115 to 123 in 1849916

pg0, pg1, pg2 = [], [], []  # optimizer parameter groups
for k, v in model.named_modules():
    if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):
        pg2.append(v.bias)  # biases
    if isinstance(v, nn.BatchNorm2d):
        pg0.append(v.weight)  # no decay
    elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):
        pg1.append(v.weight)  # apply decay
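A hedged sketch of one way the ACON parameters could be routed into the no-decay group pg0, if we decide to exempt them (an extra loop added after the one above; purely illustrative, not what the current branch does):

from utils.activations import AconC, MetaAconC

for k, v in model.named_modules():
    if isinstance(v, (AconC, MetaAconC)):
        pg0.append(v.p1)  # exempt p1 from weight decay
        pg0.append(v.p2)  # exempt p2 from weight decay
        if isinstance(v, AconC):
            pg0.append(v.beta)  # AconC has an explicit beta; MetaAconC generates beta via fc1/fc2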

The activation function implementations are all in utils/activations:

# ACON https://arxiv.org/pdf/2009.04759.pdf ----------------------------------------------------------------------------
class AconC(nn.Module):
    r""" ACON activation (activate or not).
    AconC: (p1*x-p2*x) * sigmoid(beta*(p1*x-p2*x)) + p2*x, beta is a learnable parameter
    according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, c1):
        super().__init__()
        self.p1 = nn.Parameter(torch.randn(1, c1, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, c1, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, c1, 1, 1))

    def forward(self, x):
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(self.beta * dpx) + self.p2 * x


class MetaAconC(nn.Module):
    r""" ACON activation (activate or not).
    MetaAconC: (p1*x-p2*x) * sigmoid(beta*(p1*x-p2*x)) + p2*x, beta is generated by a small network
    according to "Activate or Not: Learning Customized Activation" <https://arxiv.org/pdf/2009.04759.pdf>.
    """

    def __init__(self, c1, k=1, s=1, r=16):  # ch_in, kernel, stride, r
        super().__init__()
        c2 = max(r, c1 // r)
        self.p1 = nn.Parameter(torch.randn(1, c1, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, c1, 1, 1))
        self.fc1 = nn.Conv2d(c1, c2, k, s, bias=True)
        self.fc2 = nn.Conv2d(c2, c1, k, s, bias=True)
        # self.bn1 = nn.BatchNorm2d(c2)
        # self.bn2 = nn.BatchNorm2d(c1)

    def forward(self, x):
        y = x.mean(dim=2, keepdims=True).mean(dim=3, keepdims=True)
        # batch-size 1 bug/instabilities https://github.com/ultralytics/yolov5/issues/2891
        # beta = torch.sigmoid(self.bn2(self.fc2(self.bn1(self.fc1(y)))))  # bug/unstable
        beta = torch.sigmoid(self.fc2(self.fc1(y)))  # bug patch: BN layers removed
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(beta * dpx) + self.p2 * x

The activations_study branch used in this study replaces all activations in the YOLOv5 model by redefining self.act here:

yolov5/models/common.py

Lines 34 to 55 in c9c95fb

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # self.act = nn.Identity() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.Tanh() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.Sigmoid() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.ReLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.LeakyReLU(0.1) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.Hardswish() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = Mish() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = AconC() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = MetaAconC() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        # self.act = SiLU_beta() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
        self.act = MetaAconC(c2) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

All YOLOv5 models are trained from scratch to 300 epochs using all default settings. Training commands are shown in the W&B link to reproduce (COCO dataset autodownloads):

train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --epochs 300 --img 640 --project activations --name yolov5s-MetaAconC_noBN --device 0

@developer0hye
Contributor

@glenn-jocher
Is there any progress on this issue?

@glenn-jocher
Member

@developer0hye well I'm not sure. The ACON authors @nmaac didn't answer my question of whether we should exempt some of the ACON parameters from weight decay. The current results are here for all the activations on YOLOv5s: https://wandb.ai/glenn-jocher/activations

@nmaac
Author

nmaac commented May 17, 2021

@glenn-jocher @developer0hye In my experiments, the weight decay setting does not affect the results very much.

But I suggest trying another initialization approach:

self.p1 = nn.Parameter(torch.normal(1, 0.01, size=(1, width, 1, 1)))
self.p2 = nn.Parameter(torch.normal(0, 0.01, size=(1, width, 1, 1)))
self.beta = nn.Parameter(torch.normal(1, 0.01, size=(1, width, 1, 1)))
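Side note: with p1 near 1, p2 near 0 and beta near 1, ACON-C starts out very close to plain SiLU. A quick sanity check with the exact values (illustrative only):

import torch
from utils.activations import AconC

x = torch.randn(8, 64, 32, 32)
act = AconC(64)
with torch.no_grad():
    act.p1.fill_(1.0)    # new init draws p1 from N(1, 0.01)
    act.p2.fill_(0.0)    # p2 from N(0, 0.01)
    act.beta.fill_(1.0)  # beta from N(1, 0.01)
print(torch.allclose(act(x), torch.nn.functional.silu(x)))  # True: (1-0)*x*sigmoid(1*x) + 0*x == SiLU(x)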

@WongKinYiu

@nmaac

In my experiments:

  1. original SiLU, ~300 epochs: 51.9% AP

  2. old init, ~300 epochs: 50.6% AP

  3. new init, ~300 epochs: 51.5% AP

self.p1 = nn.Parameter(torch.normal(1, 0.01, size=(1, width, 1, 1)))
self.p2 = nn.Parameter(torch.normal(0, 0.01, size=(1, width, 1, 1)))
self.beta = nn.Parameter(torch.normal(1, 0.01, size=(1, width, 1, 1)))

And old init with weight decay drops 0.2% AP.

@glenn-jocher
Member

@nmaac @WongKinYiu got it, thanks guys!

@github-actions
Contributor

github-actions bot commented Jun 17, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

@iumyx2612
Contributor

Any updates on this? How's ACON performing?
