jiawangbai/HAT

HAT

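Implementation of HAT, introduced in "Improving Vision Transformers by Revisiting High-frequency Components" (ECCV 2022): https://arxiv.org/pdf/2204.00993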

@inproceedings{bai2022improving,
  title={Improving Vision Transformers by Revisiting High-frequency Components},
  author={Bai, Jiawang and Yuan, Li and Xia, Shu-Tao and Yan, Shuicheng and Li, Zhifeng and Liu, Wei},
  booktitle={European Conference on Computer Vision},
  year={2022}
}

Requirements

torch>=1.7.0
torchvision>=0.8.0
timm==0.4.5
tlt==0.1.0
pyyaml
apex-amp

ImageNet Classification

Data Preparation

We use the ImageNet-1K training and validation datasets by default. Please save them in [your_imagenet_path].
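
Before launching a long run, it can help to confirm that the dataset folder is laid out as the data loader expects. The sketch below is only an illustration: it assumes the timm convention of a `train/` folder plus a `val/` (or `validation/`) folder under [your_imagenet_path], which this README does not state explicitly, and `check_imagenet_layout` is a hypothetical helper, not part of this repository.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return the names of expected split folders that are missing under root.

    ASSUMPTION: the scripts follow timm's layout (train/ plus val/ or
    validation/). Verify against the data-loading code before relying on this.
    """
    root = Path(root)
    missing = []
    if not (root / "train").is_dir():
        missing.append("train")
    # Accept either spelling of the validation folder.
    if not any((root / name).is_dir() for name in ("val", "validation")):
        missing.append("val")
    return missing
```

An empty list means both splits were found; otherwise the returned names indicate what to fix before training.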

Training

To train ViT models with HAT on 8 GPUs using the default settings from our paper:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model [your_vit_model_name] \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
and_other_parameters_specified_for_your_vit_models...

For instance, we train Swin-T with the following command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--adv-epochs 200 \
--adv-iters 3 \
--adv-eps 0.00784314 \
--adv-kl-weight 0.01 \
--adv-ce-weight 3.0 \
--output [your_output_path] \
--batch-size 256 \
--drop-path 0.2 \
--lr 1e-3 \
--weight-decay 0.05 \
--clip-grad 1.0
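
The value 0.00784314 passed to --adv-eps agrees with 2/255 to eight decimal places. This suggests, as an assumption rather than something stated here, that the perturbation budget corresponds to two 8-bit intensity levels on a [0, 1] pixel scale. The short check below makes the arithmetic explicit; `eps_in_pixel_levels` is a hypothetical helper.

```python
# ASSUMPTION: --adv-eps is expressed on a [0, 1] pixel scale.
ADV_EPS = 0.00784314  # value passed as --adv-eps in the commands above


def eps_in_pixel_levels(eps, levels=255):
    """Convert an epsilon on the [0, 1] scale to 8-bit intensity levels."""
    return eps * levels


print(round(eps_in_pixel_levels(ADV_EPS), 4))  # 2.0
```

If the training code applies the perturbation after mean/std normalization, the effective budget in pixel units would differ, so check the script before changing this value.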

For training variants of ViT, Swin Transformer, and VOLO, we use the hyper-parameters from [3], [4], and [2], respectively.
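
Since the HAT flags stay fixed and only the model-specific hyper-parameters change, the command can be assembled programmatically when sweeping over several models. The following is a minimal sketch: `build_train_command` and `HAT_FLAGS` are hypothetical names, and the flag values are copied from the commands above.

```python
import shlex

# HAT flags shared by all models (values copied from the commands above).
HAT_FLAGS = {
    "--adv-epochs": "200",
    "--adv-iters": "3",
    "--adv-eps": "0.00784314",
    "--adv-kl-weight": "0.01",
    "--adv-ce-weight": "3.0",
}


def build_train_command(data_dir, model, output, extra_flags=None, num_gpus=8):
    """Assemble the distributed_train.sh call as an argument list."""
    cmd = ["./distributed_train.sh", str(num_gpus),
           "--data_dir", data_dir, "--model", model]
    for flag, value in HAT_FLAGS.items():
        cmd += [flag, value]
    cmd += ["--output", output]
    # Model-specific hyper-parameters go last, as in the Swin-T example.
    for flag, value in (extra_flags or {}).items():
        cmd += [flag, str(value)]
    return cmd


swin_t = build_train_command(
    "[your_imagenet_path]", "swin_tiny_patch4_window7_224", "[your_output_path]",
    extra_flags={"--batch-size": 256, "--drop-path": 0.2, "--lr": "1e-3",
                 "--weight-decay": 0.05, "--clip-grad": 1.0},
)
print(shlex.join(swin_t))  # shell-quoted command line for inspection
```

The list form can be passed to `subprocess.run` directly; CUDA_VISIBLE_DEVICES is not part of the list and would be supplied through the environment.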

We also combine HAT with knowledge distillation in [5], using train_kd.py.

Validation

After training, we can use validate.py to evaluate the ViT model trained with HAT.

For instance, we evaluate Swin-T with the following command:

python3 -u validate.py \
--data_dir [your_imagenet_path] \
--model swin_tiny_patch4_window7_224 \
--checkpoint [your_checkpoint_path] \
--batch-size 128 \
--num-gpu 8 \
--apex-amp \
--results-file [your_results_file_path]

Results

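| Model | Params | FLOPs | Test Size | Top-1 (%) | +HAT Top-1 (%) | Download |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-T | 5.7M | 1.6G | 224 | 72.2 | 73.3 | link |
| ViT-S | 22.1M | 4.7G | 224 | 80.1 | 80.9 | link |
| ViT-B | 86.6M | 17.6G | 224 | 82.0 | 83.2 | link |
| Swin-T | 28.3M | 4.5G | 224 | 81.2 | 82.0 | link |
| Swin-S | 49.6M | 8.7G | 224 | 83.0 | 83.3 | link |
| Swin-B | 87.8M | 15.4G | 224 | 83.5 | 84.0 | link |
| VOLO-D1 | 26.6M | 6.8G | 224 | 84.2 | 84.5 | link |
| VOLO-D1 | 26.6M | 22.8G | 384 | 85.2 | 85.5 | link |
| VOLO-D5 | 295.5M | 69.0G | 224 | 86.1 | 86.3 | link |
| VOLO-D5 | 295.5M | 304G | 448 | 87.0 | 87.2 | link |
| VOLO-D5 | 295.5M | 412G | 512 | 87.1 | 87.3 | link |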

Combining HAT with the knowledge distillation of [5] yields 84.3% top-1 accuracy for ViT-B; the checkpoint can be downloaded here.
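
To read the per-model improvement off the classification results directly, the two Top-1 columns can be subtracted. The snippet below is a small illustration using the numbers from the table above; `hat_gain` is a hypothetical helper.

```python
# (model, test size, baseline top-1, +HAT top-1), copied from the results table.
RESULTS = [
    ("ViT-T", 224, 72.2, 73.3),
    ("ViT-S", 224, 80.1, 80.9),
    ("ViT-B", 224, 82.0, 83.2),
    ("Swin-T", 224, 81.2, 82.0),
    ("Swin-S", 224, 83.0, 83.3),
    ("Swin-B", 224, 83.5, 84.0),
    ("VOLO-D1", 224, 84.2, 84.5),
    ("VOLO-D1", 384, 85.2, 85.5),
    ("VOLO-D5", 224, 86.1, 86.3),
    ("VOLO-D5", 448, 87.0, 87.2),
    ("VOLO-D5", 512, 87.1, 87.3),
]


def hat_gain(baseline, with_hat):
    """Top-1 improvement in percentage points, rounded to one decimal."""
    return round(with_hat - baseline, 1)


gains = {(model, size): hat_gain(base, hat) for model, size, base, hat in RESULTS}
largest = max(gains, key=gains.get)

for (model, size), gain in gains.items():
    print(f"{model} @ {size}: +{gain}")
print("largest gain:", largest, gains[largest])
```

On these numbers the gains range from 0.2 to 1.2 percentage points, with the largest for ViT-B at 224.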

Downstream Tasks

We first pretrain Swin-T/S/B on the ImageNet-1K dataset with our proposed HAT, and then transfer the models to downstream tasks: object detection, instance segmentation, and semantic segmentation.

We use the code from Swin Transformer for Object Detection and Swin Transformer for Semantic Segmentation, and follow their configurations.

Cascade Mask R-CNN on COCO val 2017

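| Backbone | Params | FLOPs | Config | AP_box | +HAT AP_box | AP_mask | +HAT AP_mask |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Swin-T | 86M | 745G | config | 50.5 | 50.9 | 43.7 | 43.9 |
| Swin-S | 107M | 838G | config | 51.8 | 52.5 | 44.7 | 45.4 |
| Swin-B | 145M | 982G | config | 51.9 | 52.8 | 45.0 | 45.6 |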

UperNet on ADE20K

| Backbone | Params | FLOPs | Config | mIoU (MS) | +HAT mIoU (MS) |
| --- | --- | --- | --- | --- | --- |
| Swin-T | 60M | 945G | config | 46.1 | 46.7 |
| Swin-S | 81M | 1038G | config | 49.5 | 49.7 |
| Swin-B | 121M | 1088G | config | 49.7 | 50.3 |

[1] Wightman, R. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
[2] Yuan, L. et al. Volo: Vision outlooker for visual recognition. arXiv, 2021.
[3] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
[4] Liu, Z. et al. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
[5] Touvron, H. et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.