
MAFT+ (ECCV 2024 oral)

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Siyu Jiao 1,2, Hongguang Zhu 1,2, Jiannan Huang 1,3, Yao Zhao 1,2, Yunchao Wei 1,2, Humphrey Shi 3,4

1 Beijing Jiaotong University, 2 Pengcheng Lab, 3 Georgia Institute of Technology, 4 Picsart AI Research (PAIR)

[Paper]


Introduction

This work is an enhanced version of our NeurIPS paper MAFT.

Pre-trained vision-language models, e.g., CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions either freeze CLIP during training to unilaterally preserve its zero-shot capability, or fine-tune the CLIP vision encoder to gain perceptual sensitivity to local regions; few of them incorporate vision-text collaborative optimization. To address this, we propose the Content-Dependent Transfer, which adaptively enhances each text embedding by interacting with the input image, providing a parameter-efficient way to optimize the text representation. In addition, we introduce a Representation Compensation strategy that reviews the original CLIP-V representation as compensation, maintaining the zero-shot capability of CLIP. In this way, the vision and text representations of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish a collaborative vision-text optimization mechanism within the OVS field.

Extensive experiments demonstrate that our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, it outperforms the previous state-of-the-art by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU on A-847, A-150, PC-459, PC-59 and PAS-20, respectively. In the panoptic setting on ADE20K, it achieves 27.1 PQ, 73.5 SQ, and 32.9 RQ.
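
To make the Content-Dependent Transfer concrete, the sketch below shows one way to picture it: the text embeddings act as queries in a single cross-attention layer over the image features, with a residual back to the original CLIP-T embeddings. This is an illustrative sketch, not the repo's implementation; the class name, dimensions, and residual placement are our assumptions.

    import torch
    import torch.nn as nn

    class ContentDependentTransfer(nn.Module):
        """Illustrative sketch: make each text embedding content-dependent."""

        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            # A single cross-attention layer keeps the transfer parameter-efficient.
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, text_emb: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
            # text_emb: (B, num_classes, dim) CLIP-T class embeddings
            # img_feat: (B, num_patches, dim) CLIP-V patch features
            delta, _ = self.attn(query=text_emb, key=img_feat, value=img_feat)
            # The residual keeps the enhanced embeddings close to the CLIP-T space.
            return self.norm(text_emb + delta)

    cdt = ContentDependentTransfer(dim=768)
    text = torch.randn(2, 150, 768)   # e.g. the 150 ADE20K class names
    image = torch.randn(2, 196, 768)  # a 14x14 patch grid of image features
    print(cdt(text, image).shape)     # torch.Size([2, 150, 768])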

Installation

  1. Clone the repository
    git clone https://github.com/jiaosiyu1999/MAFT_Plus.git
    
  2. Navigate to the project directory
    cd MAFT_Plus
    
  3. Install the dependencies
    bash install.sh
    cd maft/modeling/pixel_decoder/ops
    sh make.sh
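
  4. (Optional) Sanity-check the ops build. The snippet below (not part of the repo) confirms the compiled extension is importable; the module name follows the Mask2Former-style ops build this repo inherits, so treat it as an assumption if your build differs.
    import torch

    try:
        import MultiScaleDeformableAttention  # compiled extension built by make.sh
        print("deformable attention op: OK")
    except ImportError as e:
        print(f"ops build missing or not importable: {e}")

    print("CUDA available:", torch.cuda.is_available())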
    

Data Preparation

See MAFT for reference (Preparing Datasets for MAFT). The data should be organized as follows:

datasets/
  ade/
    ADEChallengeData2016/
      images/
      annotations_detectron2/
    ADE20K_2021_17_01/
      images/
      annotations_detectron2/
  coco/
    train2017/
    val2017/
    stuffthingmaps_detectron2/
  VOCdevkit/
    VOC2012/
      images_detectron2/
      annotations_ovs/
    VOC2010/
      images/
      annotations_detectron2_ovs/
        pc59_val/
        pc459_val/
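
Before training or evaluation, a small check like the one below (not part of the repo) can confirm the layout matches the tree above; the paths are taken directly from it.

    from pathlib import Path

    # Folders expected by the layout above; extend with images/ splits as needed.
    EXPECTED = [
        "datasets/ade/ADEChallengeData2016/annotations_detectron2",
        "datasets/ade/ADE20K_2021_17_01/annotations_detectron2",
        "datasets/coco/stuffthingmaps_detectron2",
        "datasets/VOCdevkit/VOC2012/annotations_ovs",
        "datasets/VOCdevkit/VOC2010/annotations_detectron2_ovs/pc59_val",
        "datasets/VOCdevkit/VOC2010/annotations_detectron2_ovs/pc459_val",
    ]

    missing = [p for p in EXPECTED if not Path(p).is_dir()]
    print("all dataset folders present" if not missing else f"missing: {missing}")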

Usage

  • Pretrained Weights

    1. Semantic

       | Model       | A-847 | A-150 | PC-459 | PC-59 | PAS-20 | Weights     |
       |-------------|-------|-------|--------|-------|--------|-------------|
       | MAFTP-Base  | 13.8  | 34.6  | 16.2   | 57.5  | 95.4   | maftp_b.pth |
       | MAFTP-Large | 15.1  | 36.1  | 21.6   | 59.4  | 96.5   | maftp_l.pth |

    2. Panoptic

       | Model       | PQ   | SQ   | RQ   | Weights          |
       |-------------|------|------|------|------------------|
       | MAFTP-Large | 27.1 | 73.5 | 32.9 | maftp_l_pano.pth |
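
    Once a checkpoint is downloaded, a few lines of PyTorch (not part of the repo) can confirm it loads; the path matches the evaluation example below.

    import torch

    # detectron2-style checkpoints usually nest the weights under a "model" key.
    # On newer PyTorch you may need torch.load(..., weights_only=False).
    ckpt = torch.load("out/semantic/MAFT_Plus/maftp_l.pth", map_location="cpu")
    state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt
    print(f"{len(state)} entries; first few keys:")
    for name in list(state)[:5]:
        print(" ", name)
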
  • Evaluation

    Evaluate a trained model on the validation sets of all datasets:

    python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH>
    

    For example, evaluate our pre-trained maftp_l.pth model:

    # 1. Download MAFTP-Large.
    # 2. Put it at `out/semantic/MAFT_Plus/maftp_l.pth`.
    # 3. Run the evaluation:
      python train_net.py --config-file configs/semantic/eval.yaml --num-gpus 8 --eval-only \
                          MODEL.WEIGHTS out/semantic/MAFT_Plus/maftp_l.pth
    

  • Training

    End-to-end training requires 8 A100 GPUs and takes approximately 14 hours:

    # MAFT-Plus-Large (maftp-l)
    python train_net.py --config-file configs/semantic/train_semantic_large.yaml  --num-gpus 8

    # MAFT-Plus-Base (maftp-b)
    python train_net.py --config-file configs/semantic/train_semantic_base.yaml  --num-gpus 8
  • Inference Demo with Pre-trained Models

    We provide demo/demo.py, which can run inference with the built-in configs. Run it with:
    python demo/demo.py \
      --input input1.jpg input2.jpg \
      [--other-options] \
      --opts MODEL.WEIGHTS /path/to/checkpoint_file
    
    For example, to run the demo with our pre-trained maftp_l.pth model:
    # 1. Download MAFTP-Large.
    # 2. Put it at `out/semantic/MAFT_Plus/maftp_l.pth`.
    # 3. Run the demo:
      python demo/demo.py --input im.png
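
    To process a whole folder, a thin wrapper (not part of the repo) can feed every image to the same script; the samples/ folder name below is a placeholder.

    import subprocess
    from pathlib import Path

    # Collect images from a hypothetical samples/ folder and pass them all
    # to demo/demo.py in a single call.
    images = sorted(str(p) for p in Path("samples").glob("*.png"))
    if images:
        subprocess.run(["python", "demo/demo.py", "--input", *images], check=True)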
    

Cite

If this codebase is useful to you, please consider citing:

@inproceedings{jiao2024collaborative,
  title={Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation},
  author={Jiao, Siyu and Zhu, Hongguang and Huang, Jiannan and Zhao, Yao and Wei, Yunchao and Shi, Humphrey},
  booktitle={European Conference on Computer Vision},
  year={2024},
}

Acknowledgement

Mask2Former

FC-CLIP
