MultiCLIP: Multimodal-Multilabel-Multistage Classification using Language Image Pre-training

Motivation

Multimodal research is pivotal for advancing Artificial General Intelligence (AGI). Previous studies have explored models trained with contrastive losses, and CLIP and BLIP in particular have been highly influential. However, these models are mostly applied to tasks such as image-text retrieval, visual question answering, and conditional generation. Classification, despite being a fundamental task, has received comparatively little attention. This project therefore demonstrates a practical implementation of a classification task on top of pretrained models. We hope the examples in this repository will inspire further exploration and innovation.

Quick Start

How to Train a Model

To illustrate, consider training a BlipMLDecoderClassifier with a learning rate of 0.01. Execute the following command:

python3 train.py \
        --model_name blip_ml_decoder \
        --learning_rate 1e-2

The model weights will be stored at ./checkpoints/blip_ml_decoder_large_bce_v1_lr0.01_bs256_seed3407_loss.pth.
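
The checkpoint name above appears to encode the run's settings, which matters because predict.py needs the exact path. The sketch below rebuilds that name from its parts. The field meanings (model name, backbone size, loss, version tag, learning rate, batch size, seed, trailing tag) are guesses read off the example file name, and `build_checkpoint_path` is a hypothetical helper, not a function from this repository.

```python
def build_checkpoint_path(
    model_name: str,
    backbone: str,
    loss: str,
    version: str,
    learning_rate: float,
    batch_size: int,
    seed: int,
    tag: str,
    root: str = "./checkpoints",
) -> str:
    """Compose a checkpoint path in the style of the example file name.

    Assumption: the naming scheme is inferred from a single example and
    may not match what train.py actually does for other settings.
    """
    return (
        f"{root}/{model_name}_{backbone}_{loss}_{version}"
        f"_lr{learning_rate}_bs{batch_size}_seed{seed}_{tag}.pth"
    )


# Values taken from the example command and file name in this section.
path = build_checkpoint_path(
    model_name="blip_ml_decoder",
    backbone="large",
    loss="bce",
    version="v1",
    learning_rate=1e-2,
    batch_size=256,
    seed=3407,
    tag="loss",
)
print(path)
```

If the real scheme differs, listing the checkpoints directory after training is the reliable way to find the path.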

Note that a batch size of 256 with the BlipMLDecoderClassifier requires a GPU with at least 24 GB of memory. Our experiments were also run on a two-GPU system, with part of the model placed on the second GPU. On a single-GPU machine, consider an alternative model such as the BlipClassifier, which is the default model choice.

How to Make Predictions

After training your model, you can proceed to make predictions. For instance, to use the previously trained model for inference, execute the command below:

python3 predict.py \
        --checkpoint_path ./checkpoints/blip_ml_decoder_large_bce_v1_lr0.01_bs256_seed3407_loss.pth

The model's predictions will be saved to a .csv file.

Dataset

Links: Dataset on Kaggle

Backbones

Options: Base / Large (Default)

CLIP [1]

BLIP [2]

Multimodal: Fusion Strategies

CLIP

Router
~~Boosting~~

BLIP

Naive
Ensembling
~~Boosting~~
Graph Attention Transformer (GAT)
ML-Decoder [3]

Multilabel: Loss Functions

Binary Cross Entropy Loss with Logits (Default)
Smoothing Loss
Binary Focal Loss with Logits
Additive Angular Margin (AAM) Loss with Logits [4]
ZLPR Loss with Logits [5]
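
To make two of the listed losses concrete, here is a framework-free sketch for a single sample with one logit per label. It follows the standard published formulas, not necessarily the code in utils/losses.py: the focal loss omits the optional class-balancing weight, and the function names are illustrative.

```python
import math
from typing import Sequence


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def binary_focal_loss_with_logits(
    logits: Sequence[float], targets: Sequence[int], gamma: float = 2.0
) -> float:
    """Mean over labels of -(1 - p_t)^gamma * log(p_t).

    p_t is the sigmoid probability assigned to the true label.
    With gamma = 0 this reduces to binary cross entropy with logits.
    """
    total = 0.0
    for s, y in zip(logits, targets):
        # log(sigmoid(s)) for positives, log(1 - sigmoid(s)) for negatives
        log_pt = -softplus(-s) if y == 1 else -softplus(s)
        pt = math.exp(log_pt)
        total += -((1.0 - pt) ** gamma) * log_pt
    return total / len(logits)


def zlpr_loss_with_logits(logits: Sequence[float], targets: Sequence[int]) -> float:
    """ZLPR: log(1 + sum_pos exp(-s_i)) + log(1 + sum_neg exp(s_j)).

    Sketch only: very large logits would overflow exp() here.
    """
    pos = sum(math.exp(-s) for s, y in zip(logits, targets) if y == 1)
    neg = sum(math.exp(s) for s, y in zip(logits, targets) if y == 0)
    return math.log1p(pos) + math.log1p(neg)


# Three labels; the first is positive, the other two are negative.
targets = [1, 0, 0]
uncertain = [0.0, 0.0, 0.0]
confident = [4.0, -4.0, -4.0]
print(binary_focal_loss_with_logits(uncertain, targets))
print(zlpr_loss_with_logits(uncertain, targets))
print(zlpr_loss_with_logits(confident, targets))
```

Both losses shrink as positive logits rise above zero and negative logits fall below it. ZLPR uses zero as the decision threshold, so a label is predicted when its logit is positive.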

Multistage: Optimization

CLIP: Unimodal Warmup

Each unimodal classifier is first warmed up on its own until its performance plateaus. A router is then used to assign a weight to each classifier, which improves overall performance. Without the warmup, a unimodal classifier that is strong on its own tends to overshadow improvements in the classifiers for the other modalities. Unimodal warmup is therefore essential for optimizing CLIP-based models.
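
The router's role can be sketched as follows. This assumes the router emits one score per unimodal classifier and combines their logits with softmax weights. The actual design in models/router.py may differ, and every name here is illustrative.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw router scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    image_logits: Sequence[float],
    text_logits: Sequence[float],
    router_scores: Sequence[float],
) -> List[float]:
    """Blend per-label logits from two warmed-up unimodal classifiers.

    router_scores holds two values: one for the image classifier and
    one for the text classifier (an assumption of this sketch).
    """
    w_image, w_text = softmax(router_scores)
    return [w_image * i + w_text * t for i, t in zip(image_logits, text_logits)]


# Stage 1 (not shown): train each unimodal classifier until it plateaus.
# Stage 2: the router decides how much to trust each one.
image_logits = [2.0, -1.0]
text_logits = [0.0, 1.0]
fused = route(image_logits, text_logits, router_scores=[0.0, 0.0])
print(fused)
```

Equal router scores give each classifier a weight of 0.5. A larger score for one modality moves the fused logits toward that classifier's output.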

BLIP: Embedding Extraction

When computational resources are limited, it is advisable to extract the embeddings first and then train the ML-Decoder on them as a separate stage. Because the backbone no longer has to run during training, batch sizes can be scaled up to thousands of samples per batch.
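
The two-stage idea can be illustrated with a toy example: run the frozen encoder once, cache its outputs, then iterate over the cache in large batches for as many epochs as needed. The encoder below is a stand-in rather than BLIP, and the function names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]


def extract_embeddings(
    encoder: Callable[[str], Vector], samples: Sequence[str]
) -> List[Vector]:
    """Stage 1: run the frozen encoder once per sample and cache the result."""
    return [encoder(sample) for sample in samples]


def iter_batches(
    embeddings: Sequence[Vector], batch_size: int
) -> Iterator[Sequence[Vector]]:
    """Stage 2: yield large batches of cached embeddings for head training."""
    for start in range(0, len(embeddings), batch_size):
        yield embeddings[start:start + batch_size]


encoder_calls = 0


def toy_encoder(sample: str) -> Vector:
    """Stand-in for an expensive backbone; counts how often it is invoked."""
    global encoder_calls
    encoder_calls += 1
    return [float(len(sample)), float(sample.count("a"))]


samples = [f"sample_{i}" for i in range(5000)]
cached = extract_embeddings(toy_encoder, samples)

batch_sizes: List[int] = []
for epoch in range(3):
    # A real head trainer would compute a loss and update weights per batch.
    batch_sizes = [len(batch) for batch in iter_batches(cached, batch_size=2048)]

print(encoder_calls, batch_sizes)
```

The encoder runs once per sample no matter how many epochs the head trains for. This is why memory and compute per batch drop enough to allow very large batches.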

Project Structure

    
├── MultiCLIP/
│   ├── checkpoints/*.pth
│   ├── data/*.jpg
│   ├── figures/*.jpg
│   ├── models/*
│   ├── multi_clip/
│   │   ├── models/
│   │   │   ├── __init__.py
│   │   │   ├── blip_classifier.py
│   │   │   ├── clip_classifier.py
│   │   │   ├── config.py
│   │   │   ├── gat.py
│   │   │   ├── ml_decoder.py
│   │   │   └── router.py
│   │   ├── processors/
│   │   │   ├── __init__.py
│   │   │   ├── blip_processor.py
│   │   │   └── clip_processor.py
│   │   ├── trainers/
│   │   │   ├── __init__.py
│   │   │   ├── base_trainer.py
│   │   │   ├── boost_trainer.py
│   │   │   ├── clip_trainer.py
│   │   │   ├── head_trainer.py
│   │   │   └── ml_decoder_trainer.py
│   │   ├── utils/
│   │   │   ├── __init__.py
│   │   │   ├── inference_func.py
│   │   │   ├── label_encoder.py
│   │   │   ├── losses.py
│   │   │   ├── metrics.py
│   │   │   ├── predict_func.py
│   │   │   └── tools.py
│   │   ├── __init__.py
│   │   └── datasets.py
│   ├── .gitignore
│   ├── LICENSE
│   ├── label_encoder.npy
│   ├── predict.py
│   ├── README.md
│   ├── test.csv
│   ├── train_boost.py
│   ├── train.csv
│   └── train.py
└───

Reference

[1] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International Conference on Machine Learning. PMLR, 2021.
[2] Li, Junnan, et al. "BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation." International Conference on Machine Learning. PMLR, 2022.
[3] Ridnik, Tal, et al. "ML-Decoder: Scalable and versatile classification head." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023.
[4] Deng, Jiankang, et al. "ArcFace: Additive angular margin loss for deep face recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[5] Su, Jianlin, et al. "ZLPR: A novel loss for multi-label classification." arXiv preprint arXiv:2208.02955 (2022).
