Contrastive Language-Image Pre-Training with EVA (EVA-CLIP)

Model Card

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
| --- | --- | --- | --- | --- | --- | --- |
| eva_clip_psz14 | 1.1B | fp16 | LAION-400M | 41K | 78.5 | 🤗 HF link (2GB) |

We choose to train a 1.1B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glance at the challenges of training very large CLIP models.

To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model, measured by zero-shot classification performance, especially on mainstream benchmarks such as ImageNet and its variants. For more details about EVA-CLIP, please refer to Section 2.3.5 of our paper.

We hope that open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc., and that our solution for scaling up CLIP can provide insight for practitioners studying large foundation models.

Performance of EVA-CLIP Vision Encoder on ImageNet-1K

| model | zero-shot @ 224px | linear probing @ 224px | linear probing @ 336px | fine-tuning @ 224px | fine-tuning @ 336px |
| --- | --- | --- | --- | --- | --- |
| EVA-CLIP | 78.5 (weight \| log) | 86.5 (weight \| log) | 86.5 (weight \| log) | 89.1 (weight \| log) | 89.4 (weight \| log) |

EVA-CLIP achieves the state-of-the-art top-1 accuracy on ImageNet-1K among all self-supervised learning approaches. We will provide instructions for reproducing these results soon.
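
As a rough sketch of what the linear-probing protocol above involves (not our official recipe; the dataset path, hyper-parameters, and the single-epoch loop below are placeholder assumptions), one can freeze the EVA-CLIP vision tower and train a single linear classifier on its features:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from eva_clip import build_eva_model_and_transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# frozen EVA-CLIP vision tower used as a fixed feature extractor
model, preprocess = build_eva_model_and_transforms("EVA_CLIP_g_14", pretrained="/path/to/eva_clip_psz14.pt")
model = model.to(device).eval()

# infer the feature dimension with a dummy forward pass
with torch.no_grad():
    feat_dim = model.encode_image(torch.zeros(1, 3, 224, 224, device=device)).shape[-1]

train_set = ImageFolder("/path/to/imagenet-1k/train", transform=preprocess)  # placeholder path
loader = DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)

head = nn.Linear(feat_dim, 1000).to(device)           # 1000 ImageNet-1K classes
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for images, labels in loader:                         # one epoch shown; tune the schedule as needed
    images, labels = images.to(device), labels.to(device)
    with torch.no_grad():                             # backbone stays frozen
        feats = model.encode_image(images).float()
    loss = criterion(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Fine-tuning follows the same pipeline except that the backbone weights are updated as well, typically with a much smaller learning rate.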

EVA-CLIP Zero-shot Evaluation Results

Zero-shot Image Classification Evaluation

The top-1 accuracy of ImageNet-1K variants and ObjectNet.

| model | IN-1K | IN-V2 | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI CLIP-L | 75.55 | 69.86 | 70.76 | 87.83 | 59.58 | 68.98 |
| Open CLIP-H | 77.96 | 70.87 | 59.33 | 89.33 | 66.58 | 69.71 |
| Open CLIP-g | 76.65 | 69.56 | 57.19 | 88.69 | 65.17 | 67.53 |
| EVA CLIP-g | 78.53 | 71.52 | 73.59 | 92.5 | 67.31 | 72.33 |
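
For reference, these zero-shot numbers are obtained in the standard CLIP fashion: encode a text prompt per class, encode each test image, and predict the class whose text embedding has the highest cosine similarity with the image embedding. Below is a minimal sketch, not the exact evaluation code; it uses a single prompt template and a placeholder ImageFolder layout with human-readable class-folder names, whereas the reported results come from a fuller protocol with prompt ensembling.

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder
from eva_clip import build_eva_model_and_transforms
from clip import tokenize

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = build_eva_model_and_transforms("EVA_CLIP_g_14", pretrained="/path/to/eva_clip_psz14.pt")
model = model.to(device).eval()

val_set = ImageFolder("/path/to/imagenet-1k/val", transform=preprocess)   # placeholder path
class_names = [c.replace("_", " ") for c in val_set.classes]              # assumes readable folder names

# build the zero-shot classifier: one L2-normalized text embedding per class
with torch.no_grad():
    text = tokenize([f"a photo of a {name}." for name in class_names]).to(device)
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

correct = total = 0
with torch.no_grad():
    for images, labels in DataLoader(val_set, batch_size=64, num_workers=8):
        image_features = model.encode_image(images.to(device))
        image_features /= image_features.norm(dim=-1, keepdim=True)
        pred = (image_features @ text_features.T).argmax(dim=-1).cpu()
        correct += (pred == labels).sum().item()
        total += labels.numel()

print(f"zero-shot top-1: {100 * correct / total:.2f}%")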

Zero-shot Video Action Recognition Evaluation

The performance of video action recognition benchmarks.

| model | UCF-101 | Kinetics-400 | Kinetics-600 | Kinetics-700 |
| --- | --- | --- | --- | --- |
| OpenAI CLIP-L | 76.39 | 64.47 | 64.21 | 57.68 |
| Open CLIP-H | 78.16 | 63.06 | 63.58 | 56.09 |
| Open CLIP-g | 77.73 | 61.69 | 62.16 | 54.99 |
| EVA CLIP-g | 76.05 | 65.23 | 64.38 | 58.4 |

For video action recognition, we sample only a single center frame from each video, turning the task into image classification. Following conventional settings, we report the top-1 accuracy for UCF-101 and the mean of top-1 and top-5 accuracy for Kinetics-400/600/700.
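
A sketch of this center-frame protocol, assuming torchvision's video reader, a placeholder clip path, and a small placeholder label set (any decoder that yields RGB frames works the same way):

import torch
from PIL import Image
from torchvision.io import read_video
from eva_clip import build_eva_model_and_transforms
from clip import tokenize

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = build_eva_model_and_transforms("EVA_CLIP_g_14", pretrained="/path/to/eva_clip_psz14.pt")
model = model.to(device).eval()

# read the video as a (T, H, W, C) uint8 tensor and keep only the center frame
frames, _, _ = read_video("/path/to/ucf101_clip.avi", pts_unit="sec")
center_frame = Image.fromarray(frames[len(frames) // 2].numpy())

# treat the frame as an ordinary image and classify it against action-name prompts
action_names = ["archery", "bowling", "surfing"]          # placeholder subset of class names
image = preprocess(center_frame).unsqueeze(0).to(device)
text = tokenize([f"a photo of a person doing {a}." for a in action_names]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("action probs:", probs)

Running this over a labelled validation set and aggregating the predictions yields the top-1 (and, for Kinetics, top-5) accuracies reported above.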

Zero-shot Retrieval Evaluation

| Dataset | Model | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Flickr30k | OpenAI CLIP-L | 65.18 | 87.28 | 92 | 85.2 | 97.3 | 99 |
| Flickr30k | Open CLIP-H | 77.78 | 94.14 | 96.62 | 90.8 | 99.3 | 99.7 |
| Flickr30k | Open CLIP-g | 76.52 | 93.62 | 96.28 | 90.8 | 99.1 | 99.8 |
| Flickr30k | EVA CLIP-g | 72.64 | 91.6 | 95.12 | 88.3 | 98.3 | 99.3 |
| MSCOCO | OpenAI CLIP-L | 36.51 | 61.01 | 71.11 | 56.34 | 79.32 | 86.66 |
| MSCOCO | Open CLIP-H | 49.47 | 73.4 | 81.53 | 65.96 | 86.06 | 91.9 |
| MSCOCO | Open CLIP-g | 47.99 | 72.37 | 80.75 | 64.96 | 85.3 | 91.46 |
| MSCOCO | EVA CLIP-g | 44.07 | 68.5 | 77.33 | 61.76 | 83.28 | 89.96 |
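
Here R@K is the fraction of queries whose ground-truth match appears among the top-K retrieved items ranked by cosine similarity. A minimal sketch over precomputed, L2-normalized embeddings (it assumes a single correct gallery item per query; the actual Flickr30k/MSCOCO protocol accounts for the five captions per image):

import torch

def recall_at_k(query_feats, gallery_feats, gt_index, k):
    # query_feats:   (Q, D) L2-normalized query embeddings (e.g. text features)
    # gallery_feats: (G, D) L2-normalized gallery embeddings (e.g. image features)
    # gt_index:      (Q,)   index of the correct gallery item for each query
    sims = query_feats @ gallery_feats.T                 # cosine similarities, (Q, G)
    topk = sims.topk(k, dim=-1).indices                  # (Q, k) best gallery indices per query
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)  # did the ground truth make the top-k?
    return hits.float().mean().item()

# toy usage with random features (5 queries, 10 gallery items, 4-dim embeddings)
q = torch.nn.functional.normalize(torch.randn(5, 4), dim=-1)
g = torch.nn.functional.normalize(torch.randn(10, 4), dim=-1)
gt = torch.arange(5)
print("R@1:", recall_at_k(q, g, gt, 1), "R@5:", recall_at_k(q, g, gt, 5))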

The zero-shot retrieval performance of EVA-CLIP lags behind its Open CLIP-H / -g counterparts. We speculate there are two main reasons:

  • The language tower of EVA-CLIP is much smaller and weaker than those of Open CLIP-H and Open CLIP-g (124M vs. 354M parameters), and is only ~1/8 the size of the vision tower. Meanwhile, retrieval tasks depend more on the capacity of the language branch than classification tasks do.
  • Retrieval tasks seem to benefit more from a larger training dataset (Open CLIP uses LAION-2B), whereas we only use LAION-400M for EVA-CLIP training. Nevertheless, it is hard to make a head-to-head comparison between different CLIP models. In the future, we will further scale up the language encoder and training data to improve retrieval performance.

Usage

Using EVA-CLIP is similar to using OpenAI CLIP and Open CLIP. Here we provide a showcase of zero-shot image classification.

First, install PyTorch 1.7.1 (or later) and torchvision, as well as small additional dependencies, and then install this repo as a Python package. On a CUDA GPU machine, the following will do the trick:

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm

The training code of our 1.1B EVA-CLIP will be available at FlagAI. Please stay tuned.

An example:

import torch
from eva_clip import build_eva_model_and_transforms
from clip import tokenize
from PIL import Image

eva_clip_path = "/path/to/eva_clip_psz14.pt" # https://huggingface.co/BAAI/EVA/blob/main/eva_clip_psz14.pt
model_name = "EVA_CLIP_g_14"
image_path = "CLIP.png"
caption = ["a diagram", "a dog", "a cat"]

device = "cuda" if torch.cuda.is_available() else "cpu"
# load the pretrained EVA-CLIP weights and the matching image preprocessing transforms
model, preprocess = build_eva_model_and_transforms(model_name, pretrained=eva_clip_path)
model = model.to(device)

image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
text = tokenize(caption).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # scaled cosine similarities over the candidate captions -> probabilities
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [1.0000e+00, 2.0857e-10, 4.8534e-12]

Acknowledgement

EVA-CLIP is built with OpenAI CLIP, Open CLIP and CLIP Benchmark. Thanks for their awesome work!