[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

zzxslp/SoM-LLaVA

📝 [COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Empowering Open-Source Multimodal LLMs with Set-of-Mark Prompting and Improved Visual Reasoning Ability.

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs [Paper] [HF Model]

📣 Note: Our new dataset is complementary to existing training sources. Add it to your training set to boost your multimodal LLMs with Set-of-Mark prompting and improved general capacity, at no extra cost at inference time!

🔥 News

  • [08/20] BLIP-3 is out! Our dataset is used in the finetuning stage of BLIP-3 to boost performance!
  • [07/10] Our paper is accepted at COLM-2024, see you in Philly!
  • [04/26] Thanks to AK and HF Daily Papers for featuring our work!
  • [04/25] Our paper is on arXiv! [Paper]
  • [04/23] Models and datasets of SoM-LLaVA are released! [HF Model] [Dataset]

📊 Results

Method             GQA   POPE  MME     MMB   SEED-I  LLaVA-Wild  MM-VET
LLaVA-1.5-7B       62.0  85.9  1464.0  65.4  64.8    63.4        30.5
SoM-LLaVA-1.5-7B   62.7  86.5  1507.0  66.5  67.0    66.9        33.3
LLaVA-1.5-13B      63.3  85.9  1531.3  68.9  68.2    70.7        35.4
SoM-LLaVA-1.5-13B  63.8  86.6  1563.1  69.5  69.6    75.3        35.9

📣 Note:

We obtain 1% to 6% relative improvements on all MLLM benchmarks simply by adding 30k SoM examples to the visual instruction tuning stage of LLaVA.
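To inspect the per-benchmark change yourself, here is a minimal sketch that recomputes the relative gain from the 7B rows of the results table. It assumes "relative improvement" means (new - old) / old; the paper may define or round it differently, and the names `SCORES_7B` and `relative_gain` are ours.

```python
# Scores copied from the results table: (LLaVA-1.5-7B, SoM-LLaVA-1.5-7B).
SCORES_7B = {
    "GQA": (62.0, 62.7),
    "POPE": (85.9, 86.5),
    "MME": (1464.0, 1507.0),
    "MMB": (65.4, 66.5),
    "SEED-I": (64.8, 67.0),
    "LLaVA-Wild": (63.4, 66.9),
    "MM-VET": (30.5, 33.3),
}


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement in percent: (improved - baseline) / baseline * 100."""
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Print one line per benchmark with the gain rounded to two decimals.
    for name, (base, som) in SCORES_7B.items():
        print(f"{name:<12} {relative_gain(base, som):5.2f}%")
```

Swap in the 13B rows to compare the larger models the same way.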

You can optionally feed the model tagged images at inference time to boost performance on some benchmarks, but the performance gain is available with standard images alone!

🌱 SoM Dataset

[Training data for SoM-LLaVA]

som_llava_mix695k.json: Full SFT data with llava-665k + SoM-30k

som_listing_coco10k.json: listing all items with SoM images.

som_qa_coco20k.json: QA with SoM images. (Note: QA uses the same 10k images as listing, plus another batch of 10k.)

som_train2017.zip: A subset of 20k COCO images annotated with SoM, used in our data construction.
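After downloading, a quick sanity check is to count how many records point at each image folder. The sketch below assumes the files follow the LLaVA annotation layout (a JSON list of records, each with an optional "image" path such as "coco/som_train2017/xxx.jpg"); that layout is our assumption, and the helper names are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_annotations(path):
    """Load a JSON annotation file that is expected to hold a list of records."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def count_by_image_source(records):
    """Count records by the top-level folder of their "image" path.

    Records without an "image" key are counted under "text-only".
    """
    counts = Counter()
    for rec in records:
        image = rec.get("image")
        counts[Path(image).parts[0] if image else "text-only"] += 1
    return counts


if __name__ == "__main__":
    path = Path("som_llava_mix695k.json")  # adjust to your download location
    if path.exists():
        counts = count_by_image_source(load_annotations(path))
        for source, n in counts.most_common():
            print(f"{source:<12} {n}")
```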

🍰 Model Checkpoints

We release our main model, SoM-LLaVA, trained with LLaVA-665k and SoM-style Listing + QA data.

[SoM-LLaVA-v1.5-13B-HF] (model weights converted into HF format, see usage below)

[SoM-LLaVA-v1.5-7B] (model weights in original LLaVA format, load and eval with LLaVA)

[SoM-LLaVA-v1.5-13B] (model weights in original LLaVA format, load and eval with LLaVA)

Two additional models for ablation study:

[SoM-LLaVA-v1.5-13B-listing]

[SoM-LLaVA-v1.5-13B-qa]

🍡 Showcases

🍄 Training

We adopt the training code of LLaVA. Please set up the environment following its instructions. Currently, our data is used in the Visual Instruction Tuning stage.

  1. Prepare data

Please download the annotation file for our final instruction tuning mixture, som_llava_mix695k.json, which combines llava_mix665k with 30k SoM data, and download the images from the following datasets: COCO (train2017, plus our som_train2017), GQA, OCR-VQA, TextVQA, and Visual Genome (VG_100K and VG_100K_2).

After downloading all of them, organize the data as follows in your data folder.

├── coco
│   ├── train2017
│   └── som_train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
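
Before launching training, it can help to confirm that the layout above is in place. This is a small sketch, not part of the repo: the folder list is copied from the tree, while `missing_dirs` and the `data` root are hypothetical names.

```python
from pathlib import Path

# Sub-folders copied from the directory tree above.
EXPECTED_DIRS = [
    "coco/train2017",
    "coco/som_train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that do not exist under data_root."""
    root = Path(data_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dirs("data")  # hypothetical data folder; adjust as needed
    if missing:
        print("Missing folders:")
        for d in missing:
            print(f"  {d}")
    else:
        print("Data layout looks complete.")
```
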

  2. Training

After downloading our data (or preparing your own SoM data), train SoM-LLaVA via command line:

bash scripts/v1_5/finetune.sh
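
If you prepare your own SoM data, one way to combine it with an existing instruction tuning set is to concatenate the two annotation lists into a single JSON file. The sketch below assumes both files are LLaVA-style lists of records with a "conversations" field; the file names are hypothetical, and the repo may build its mixture differently.

```python
import json
import os


def merge_annotations(base_records, extra_records):
    """Concatenate two lists of LLaVA-style records after a basic format check."""
    merged = list(base_records) + list(extra_records)
    for i, rec in enumerate(merged):
        if "conversations" not in rec:
            raise ValueError(f"record {i} has no 'conversations' field")
    return merged


if __name__ == "__main__":
    base_path = "llava_v1_5_mix665k.json"  # hypothetical local file name
    extra_path = "my_som_data.json"        # hypothetical local file name
    if os.path.exists(base_path) and os.path.exists(extra_path):
        with open(base_path, encoding="utf-8") as f:
            base = json.load(f)
        with open(extra_path, encoding="utf-8") as f:
            extra = json.load(f)
        with open("my_mix.json", "w", encoding="utf-8") as f:
            json.dump(merge_annotations(base, extra), f)
```

Point the data path used by the finetuning script at the merged file.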

❄️ Using SoM

Note: Our implementation improves on the original SoM repo by removing overlapping regions from each mask; otherwise, tag positions can conflict or overlap.
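
To illustrate the idea, here is a minimal sketch of one way to make a set of binary masks mutually exclusive. It is not the repo's actual code: the smallest-first priority and the name `remove_overlaps` are our assumptions.

```python
import numpy as np


def remove_overlaps(masks):
    """Make boolean masks mutually exclusive.

    Masks are visited from smallest to largest area, and each mask keeps only
    the pixels not already claimed by a previously visited mask. The output
    list is in the same order as the input list.
    """
    if not masks:
        return []
    order = sorted(range(len(masks)), key=lambda i: int(np.count_nonzero(masks[i])))
    claimed = np.zeros(masks[0].shape, dtype=bool)
    exclusive = [None] * len(masks)
    for i in order:
        region = np.asarray(masks[i], dtype=bool) & ~claimed
        exclusive[i] = region
        claimed |= region
    return exclusive
```

With non-overlapping regions, each tag can be placed inside its own region without landing on a neighboring mask.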

  • Init virtual envs
# create env. Note: use Python 3.10; Python 3.11 causes package conflicts.
conda create -n som python=3.10 -y
conda activate som
  • Install libgeos if there is an error installing SEEM
sudo apt-get update
sudo apt-get install libgeos-c1v5 libgeos-dev
  • Install segmentation packages
# download repo and navigate to SoM folder
git clone https://github.com/zzxslp/SoM-LLaVA.git
cd SoM-LLaVA/SoM/

# install PyTorch
pip3 install torch torchvision torchaudio

# install SEEM
pip install git+https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once.git@package
# install SAM
pip install git+https://github.com/facebookresearch/segment-anything.git
# install Semantic-SAM
pip install git+https://github.com/UX-Decoder/Semantic-SAM.git@package
# install Deformable Convolution for Semantic-SAM
cd ops && sh make.sh && cd ..

# common error fix:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'

# install additional packages
pip install datasets
  • Download the pretrained models
sh download_ckpt.sh
  • Annotate COCO images with SoM
python annotate_coco.py

😊 Using LLaVA in HF

If you would like to load our model with Hugging Face Transformers, here is an example script:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

# Load the HF-format checkpoint and its matching processor
model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

# LLaVA-1.5 prompt format; <image> marks where the image is inserted
prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate up to 20 new tokens, then decode the full sequence (prompt + answer)
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
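
The decoded string contains the prompt as well as the model's reply, since generation returns the input tokens followed by the new ones. A small sketch for building prompts in the format used above and keeping only the reply follows; `build_prompt` and `extract_answer` are illustrative helpers, not part of the repo or of Transformers.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the single-turn prompt format used in the example."""
    return f"USER: <image>\n{question} ASSISTANT:"


def extract_answer(decoded: str) -> str:
    """Return the text after the last "ASSISTANT:" marker.

    If the marker is absent, the whole string is returned (stripped).
    """
    _, found, answer = decoded.rpartition("ASSISTANT:")
    return answer.strip() if found else decoded.strip()
```

For example, `extract_answer(output)` would drop the "USER: ..." part of the string printed by the script above.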

Note: to reproduce the results reported in the paper, we recommend using the official LLaVA repo with our LLaVA-format model.

🐱 Citation

If you find our data or model useful for your research and applications, please cite our paper:

@article{yan2024list,
  title={List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs},
  author={Yan, An and Yang, Zhengyuan and Wu, Junda and Zhu, Wanrong and Yang, Jianwei and Li, Linjie and Lin, Kevin and Wang, Jianfeng and McAuley, Julian and Gao, Jianfeng and others},
  journal={arXiv preprint arXiv:2404.16375},
  year={2024}
}

🍻 Acknowledgments

This project is a collaboration between UC San Diego, Microsoft GenAI, and Microsoft Research, built on top of SoM and LLaVA. We thank the authors for their contributions to the community!
