This repo contains the code and data for our benchmark paper:
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal LLMs
H. Wang, H. Shi, S. Tan, W. Qin, W. Wang, T. Zhang, A. Nambi, T. Ganu, H. Wang
[Paper] [MMNeedle Dataset]
To use this benchmark, please download the MMNeedle dataset from this link. Alternatively, you can construct your own version of MMNeedle by following the instructions below.
[2024-06-27] New project page set up for MMNeedle.
[2024-06-24] We released the leaderboard for Multimodal Long Context Understanding on Papers with Code!
[2024-06-17] We released the paper, code, and data for Multimodal Needle in a Haystack (MMNeedle) benchmark!
![Screen Shot 2024-06-17 at 7 38 45 PM](https://private-user-images.githubusercontent.com/30172609/340525052-cf481db4-ac83-4940-8897-e27d4faab4a8.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA2OTA2MTIsIm5iZiI6MTcyMDY5MDMxMiwicGF0aCI6Ii8zMDE3MjYwOS8zNDA1MjUwNTItY2Y0ODFkYjQtYWM4My00OTQwLTg4OTctZTI3ZDRmYWFiNGE4LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzExVDA5MzE1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg0YWRjOWM4NjliZmJhNzgyYzg2YzU3Yzg0MWI4MThiZjgxOWE4MWQ2MDUwMjI2NWU1Mjk0ODMxODRkODE5NDcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.iOALOmmWM8_yF0kmzPTjy_X4lJ6V_oY3QutZmLD7j1k)
MMNeedle Evaluation Overview. Correct answers are marked with a checkmark, and incorrect answers with a cross.
![Screen Shot 2024-06-17 at 7 39 52 PM](https://private-user-images.githubusercontent.com/30172609/340524882-e105e2f6-0585-4cbc-9e56-0f588134412d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MjA2OTA2MTIsIm5iZiI6MTcyMDY5MDMxMiwicGF0aCI6Ii8zMDE3MjYwOS8zNDA1MjQ4ODItZTEwNWUyZjYtMDU4NS00Y2JjLTllNTYtMGY1ODgxMzQ0MTJkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MTElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzExVDA5MzE1MlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM5MTQxN2E0MmYwZDZmODAzOGQ4MTEyZTg3MWRkYmI2N2FkMzFhNGM2NTQzYzcwN2E5NDE2MGFlZTA5ZDBjZDQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.1GDD8PV6KhIjOg_6l-K-V8ocdU-UVn4kdw-A7vM8SJY)
MMNeedle Evaluation Performance Comparison (Claude-3 refers to Claude 3 Opus, and Gemini-1.0/1.5 refers to Gemini Pro 1.0/1.5). The x-axis shows the different models, and the y-axis shows the results for various numbers of input images M and stitching sizes N. For each row, i.e., setting (M,N), we show the average accuracy (%) of each model. Within each stitched image, the color of the cell at row r, column c indicates the accuracy of predicting the exact position for samples whose "needle" sub-image is at position (r,c) of the stitched image. For the M=10 setting, we show the average accuracy of each location (r,c) over 10 images. A redder cell indicates lower accuracy, while a greener cell indicates higher accuracy. The best result in each row is underlined.
conda env create -f context.yml
Download MS COCO
Put the val2014 and annotations_trainval directories in the current directory.
python ./annotations_trainval/file_to_caption.py
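The caption-extraction step above builds a mapping from each COCO image to a caption. A minimal sketch of that mapping, assuming the standard COCO `captions_val2014.json` layout (the exact output format of `file_to_caption.py` may differ):

```python
def image_to_caption(annotations):
    """Map each COCO image file name to its first caption."""
    id_to_file = {img["id"]: img["file_name"] for img in annotations["images"]}
    captions = {}
    for ann in annotations["annotations"]:
        fname = id_to_file[ann["image_id"]]
        captions.setdefault(fname, ann["caption"])  # keep the first caption per image
    return captions

# Tiny in-memory example mimicking the COCO annotation layout.
coco = {
    "images": [{"id": 1, "file_name": "COCO_val2014_000000000001.jpg"}],
    "annotations": [{"image_id": 1, "caption": "A cat sitting on a couch."}],
}
mapping = image_to_caption(coco)
```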
python sample_images.py
python sample_stitched_images.py
python sample_single_needles.py
python sample_multiple_needles.py
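`sample_stitched_images.py` tiles sub-images into an N×N grid. A minimal sketch of the stitching operation with PIL (the cell size and tile contents here are illustrative, not the script's actual parameters):

```python
from PIL import Image

def stitch(images, n, cell=256):
    """Tile a row-major list of n*n PIL images into one stitched image."""
    canvas = Image.new("RGB", (n * cell, n * cell))
    for idx, img in enumerate(images):
        r, c = divmod(idx, n)  # position (row r, column c) in the grid
        canvas.paste(img.resize((cell, cell)), (c * cell, r * cell))
    return canvas

# Example: stitch a 2x2 grid of solid-color tiles.
tiles = [Image.new("RGB", (64, 64), color) for color in ("red", "green", "blue", "white")]
stitched = stitch(tiles, n=2, cell=64)
```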
export BEGIN=0
export N_SEQ=1000
export N_NEEDLES=1
export MODEL_PROVIDER='Gemini'
bash test.sh
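BEGIN and N_SEQ select the slice of samples to evaluate, N_NEEDLES chooses the single- vs. multi-needle setting, and MODEL_PROVIDER selects the API backend. A hypothetical sketch of how a test script might read this configuration (the variable handling is illustrative; see `test.sh` for the actual usage):

```python
import os

def load_config():
    """Read the benchmark configuration from environment variables."""
    return {
        "begin": int(os.environ.get("BEGIN", "0")),
        "n_seq": int(os.environ.get("N_SEQ", "1000")),
        "n_needles": int(os.environ.get("N_NEEDLES", "1")),
        "provider": os.environ.get("MODEL_PROVIDER", "Gemini"),
    }

os.environ.update(BEGIN="0", N_SEQ="1000", N_NEEDLES="1", MODEL_PROVIDER="Gemini")
cfg = load_config()
```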
export BEGIN=0
export N_SEQ=1000
python evaluate.py
python evaluate_multi.py
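Evaluation checks whether the predicted needle location exactly matches the ground truth. A minimal sketch of exact-match accuracy over (image index, row, column) predictions (the answer format parsed here is an assumption for illustration, not necessarily `evaluate.py`'s actual parsing):

```python
import re

def parse_answer(text):
    """Extract the first (index, row, column) triple from a model response."""
    m = re.search(r"\((\d+)\s*,\s*(\d+)\s*,\s*(\d+)\)", text)
    return tuple(map(int, m.groups())) if m else None

def exact_accuracy(responses, targets):
    """Fraction of responses whose parsed triple exactly matches the target."""
    correct = sum(parse_answer(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)

acc = exact_accuracy(
    ["The needle is at (3, 1, 2).", "(0, 0, 0)", "not found"],
    [(3, 1, 2), (0, 1, 0), (2, 2, 2)],
)
```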
@misc{wang2024multimodal,
title={Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models},
author={Hengyi Wang and
Haizhou Shi and
Shiwei Tan and
Weiyi Qin and
Wenyuan Wang and
Tunyu Zhang and
Akshay Nambi and
Tanuja Ganu and
Hao Wang},
year={2024},
eprint={2406.11230},
archivePrefix={arXiv},
primaryClass={cs.LG}
}