2024.06.20
🌟 Benchmark, evaluation code, training data, and model are released!
We introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench covers three event understanding abilities and six event-related tasks, and contains 2,190 test instances to comprehensively evaluate the ability to understand video events.
Event-Bench enables a systematic comparison of existing video MLLMs across these capabilities and highlights the major shortcomings of open-source MLLMs.
Download the raw videos of Event-Bench from the Google Drive link, and download the annotations of Event-Bench from the Hugging Face link.
License:
Event-Bench may be used for academic research only. Commercial use in any form is prohibited.
Prompt:
The common prompt used in our evaluation follows this format:
```
<QUESTION>
A. <OPTION1>
B. <OPTION2>
C. <OPTION3>
D. <OPTION4>
Answer with the option's letter from the given choices directly.
```
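As an illustration, here is a minimal Python sketch that assembles a prompt in this format from a question and four options. The function and variable names are our own for illustration and are not part of the benchmark code:

```python
def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question in the Event-Bench prompt style."""
    letters = ["A", "B", "C", "D"]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer with the option's letter from the given choices directly.")
    return "\n".join(lines)

# Example (hypothetical question):
# print(build_prompt("What happens after the man opens the door?",
#                    ["He sits down", "He leaves", "He waves", "He eats"]))
```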
Evaluation:
We recommend saving the inference results in the same format as example_result.jsonl. Once the model responses are prepared in this format, run our evaluation script evaluate_em.py to obtain the accuracy scores.
```bash
python evaluate_em.py \
    --path $RESULTS_FILE
```
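For reference, exact-match scoring over such a JSONL file can be sketched as follows. This is an illustrative re-implementation under assumed field names (`prediction` and `answer`), not the repository's actual evaluate_em.py; consult example_result.jsonl for the real schema:

```python
import json

def exact_match_accuracy(path: str) -> float:
    """Compute option-letter accuracy over a JSONL results file.

    Assumes each line holds a JSON object with 'prediction' and 'answer'
    fields (hypothetical names; see example_result.jsonl for the schema).
    """
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Compare the first letter of the model output to the gold option.
            pred = record["prediction"].strip().upper()[:1]
            gold = record["answer"].strip().upper()[:1]
            correct += int(pred == gold)
            total += 1
    return correct / total if total else 0.0
```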
If you want to use GPT-4-turbo for evaluation, run the script evaluate_gpt.py instead:
```bash
python evaluate_gpt.py \
    --input_file $INPUT_FILE \
    --output_file $OUTPUT_FILE
```
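Conceptually, GPT-based evaluation asks a judge model whether a free-form response matches the reference answer. Below is a minimal sketch using the OpenAI Python SDK; the judging prompt and field handling are our assumptions, not the actual logic of evaluate_gpt.py:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_response(question: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4-turbo whether a model response matches the reference answer.

    The judging prompt below is illustrative; the official evaluate_gpt.py
    may use different instructions and output parsing.
    """
    completion = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Reference answer: {reference}\n"
                f"Model response: {prediction}\n"
                "Does the model response match the reference answer? "
                "Reply with exactly 'yes' or 'no'."
            ),
        }],
    )
    return completion.choices[0].message.content.strip().lower().startswith("yes")
```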
- Evaluation results of different video MLLMs.
Citation:
If you find our work helpful for your research, please consider citing it:
```bibtex
@misc{du2024eventoriented,
      title={Towards Event-oriented Long Video Understanding},
      author={Yifan Du and Kun Zhou and Yuqi Huo and Yifan Li and Wayne Xin Zhao and Haoyu Lu and Zijia Zhao and Bingning Wang and Weipeng Chen and Ji-Rong Wen},
      year={2024},
      eprint={2406.14129},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```