Commit
[Support] Support VCR benchmark
Include VCR in VLMEvalKit
junming-yang authored Jul 12, 2024
2 parents e5d67c4 + 7f9c71b commit 6dbb7af
Showing 11 changed files with 429 additions and 52 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/

## 🆕 News

- **[2024-07-12]** We have supported [**VCR**](https://github.com/tianyu-z/vcr), a benchmark for visual caption restoration evaluation, thanks to [**tianyu-z**](https://github.com/tianyu-z) and [**sheryc**](https://github.com/sheryc) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -34,7 +35,6 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
- **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
- **[2024-06-22]** Since GPT-3.5-Turbo-0613 is no longer supported, we have switched to GPT-3.5-Turbo-0125 for choice extraction
- **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥

## 📊 Datasets, Models, and Evaluation Results

@@ -60,6 +60,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting
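The new split names encode language, difficulty, and subset size: VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100], e.g. `VCR_EN_EASY_500` for the 500-sample English easy subset. Below is a minimal sketch of resolving such a name through the `build_dataset` helper this commit extends; the split name is real, everything else is illustrative rather than code from the repository:

```python
# Sketch: resolve one of the new VCR split names via build_dataset.
from vlmeval.dataset import build_dataset

dataset = build_dataset('VCR_EN_EASY_500')  # EN/ZH x EASY/HARD x _ALL/_500/_100
if dataset is None:
    # build_dataset returns None for names no dataset class claims
    raise ValueError('VCR_EN_EASY_500 is not recognized by this checkout')
print(type(dataset).__name__)  # expected: VCRDataset
```

A full evaluation would typically be launched through `run.py` instead, passing the same split name via its `--data` argument.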

1 change: 1 addition & 0 deletions docs/en/Development.md
@@ -19,6 +19,7 @@ Currently, we organize a benchmark as one single TSV file. During inference, the
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |
| VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>
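Because every benchmark is shipped as a single TSV file, the quickest way to see which of the fields in Table 1 a new split actually populates is to read it with pandas. A minimal sketch, assuming a locally downloaded copy (the file name is hypothetical):

```python
# Sketch: inspect the TSV fields of a downloaded benchmark file.
import pandas as pd

df = pd.read_csv('VCR_EN_EASY_500.tsv', sep='\t')  # hypothetical local path
print(df.columns.tolist())  # compare against the fields listed in Table 1
print(len(df))              # the _500 suffix marks a 500-sample subset
```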

1 change: 1 addition & 0 deletions docs/ja/README_ja.md
@@ -48,6 +48,7 @@ PS: The Japanese README may not include all the latest updates
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide a subset of the evaluation results, since some VLMs cannot yield reasonable results under the zero-shot setting

1 change: 1 addition & 0 deletions docs/zh-CN/Development_zh-CN.md
@@ -19,6 +19,7 @@
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |
| VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>

3 changes: 2 additions & 1 deletion docs/zh-CN/README_zh-CN.md
@@ -23,6 +23,7 @@

## 🆕 News

- **[2024-07-12]** We have supported [**VCR**](https://github.com/tianyu-z/vcr), a benchmark for visual caption restoration evaluation, thanks to [**tianyu-z**](https://github.com/tianyu-z) and [**sheryc**](https://github.com/sheryc) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -32,7 +33,6 @@
- **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
- **[2024-06-22]** Since GPT-3.5-Turbo-0613 has been deprecated by OpenAI, we have switched to GPT-3.5-Turbo-0125 to assist with answer extraction
- **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee) 🔥🔥🔥
- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing) 🔥🔥🔥

## 📊 Datasets, Models, and Evaluation Results <a id="data-model-results"></a>
### Evaluation Results
@@ -57,6 +57,7 @@
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide test results for a subset of models; the remaining models cannot produce reasonable accuracy under the zero-shot setting

5 changes: 3 additions & 2 deletions requirements.txt
@@ -1,13 +1,13 @@
einops
-gradio==4.15.0
+gradio
huggingface_hub
matplotlib
numpy>=1.23.4
omegaconf
openai==1.3.5
opencv-python>=4.4.0.46
openpyxl
-pandas>=1.5.3
+pandas
pillow
portalocker
protobuf
@@ -17,6 +17,7 @@ requests
rich
seaborn
sentencepiece
+setuptools
sty
tabulate
tiktoken
Expand Down
21 changes: 13 additions & 8 deletions run.py
@@ -1,10 +1,11 @@
import torch
import torch.distributed as dist
-from vlmeval.smp import *

+from vlmeval.config import supported_VLM
+from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
-from vlmeval.dataset import build_dataset
-from vlmeval.config import supported_VLM
+from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer


@@ -72,8 +73,8 @@ def main():
    if world_size > 1:
        dataset = build_dataset(dataset_name, **dataset_kwargs) if rank == 0 else None
        dist.barrier()
-
-    dataset = build_dataset(dataset_name, **dataset_kwargs)
+    else:
+        dataset = build_dataset(dataset_name, **dataset_kwargs)
    if dataset is None:
        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
        continue
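The restructured block above touches a common rank-0-first idiom: in the multi-process case, rank 0 builds the dataset first (performing any one-time download), while the barrier holds the other ranks until that finishes; the single-process path builds directly. A standalone sketch of the idiom, with the cache-reuse step for non-zero ranks made explicit (our generalization, not code from this commit):

```python
# Sketch of the rank-0-first idiom (assumes torch.distributed is initialized).
import torch.distributed as dist

def build_on_rank0_first(build_fn, rank: int, world_size: int):
    if world_size > 1:
        obj = build_fn() if rank == 0 else None
        dist.barrier()        # other ranks wait for rank 0's download/cache
        if obj is None:
            obj = build_fn()  # now served from the warmed local cache
        return obj
    return build_fn()         # single-process path builds directly
```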
@@ -133,17 +134,21 @@ def main():
    if rank == 0:
        if dataset_name in ['MMMU_TEST']:
            result_json = MMMU_result_transfer(result_file)
-            logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}')  # noqa: E501
+            logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
+                        f'json file saved in {result_json}')  # noqa: E501
            continue
        elif 'MMT-Bench_ALL' in dataset_name:
            submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
-            logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation (https://eval.ai/web/challenges/challenge-page/2328/overview), submission file saved in {submission_file}')  # noqa: E501
+            logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
+                        f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
+                        f'submission file saved in {submission_file}')  # noqa: E501
            continue
        elif 'MLLMGuard_DS' in dataset_name:
            logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')  # noqa: E501
            continue
        elif 'AesBench_TEST' == dataset_name:
-            logger.info(f'The results are saved in {result_file}. Please send it to the AesBench Team via huangyipo@hotmail.com.')  # noqa: E501
+            logger.info(f'The results are saved in {result_file}. '
+                        f'Please send it to the AesBench Team via huangyipo@hotmail.com.')  # noqa: E501
            continue

if dataset_name in [
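The rewrapped `logger.info` calls in this hunk rely on Python's implicit concatenation of adjacent string literals: the split f-strings merge into a single message at compile time, which keeps each source line under the flake8 E501 length limit without changing the logged text. A small self-contained check (the path is a placeholder):

```python
# Adjacent (f-)string literals inside one set of parentheses concatenate.
result_json = 'outputs/MMMU_TEST.json'  # placeholder value for illustration

single = f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}'
wrapped = (f'Transfer MMMU_TEST result to json for official evaluation, '
           f'json file saved in {result_json}')
assert single == wrapped
```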
8 changes: 4 additions & 4 deletions vlmeval/dataset/__init__.py
@@ -6,12 +6,12 @@
from .image_mcq import ImageMCQDataset, MMMUDataset, CustomMCQDataset
from .image_vqa import ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, CustomVQADataset
from .mmbench_video import MMBenchVideo
-from .utils import build_judge, extract_answer_from_item, prefetch_answer, DEBUG_MESSAGE
+from .utils import *
from ..smp import *

DATASET_CLASSES = [
    ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, MMMUDataset,
-    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet,
+    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, VCRDataset,
    CustomVQADataset, MMBenchVideo
]

@@ -27,7 +27,7 @@ def build_dataset(dataset_name, **kwargs):
        return MMBenchVideo(dataset_name, **kwargs)
    datasets = [
        ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
-        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet
+        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, VCRDataset,
    ]
    for cls in datasets:
        if dataset_name in cls.supported_datasets():
@@ -55,7 +55,7 @@ def build_dataset(dataset_name, **kwargs):

__all__ = [
    'MMBenchVideo', 'ImageYORNDataset', 'ImageMCQDataset', 'MMMUDataset',
-    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet',
+    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet', 'VCRDataset',
    'CustomMCQDataset', 'CustomVQADataset', 'build_dataset', 'img_root_map',
    'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
]
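Registering `VCRDataset` in three places (the class list, the `build_dataset` candidates, and `__all__`) follows the registry pattern this module already uses: each dataset class advertises the names it handles via `supported_datasets()`, and `build_dataset` instantiates the first class that claims the requested name. A stripped-down sketch of that dispatch (class body and split names abbreviated for illustration):

```python
# Minimal sketch of the supported_datasets() dispatch; not the repo's code.
class VCRDataset:
    @classmethod
    def supported_datasets(cls):
        # Abbreviated: the real class covers all EN/ZH x EASY/HARD x size splits.
        return ['VCR_EN_EASY_500', 'VCR_EN_EASY_100']

    def __init__(self, dataset_name, **kwargs):
        self.dataset_name = dataset_name

def build_dataset(dataset_name, **kwargs):
    for cls in [VCRDataset]:  # the real code iterates the full candidate list
        if dataset_name in cls.supported_datasets():
            return cls(dataset_name, **kwargs)
    return None  # caller logs an error and skips unknown names

assert isinstance(build_dataset('VCR_EN_EASY_500'), VCRDataset)
```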
