support mmlongbench benchmark

zeyofu · Jul 12, 2024 · 868c16e
1 parent 027e38c
commit 868c16e
Showing 12 changed files with 493 additions and 43 deletions.
diff --git a/README.md b/README.md
@@ -25,6 +25,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
 
 ## 🆕 News
 
+- **[2024-07-12]** We have supported [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/), a benchmark for long-context document understanding, thanks to [**mayubo2333**](https://github.com/mayubo2333) 🔥🔥🔥
 - **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
 - **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
 - **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -34,7 +35,6 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
 - **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), it ranked the **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
 - **[2024-06-22]** Since GPT-3.5-Turbo-0613 is no longer supported yet, we switch to GPT-3.5-Turbo-0125 for choice extraction
 - **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092),  thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
-- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥
 
 ## 📊 Datasets, Models, and Evaluation Results
 
@@ -60,6 +60,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v)            | RealWorldQA | MCQ                                          | [**POPE**](https://github.com/AoiDragon/POPE) | POPE                                           | Y/N                                            |
 | [**Core-MM**](https://github.com/core-mm/core-mm)-          | CORE_MM | VQA                                               | [**MMT-Bench**](https://mmt-bench.github.io)                 | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI]                | MCQ  |
 | [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |
 
 **\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting
 

diff --git a/docs/ja/README_ja.md b/docs/ja/README_ja.md
@@ -48,6 +48,7 @@ PS: 日本語の README には最新のアップデートがすべて含まれ
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v)            | RealWorldQA | MCQ                                          | [**POPE**](https://github.com/AoiDragon/POPE) | POPE                                           | Y/N                                            |
 | [**Core-MM**](https://github.com/core-mm/core-mm)-          | CORE_MM | VQA                                               | [**MMT-Bench**](https://mmt-bench.github.io)                 | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI]                | MCQ  |
 | [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |
 
 **\*** ゼロショット設定で合理的な結果を出せないVLMの一部の評価結果のみを提供しています
 

diff --git a/docs/zh-CN/README_zh-CN.md b/docs/zh-CN/README_zh-CN.md
@@ -23,6 +23,7 @@
 
 ## 🆕 更新
 
+- **[2024-07-12]** 支持了多模态长文档内容理解基准 [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/), 感谢 [**mayubo2333**](https://github.com/mayubo2333) 🔥🔥🔥
 - **[2024-07-08]** 支持了 [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), 感谢 [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
 - **[2024-07-08]** 支持了 [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), 感谢 [**czczup**](https://github.com/czczup) 🔥🔥🔥
 - **[2024-06-27]** 支持了 [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -32,7 +33,6 @@
 - **[2024-06-24]** 支持了 [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet) 的评测，该模型在 [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 上**排名第二** 🔥🔥🔥
 - **[2024-06-22]** 由于 GPT-3.5-Turbo-0613 已被 OpenAI 废弃，我们改为使用 GPT-3.5-Turbo-0125 辅助答案提取
 - **[2024-06-18]** 支持了 [**SEEDBench2**](https://arxiv.org/abs/2311.17092)，感谢 [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
-- **[2024-06-18]** 支持了 [**MMT-Bench**](https://mmt-bench.github.io)，感谢 [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥
 
 ## 📊 评测结果，支持的数据集和模型 <a id="data-model-results"></a>
 ### 评测结果
@@ -57,6 +57,7 @@
 | [**RealWorldQA**](https://x.ai/blog/grok-1.5v)            | RealWorldQA | MCQ                                          | [**POPE**](https://github.com/AoiDragon/POPE) | POPE                                           | Y/N                                            |
 | [**Core-MM**](https://github.com/core-mm/core-mm)-          | CORE_MM | VQA                                               | [**MMT-Bench**](https://mmt-bench.github.io)                 | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI]                | MCQ      |
 | [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |
 
 **\*** 我们只提供了部分模型上的测试结果，剩余模型无法在 zero-shot 设定下测试出合理的精度
 

diff --git a/run.py b/run.py
@@ -65,6 +65,8 @@ def main():
 
         for _, dataset_name in enumerate(args.data):
             dataset_kwargs = {}
+            if dataset_name == 'MMLongBench_DOC':
+                dataset_kwargs['model'] = model_name
             if dataset_name == 'MMBench-Video':
                 dataset_kwargs['pack'] = args.pack
 
@@ -75,7 +77,7 @@ def main():
 
             dataset = build_dataset(dataset_name, **dataset_kwargs)
             if dataset is None:
-                logger.error(f'Dataset {dataset_name} is not valid,  will be skipped. ')
+                logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
                 continue
 
             result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
@@ -125,6 +127,8 @@ def main():
                     judge_kwargs['model'] = 'chatgpt-0125'
                 elif listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video'], dataset_name):
                     judge_kwargs['model'] = 'gpt-4-turbo'
+                elif listinstr(['MMLongBench'], dataset_name):
+                    judge_kwargs['model'] = 'gpt-4o'
             if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
                 judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
             if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
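
Note on the `run.py` hunks above: the change passes the model name to the dataset only for `MMLongBench_DOC` and selects `gpt-4o` as the judge for any dataset whose name contains `MMLongBench`. The following is a minimal standalone sketch of that selection logic, not the repository's code. `make_dataset_kwargs` and `pick_judge_model` are hypothetical helper names, and `listinstr` is re-implemented here under the assumption that it returns true when any listed string is a substring of the dataset name. The condition guarding the `chatgpt-0125` branch is not visible in the hunk, so it is left out.

```python
def listinstr(candidates, text):
    """Assumed behaviour: True if any candidate string occurs inside text."""
    return any(c in text for c in candidates)


def make_dataset_kwargs(dataset_name, model_name, pack=False):
    """Hypothetical helper mirroring the per-dataset kwargs set in main()."""
    kwargs = {}
    if dataset_name == 'MMLongBench_DOC':
        # Added in this commit: the dataset receives the model name.
        kwargs['model'] = model_name
    if dataset_name == 'MMBench-Video':
        kwargs['pack'] = pack
    return kwargs


def pick_judge_model(dataset_name, default=None):
    """Hypothetical helper mirroring the two judge branches shown in the diff."""
    if listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video'], dataset_name):
        return 'gpt-4-turbo'
    if listinstr(['MMLongBench'], dataset_name):
        # Added in this commit: substring match, so 'MMLongBench_DOC' qualifies.
        return 'gpt-4o'
    return default


if __name__ == '__main__':
    print(make_dataset_kwargs('MMLongBench_DOC', 'some_model'))
    print(pick_judge_model('MMLongBench_DOC'))
```

The dataset kwargs use an exact name comparison while the judge selection uses a substring match, so a future variant with a different suffix would presumably get the `gpt-4o` judge but not the `model` kwarg unless the first check is extended.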

diff --git a/vlmeval/dataset/__init__.py b/vlmeval/dataset/__init__.py
@@ -6,12 +6,12 @@
 from .image_mcq import ImageMCQDataset, MMMUDataset, CustomMCQDataset
 from .image_vqa import ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, CustomVQADataset
 from .mmbench_video import MMBenchVideo
-from .utils import build_judge, extract_answer_from_item, prefetch_answer, DEBUG_MESSAGE
+from .utils import *
 from ..smp import *
 
 DATASET_CLASSES = [
     ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, MMMUDataset,
-    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet,
+    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, MMLongBench,
     CustomVQADataset, MMBenchVideo
 ]
 
@@ -27,7 +27,7 @@ def build_dataset(dataset_name, **kwargs):
         return MMBenchVideo(dataset_name, **kwargs)
     datasets = [
         ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
-        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet
+        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MMLongBench
     ]
     for cls in datasets:
         if dataset_name in cls.supported_datasets():
@@ -55,7 +55,7 @@ def build_dataset(dataset_name, **kwargs):
 
 __all__ = [
     'MMBenchVideo', 'ImageYORNDataset', 'ImageMCQDataset', 'MMMUDataset',
-    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet',
+    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet', 'MMLongBench',
     'CustomMCQDataset', 'CustomVQADataset', 'build_dataset', 'img_root_map',
     'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
 ]
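
Note on the `vlmeval/dataset/__init__.py` hunks above: `MMLongBench` is registered by adding the class to the lists that `build_dataset` scans, and the first class whose `supported_datasets()` contains the requested name is used. The sketch below illustrates that registry pattern with toy classes. `ToyVQADataset`, `ToyLongDocDataset`, the explicit `registry` parameter, and the constructor signature are assumptions made for illustration; the hunk does not show how the matched class is actually instantiated.

```python
class ToyVQADataset:
    """Toy stand-in for a dataset class; not part of the repository."""

    @classmethod
    def supported_datasets(cls):
        return ['ToyVQA']

    def __init__(self, dataset, **kwargs):
        self.dataset_name = dataset
        self.kwargs = kwargs


class ToyLongDocDataset(ToyVQADataset):
    """Toy stand-in for a newly registered benchmark class."""

    @classmethod
    def supported_datasets(cls):
        return ['MMLongBench_DOC']


def build_dataset(dataset_name, registry, **kwargs):
    """Return an instance of the first class that supports dataset_name."""
    for cls in registry:
        if dataset_name in cls.supported_datasets():
            # Assumed constructor call; the real call is outside the hunk.
            return cls(dataset=dataset_name, **kwargs)
    # The caller in run.py logs an error and skips the dataset on None.
    return None


if __name__ == '__main__':
    registry = [ToyVQADataset, ToyLongDocDataset]
    ds = build_dataset('MMLongBench_DOC', registry, model='some_model')
    print(type(ds).__name__, ds.kwargs)
```

With this pattern, supporting a new benchmark only requires defining a class and appending it to the registry; the dispatch loop itself does not change.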