support mmlongbench benchmark
junming-yang committed Jul 12, 2024
1 parent 027e38c commit 868c16e
Showing 12 changed files with 493 additions and 43 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/

## 🆕 News

+- **[2024-07-12]** We have supported [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/), a benchmark for long-context document understanding, thanks to [**mayubo2333**](https://github.com/mayubo2333) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -34,7 +35,6 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
- **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
- **[2024-06-22]** Since GPT-3.5-Turbo-0613 is no longer supported, we have switched to GPT-3.5-Turbo-0125 for choice extraction
- **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
-- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥

## 📊 Datasets, Models, and Evaluation Results

@@ -60,6 +60,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |

**\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting

1 change: 1 addition & 0 deletions docs/ja/README_ja.md
@@ -48,6 +48,7 @@ PS: 日本語の README には最新のアップデートがすべて含まれ
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |

**\*** ゼロショット設定で合理的な結果を出せないVLMの一部の評価結果のみを提供しています

3 changes: 2 additions & 1 deletion docs/zh-CN/README_zh-CN.md
@@ -23,6 +23,7 @@

## 🆕 更新

+- **[2024-07-12]** 支持了多模态长文档内容理解基准 [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/), 感谢 [**mayubo2333**](https://github.com/mayubo2333) 🔥🔥🔥
- **[2024-07-08]** 支持了 [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), 感谢 [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** 支持了 [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), 感谢 [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** 支持了 [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -32,7 +33,6 @@
- **[2024-06-24]** 支持了 [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet) 的评测,该模型在 [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)**排名第二** 🔥🔥🔥
- **[2024-06-22]** 由于 GPT-3.5-Turbo-0613 已被 OpenAI 废弃,我们改为使用 GPT-3.5-Turbo-0125 辅助答案提取
- **[2024-06-18]** 支持了 [**SEEDBench2**](https://arxiv.org/abs/2311.17092),感谢 [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
-- **[2024-06-18]** 支持了 [**MMT-Bench**](https://mmt-bench.github.io),感谢 [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥

## 📊 评测结果,支持的数据集和模型 <a id="data-model-results"></a>
### 评测结果
@@ -57,6 +57,7 @@
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
+| [**MMLongBench-Doc**](https://mayubo2333.github.io/MMLongBench-Doc/)+ | MMLongBench_DOC | VQA | | | |

**\*** 我们只提供了部分模型上的测试结果,剩余模型无法在 zero-shot 设定下测试出合理的精度

6 changes: 5 additions & 1 deletion run.py
@@ -65,6 +65,8 @@ def main():

         for _, dataset_name in enumerate(args.data):
             dataset_kwargs = {}
+            if dataset_name == 'MMLongBench_DOC':
+                dataset_kwargs['model'] = model_name
             if dataset_name == 'MMBench-Video':
                 dataset_kwargs['pack'] = args.pack
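
Reviewer note: read on its own, this hunk forwards the model name to the dataset only for `MMLongBench_DOC`, next to the existing `pack` option for `MMBench-Video`. A minimal sketch of that branching follows. `make_dataset_kwargs` is a hypothetical helper written for illustration, not a function in the repository, and what the dataset class does with `model` is not visible in this excerpt.

```python
def make_dataset_kwargs(dataset_name, model_name, pack=False):
    """Hypothetical helper mirroring the per-dataset branches in run.py."""
    kwargs = {}
    # Only MMLongBench_DOC receives the name of the model under evaluation.
    if dataset_name == 'MMLongBench_DOC':
        kwargs['model'] = model_name
    # MMBench-Video keeps its existing `pack` switch.
    if dataset_name == 'MMBench-Video':
        kwargs['pack'] = pack
    return kwargs


if __name__ == '__main__':
    # 'some-vlm' is a placeholder model name, not one taken from the commit.
    print(make_dataset_kwargs('MMLongBench_DOC', 'some-vlm'))
    print(make_dataset_kwargs('MMBench-Video', 'some-vlm', pack=True))
```

Any other dataset name yields an empty dict, so the later `build_dataset(dataset_name, **dataset_kwargs)` call is unchanged for existing benchmarks.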

@@ -75,7 +77,7 @@ def main():

             dataset = build_dataset(dataset_name, **dataset_kwargs)
             if dataset is None:
-                logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
+                logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
                 continue

             result_file = f'{pred_root}/{model_name}_{dataset_name}.xlsx'
@@ -125,6 +127,8 @@ def main():
                     judge_kwargs['model'] = 'chatgpt-0125'
                 elif listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video'], dataset_name):
                     judge_kwargs['model'] = 'gpt-4-turbo'
+                elif listinstr(['MMLongBench'], dataset_name):
+                    judge_kwargs['model'] = 'gpt-4o'
             if 'OPENAI_API_KEY_JUDGE' in os.environ and len(os.environ['OPENAI_API_KEY_JUDGE']):
                 judge_kwargs['key'] = os.environ['OPENAI_API_KEY_JUDGE']
             if 'OPENAI_API_BASE_JUDGE' in os.environ and len(os.environ['OPENAI_API_BASE_JUDGE']):
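
Reviewer note: the new branch picks `gpt-4o` as the judge whenever the dataset name contains `MMLongBench`, and a non-empty `OPENAI_API_KEY_JUDGE` then supplies the judge key. The sketch below reproduces only the branches visible in this hunk. `pick_judge_kwargs` is a hypothetical name, and the local `listinstr` is a stand-in that assumes the repository helper performs a substring check. The condition guarding `chatgpt-0125` and the body of the `OPENAI_API_BASE_JUDGE` branch are cut off in the excerpt, so they are left out.

```python
import os


def listinstr(needles, text):
    # Stand-in for the repo helper of the same name (assumed behaviour):
    # True if any entry of `needles` occurs as a substring of `text`.
    return any(needle in text for needle in needles)


def pick_judge_kwargs(dataset_name, environ=os.environ):
    """Hypothetical helper covering only the branches shown in the hunk."""
    judge_kwargs = {}
    if listinstr(['MMVet', 'MathVista', 'LLaVABench', 'MMBench-Video'], dataset_name):
        judge_kwargs['model'] = 'gpt-4-turbo'
    elif listinstr(['MMLongBench'], dataset_name):
        judge_kwargs['model'] = 'gpt-4o'
    # An unset or empty variable leaves the key out of judge_kwargs.
    if environ.get('OPENAI_API_KEY_JUDGE'):
        judge_kwargs['key'] = environ['OPENAI_API_KEY_JUDGE']
    return judge_kwargs


if __name__ == '__main__':
    print(pick_judge_kwargs('MMLongBench_DOC', environ={}))
```

Because the match is by substring, any future dataset whose name contains `MMLongBench` would also be judged by `gpt-4o`.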
8 changes: 4 additions & 4 deletions vlmeval/dataset/__init__.py
@@ -6,12 +6,12 @@
 from .image_mcq import ImageMCQDataset, MMMUDataset, CustomMCQDataset
 from .image_vqa import ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, CustomVQADataset
 from .mmbench_video import MMBenchVideo
-from .utils import build_judge, extract_answer_from_item, prefetch_answer, DEBUG_MESSAGE
+from .utils import *
 from ..smp import *

 DATASET_CLASSES = [
     ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, MMMUDataset,
-    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet,
+    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, MMLongBench,
     CustomVQADataset, MMBenchVideo
 ]

@@ -27,7 +27,7 @@ def build_dataset(dataset_name, **kwargs):
         return MMBenchVideo(dataset_name, **kwargs)
     datasets = [
         ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
-        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet
+        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, MMLongBench
     ]
     for cls in datasets:
         if dataset_name in cls.supported_datasets():
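
Reviewer note: `build_dataset` appears to dispatch by asking each candidate class whether it supports the requested name, so registering `MMLongBench` only requires appending it to `datasets`. The sketch below illustrates that pattern with toy classes. Every name in it is hypothetical, and what the real function does after the `if` is cut off in this excerpt; the sketch assumes it instantiates the matching class and otherwise returns `None`, which is consistent with the `if dataset is None` check in `run.py`.

```python
class ToyShortVQA:
    """Hypothetical dataset class standing in for the real ones."""

    @classmethod
    def supported_datasets(cls):
        return ['ToyShortVQA_VAL']

    def __init__(self, dataset, **kwargs):
        self.dataset_name = dataset
        self.kwargs = kwargs  # extra options, e.g. model='...'


class ToyLongDoc(ToyShortVQA):
    """Second hypothetical class, playing the role of a newly added benchmark."""

    @classmethod
    def supported_datasets(cls):
        return ['ToyLongDoc']


def build_toy_dataset(dataset_name, **kwargs):
    # Try each registered class in order; the first one that claims the name wins.
    for cls in (ToyShortVQA, ToyLongDoc):
        if dataset_name in cls.supported_datasets():
            return cls(dataset=dataset_name, **kwargs)
    # Unknown names fall through to None so the caller can log and skip them.
    return None


if __name__ == '__main__':
    ds = build_toy_dataset('ToyLongDoc', model='some-vlm')
    print(type(ds).__name__, ds.kwargs)
```

With this layout, keyword arguments such as `model` pass straight through to the class constructor, which is why `run.py` can set them per dataset before calling `build_dataset`.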
@@ -55,7 +55,7 @@ def build_dataset(dataset_name, **kwargs):

 __all__ = [
     'MMBenchVideo', 'ImageYORNDataset', 'ImageMCQDataset', 'MMMUDataset',
-    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet',
+    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet', 'MMLongBench',
     'CustomMCQDataset', 'CustomVQADataset', 'build_dataset', 'img_root_map',
     'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
 ]