Commit
[Support] Support VCR benchmark
Include VCR in VLMEvalKit
junming-yang authored Jul 12, 2024
2 parents e5d67c4 + 7f9c71b commit 6dbb7af
Showing 11 changed files with 429 additions and 52 deletions.
3 changes: 2 additions & 1 deletion README.md
@@ -25,6 +25,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/

## 🆕 News

- **[2024-07-12]** We have supported [**VCR**](https://github.com/tianyu-z/vcr), a benchmark for visual caption restoration evaluation, thanks to [**tianyu-z**](https://github.com/tianyu-z) and [**sheryc**](https://github.com/sheryc) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -34,7 +35,6 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
- **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
- **[2024-06-22]** Since GPT-3.5-Turbo-0613 is no longer supported, we have switched to GPT-3.5-Turbo-0125 for choice extraction
- **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee)🔥🔥🔥
- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing)🔥🔥🔥

## 📊 Datasets, Models, and Evaluation Results

@@ -60,6 +60,7 @@ English | [<a href="/docs/zh-CN/README_zh-CN.md">简体中文</a>] | [<a href="/
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide a subset of the evaluation results, since some VLMs do not yield reasonable results under the zero-shot setting
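The new split names encode language, difficulty, and subset size: VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100], e.g. `VCR_EN_EASY_500` for the 500-sample English easy subset. Below is a minimal sketch of resolving such a name through the `build_dataset` helper this commit extends; the split name is real, everything else is illustrative rather than code from the repository:

```python
# Sketch: resolve one of the new VCR split names via build_dataset.
from vlmeval.dataset import build_dataset

dataset = build_dataset('VCR_EN_EASY_500')  # EN/ZH x EASY/HARD x _ALL/_500/_100
if dataset is None:
    # build_dataset returns None for names no dataset class claims
    raise ValueError('VCR_EN_EASY_500 is not recognized by this checkout')
print(type(dataset).__name__)  # expected: VCRDataset
```

A full evaluation would typically be launched through `run.py` instead, passing the same split name via its `--data` argument.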

1 change: 1 addition & 0 deletions docs/en/Development.md
@@ -19,6 +19,7 @@ Currently, we organize a benchmark as one single TSV file. During inference, the
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |
| VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>
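Because every benchmark is shipped as a single TSV file, the quickest way to see which of the fields in Table 1 a new split actually populates is to read it with pandas. A minimal sketch, assuming a locally downloaded copy (the file name is hypothetical):

```python
# Sketch: inspect the TSV fields of a downloaded benchmark file.
import pandas as pd

df = pd.read_csv('VCR_EN_EASY_500.tsv', sep='\t')  # hypothetical local path
print(df.columns.tolist())  # compare against the fields listed in Table 1
print(len(df))              # the _500 suffix marks a 500-sample subset
```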

1 change: 1 addition & 0 deletions docs/ja/README_ja.md
@@ -48,6 +48,7 @@ PS: The Japanese README may not include all the latest updates
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide a subset of the evaluation results, since some VLMs cannot yield reasonable results under the zero-shot setting

1 change: 1 addition & 0 deletions docs/zh-CN/Development_zh-CN.md
@@ -19,6 +19,7 @@
| COCO_VAL ||| | | | || | | |
| OCRVQA_[TEST/TESTCORE] ||| || | || | | |
| TextVQA_VAL ||| || | || | | |
| VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] ||| || | || | | |

<div align="center"><b>Table 1. TSV fields of supported datasets.</b></div>

3 changes: 2 additions & 1 deletion docs/zh-CN/README_zh-CN.md
@@ -23,6 +23,7 @@

## 🆕 News

- **[2024-07-12]** We have supported [**VCR**](https://github.com/tianyu-z/vcr), a benchmark for visual caption restoration evaluation, thanks to [**tianyu-z**](https://github.com/tianyu-z) and [**sheryc**](https://github.com/sheryc) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternLM-XComposer-2.5**](https://github.com/InternLM/InternLM-XComposer), thanks to [**LightDXY**](https://github.com/LightDXY) 🔥🔥🔥
- **[2024-07-08]** We have supported [**InternVL2**](https://huggingface.co/OpenGVLab/InternVL2-26B), thanks to [**czczup**](https://github.com/czczup) 🔥🔥🔥
- **[2024-06-27]** We have supported [**Cambrian**](https://cambrian-mllm.github.io/) 🔥🔥🔥
@@ -32,7 +33,6 @@
- **[2024-06-24]** We have supported the evaluation of [**Claude3.5-Sonnet**](https://www.anthropic.com/news/claude-3-5-sonnet), which ranked **2nd** on the [**Open VLM Leaderboard**](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard) 🔥🔥🔥
- **[2024-06-22]** Since GPT-3.5-Turbo-0613 has been deprecated by OpenAI, we have switched to GPT-3.5-Turbo-0125 to assist with answer extraction
- **[2024-06-18]** We have supported [**SEEDBench2**](https://arxiv.org/abs/2311.17092), thanks to [**Bohao-Lee**](https://github.com/Bohao-Lee) 🔥🔥🔥
- **[2024-06-18]** We have supported [**MMT-Bench**](https://mmt-bench.github.io), thanks to [**KainingYing**](https://github.com/KainingYing) 🔥🔥🔥

## 📊 Datasets, Models, and Evaluation Results <a id="data-model-results"></a>
### Evaluation Results
@@ -57,6 +57,7 @@
| [**RealWorldQA**](https://x.ai/blog/grok-1.5v) | RealWorldQA | MCQ | [**POPE**](https://github.com/AoiDragon/POPE) | POPE | Y/N |
| [**Core-MM**](https://github.com/core-mm/core-mm)- | CORE_MM | VQA | [**MMT-Bench**](https://mmt-bench.github.io) | MMT-Bench_[VAL/VAL_MI/ALL/ALL_MI] | MCQ |
| [**MLLMGuard**](https://github.com/Carol-gutianle/MLLMGuard) - | MLLMGuard_DS | VQA | [**AesBench**](https://github.com/yipoh/AesBench) | AesBench_[VAL/TEST] | MCQ |
| [**VCR-wiki**](https://huggingface.co/datasets/vcr-org/) | VCR_[EN/ZH]_[EASY/HARD][_ALL/_500/_100] | VCR | | | |

**\*** We only provide test results for a subset of models; the remaining models cannot produce reasonable accuracy under the zero-shot setting

5 changes: 3 additions & 2 deletions requirements.txt
@@ -1,13 +1,13 @@
einops
-gradio==4.15.0
+gradio
huggingface_hub
matplotlib
numpy>=1.23.4
omegaconf
openai==1.3.5
opencv-python>=4.4.0.46
openpyxl
-pandas>=1.5.3
+pandas
pillow
portalocker
protobuf
@@ -17,6 +17,7 @@ requests
rich
seaborn
sentencepiece
+setuptools
sty
tabulate
tiktoken
Expand Down
21 changes: 13 additions & 8 deletions run.py
@@ -1,10 +1,11 @@
import torch
import torch.distributed as dist
-from vlmeval.smp import *

+from vlmeval.config import supported_VLM
+from vlmeval.dataset import build_dataset
from vlmeval.inference import infer_data_job
from vlmeval.inference_video import infer_data_job_video
-from vlmeval.dataset import build_dataset
-from vlmeval.config import supported_VLM
+from vlmeval.smp import *
from vlmeval.utils.result_transfer import MMMU_result_transfer, MMTBench_result_transfer


@@ -72,8 +73,8 @@ def main():
    if world_size > 1:
        dataset = build_dataset(dataset_name, **dataset_kwargs) if rank == 0 else None
        dist.barrier()
-
-    dataset = build_dataset(dataset_name, **dataset_kwargs)
+    else:
+        dataset = build_dataset(dataset_name, **dataset_kwargs)
    if dataset is None:
        logger.error(f'Dataset {dataset_name} is not valid, will be skipped. ')
        continue
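The restructured block above touches a common rank-0-first idiom: in the multi-process case, rank 0 builds the dataset first (performing any one-time download), while the barrier holds the other ranks until that finishes; the single-process path builds directly. A standalone sketch of the idiom, with the cache-reuse step for non-zero ranks made explicit (our generalization, not code from this commit):

```python
# Sketch of the rank-0-first idiom (assumes torch.distributed is initialized).
import torch.distributed as dist

def build_on_rank0_first(build_fn, rank: int, world_size: int):
    if world_size > 1:
        obj = build_fn() if rank == 0 else None
        dist.barrier()        # other ranks wait for rank 0's download/cache
        if obj is None:
            obj = build_fn()  # now served from the warmed local cache
        return obj
    return build_fn()         # single-process path builds directly
```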
@@ -133,17 +134,21 @@ def main():
    if rank == 0:
        if dataset_name in ['MMMU_TEST']:
            result_json = MMMU_result_transfer(result_file)
-            logger.info(f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}')  # noqa: E501
+            logger.info(f'Transfer MMMU_TEST result to json for official evaluation, '
+                        f'json file saved in {result_json}')  # noqa: E501
            continue
        elif 'MMT-Bench_ALL' in dataset_name:
            submission_file = MMTBench_result_transfer(result_file, **judge_kwargs)
-            logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation (https://eval.ai/web/challenges/challenge-page/2328/overview), submission file saved in {submission_file}')  # noqa: E501
+            logger.info(f'Extract options from prediction of MMT-Bench FULL split for official evaluation '
+                        f'(https://eval.ai/web/challenges/challenge-page/2328/overview), '
+                        f'submission file saved in {submission_file}')  # noqa: E501
            continue
        elif 'MLLMGuard_DS' in dataset_name:
            logger.info('The evaluation of MLLMGuard_DS is not supported yet. ')  # noqa: E501
            continue
        elif 'AesBench_TEST' == dataset_name:
-            logger.info(f'The results are saved in {result_file}. Please send it to the AesBench Team via huangyipo@hotmail.com.')  # noqa: E501
+            logger.info(f'The results are saved in {result_file}. '
+                        f'Please send it to the AesBench Team via huangyipo@hotmail.com.')  # noqa: E501
            continue

if dataset_name in [
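The rewrapped `logger.info` calls in this hunk rely on Python's implicit concatenation of adjacent string literals: the split f-strings merge into a single message at compile time, which keeps each source line under the flake8 E501 length limit without changing the logged text. A small self-contained check (the path is a placeholder):

```python
# Adjacent (f-)string literals inside one set of parentheses concatenate.
result_json = 'outputs/MMMU_TEST.json'  # placeholder value for illustration

single = f'Transfer MMMU_TEST result to json for official evaluation, json file saved in {result_json}'
wrapped = (f'Transfer MMMU_TEST result to json for official evaluation, '
           f'json file saved in {result_json}')
assert single == wrapped
```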
8 changes: 4 additions & 4 deletions vlmeval/dataset/__init__.py
@@ -6,12 +6,12 @@
from .image_mcq import ImageMCQDataset, MMMUDataset, CustomMCQDataset
from .image_vqa import ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, CustomVQADataset
from .mmbench_video import MMBenchVideo
-from .utils import build_judge, extract_answer_from_item, prefetch_answer, DEBUG_MESSAGE
+from .utils import *
from ..smp import *

DATASET_CLASSES = [
    ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, MMMUDataset,
-    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet,
+    CustomMCQDataset, ImageVQADataset, OCRBench, MathVista, LLaVABench, MMVet, VCRDataset,
    CustomVQADataset, MMBenchVideo
]

@@ -27,7 +27,7 @@ def build_dataset(dataset_name, **kwargs):
        return MMBenchVideo(dataset_name, **kwargs)
    datasets = [
        ImageCaptionDataset, ImageYORNDataset, ImageMCQDataset, ImageVQADataset,
-        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet
+        MMMUDataset, OCRBench, MathVista, LLaVABench, MMVet, VCRDataset,
    ]
    for cls in datasets:
        if dataset_name in cls.supported_datasets():
@@ -55,7 +55,7 @@ def build_dataset(dataset_name, **kwargs):

__all__ = [
    'MMBenchVideo', 'ImageYORNDataset', 'ImageMCQDataset', 'MMMUDataset',
-    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet',
+    'ImageCaptionDataset', 'ImageVQADataset', 'OCRBench', 'MathVista', 'LLaVABench', 'MMVet', 'VCRDataset',
    'CustomMCQDataset', 'CustomVQADataset', 'build_dataset', 'img_root_map',
    'build_judge', 'extract_answer_from_item', 'prefetch_answer', 'DEBUG_MESSAGE'
]
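Registering `VCRDataset` in three places (the class list, the `build_dataset` candidates, and `__all__`) follows the registry pattern this module already uses: each dataset class advertises the names it handles via `supported_datasets()`, and `build_dataset` instantiates the first class that claims the requested name. A stripped-down sketch of that dispatch (class body and split names abbreviated for illustration):

```python
# Minimal sketch of the supported_datasets() dispatch; not the repo's code.
class VCRDataset:
    @classmethod
    def supported_datasets(cls):
        # Abbreviated: the real class covers all EN/ZH x EASY/HARD x size splits.
        return ['VCR_EN_EASY_500', 'VCR_EN_EASY_100']

    def __init__(self, dataset_name, **kwargs):
        self.dataset_name = dataset_name

def build_dataset(dataset_name, **kwargs):
    for cls in [VCRDataset]:  # the real code iterates the full candidate list
        if dataset_name in cls.supported_datasets():
            return cls(dataset_name, **kwargs)
    return None  # caller logs an error and skips unknown names

assert isinstance(build_dataset('VCR_EN_EASY_500'), VCRDataset)
```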
