Feat/open compass #1213

Merged: 29 commits merged on Jun 29, 2024

Changes from all commits (29 commits)
- f27bf0e wip (tastelikefeet, Jun 21, 2024)
- 03ec157 test code (tastelikefeet, Jun 22, 2024)
- 8fcb675 Merge commit '38e4d96bdab88f984b7f3bf8f94453f6ae63fac3' into feat/ope… (tastelikefeet, Jun 22, 2024)
- 575586f wip (tastelikefeet, Jun 24, 2024)
- e574fbc lint (tastelikefeet, Jun 24, 2024)
- 52febaf wip (tastelikefeet, Jun 26, 2024)
- e881688 fix (tastelikefeet, Jun 26, 2024)
- b00fe28 wip (tastelikefeet, Jun 26, 2024)
- fea533b fix (tastelikefeet, Jun 26, 2024)
- ac5e835 fix (tastelikefeet, Jun 26, 2024)
- b04b278 fix comments (tastelikefeet, Jun 27, 2024)
- 45b0a5d fix (tastelikefeet, Jun 27, 2024)
- d49183c fix (tastelikefeet, Jun 27, 2024)
- 42e9397 Merge branch 'main' into feat/open-compass (tastelikefeet, Jun 27, 2024)
- a62dd57 fix (tastelikefeet, Jun 28, 2024)
- e96d58a fix note (tastelikefeet, Jun 28, 2024)
- 62bceab fix (tastelikefeet, Jun 28, 2024)
- 288b23e Merge branch 'main' into feat/open-compass (tastelikefeet, Jun 28, 2024)
- 4e832b1 lint (tastelikefeet, Jun 28, 2024)
- e682ab0 update docs (tastelikefeet, Jun 28, 2024)
- 09a42ab add limit (tastelikefeet, Jun 29, 2024)
- 5e3ca9c fix (tastelikefeet, Jun 29, 2024)
- 6d1cea8 Merge commit '78159fd8fb158dda9e78ec2bf2097e0447ff4da1' into feat/ope… (tastelikefeet, Jun 29, 2024)
- da5e1df reset snapshot_download to modelscope (tastelikefeet, Jun 29, 2024)
- 59a4e34 update doc (tastelikefeet, Jun 29, 2024)
- d9db0c0 Merge commit 'cc2d8fbdc4ffeb9b15da8a5bb4b4543d9726399a' into feat/ope… (tastelikefeet, Jun 29, 2024)
- a22d134 no message (tastelikefeet, Jun 29, 2024)
- 1fe67f9 fix (tastelikefeet, Jun 29, 2024)
- 440c7c6 add doc (tastelikefeet, Jun 29, 2024)
5 changes: 3 additions & 2 deletions README.md
@@ -47,6 +47,7 @@ SWIFT has rich documentation for users, please check [here](https://github.com/
SWIFT web-ui is available both on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary), please feel free to try!

## 🎉 News
- 🔥2024.06.29: Support [eval-scope](https://github.com/modelscope/eval-scope) & [open-compass](https://github.com/open-compass/opencompass) for evaluation! We now support over 50 eval datasets such as `BoolQ, ocnli, humaneval, math, ceval, mmlu, gsm8k, ARC_e`; please check our [Eval Doc](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/LLM-eval.md) to get started! In the next sprint we will support multi-modal and Agent evaluation, remember to follow us : )
- 🔥2024.06.28: Support for **Florence** series model! See [document](docs/source_en/Multi-Modal/florence-best-pratice.md)
- 🔥2024.06.28: Support for Gemma2 series models: gemma2-9b, gemma2-9b-instruct, gemma2-27b, gemma2-27b-instruct.
- 🔥2024.06.18: Supports **DeepSeek-Coder-v2** series model! Use model_type `deepseek-coder-v2-instruct` and `deepseek-coder-v2-lite-instruct` to begin.
@@ -449,13 +450,13 @@ Original model:
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
+    --eval_dataset ARC_e --infer_backend vllm
```

LoRA fine-tuned:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
+    --eval_dataset ARC_e --infer_backend vllm \
--merge_lora true \
```

5 changes: 3 additions & 2 deletions README_CN.md
@@ -48,6 +48,7 @@ SWIFT has a rich documentation system; if you run into any problems, please check [here](https:
You can try the SWIFT web-ui on [Huggingface space](https://huggingface.co/spaces/tastelikefeet/swift) and [ModelScope studio](https://www.modelscope.cn/studios/iic/Scalable-lightWeight-Infrastructure-for-Fine-Tuning/summary).

## 🎉 News
- 🔥2024.06.29: Support for [eval-scope](https://github.com/modelscope/eval-scope) & [open-compass](https://github.com/open-compass/opencompass) evaluation! The evaluation pipeline covers 50+ standard datasets including `BoolQ, ocnli, humaneval, math, ceval, mmlu, gsm8k, ARC_e`; see our [evaluation documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM评测文档.md) to get started. In the next iteration we will support multi-modal and Agent evaluation, stay tuned : )
- 🔥2024.06.28: Support for the **Florence** series models: see the [Florence best practice](docs/source/Multi-Modal/florence最佳实践.md).
- 🔥2024.06.28: Support for the **Gemma2** series models: gemma2-9b, gemma2-9b-instruct, gemma2-27b, gemma2-27b-instruct.
- 🔥2024.06.18: Support for the **DeepSeek-Coder-v2** series models! Use model_type `deepseek-coder-v2-instruct` and `deepseek-coder-v2-lite-instruct` for training and inference.
@@ -445,13 +446,13 @@ CUDA_VISIBLE_DEVICES=0 swift infer \
```shell
# We recommend using vLLM for acceleration (arc evaluated in half a minute):
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen1half-7b-chat \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
+    --eval_dataset ARC_e --infer_backend vllm
```

After LoRA fine-tuning:
```shell
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir xxx/checkpoint-xxx \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
+    --eval_dataset ARC_e --infer_backend vllm \
--merge_lora true \
```

50 changes: 17 additions & 33 deletions docs/source/LLM/LLM评测文档.md
@@ -11,37 +11,16 @@ SWIFT supports the eval (evaluation) capability, used to evaluate both the original model and the fine-tuned model

## Capability Introduction

-SWIFT's eval capability uses the ModelScope community's [evaluation framework EvalScope](https://github.com/modelscope/eval-scope), with a high-level wrapper to support the evaluation needs of all kinds of models. We currently support evaluation workflows for **standard evaluation sets** as well as for **user-defined** evaluation sets. The **standard evaluation sets** include:
+SWIFT's eval capability uses the ModelScope community's [evaluation framework EvalScope](https://github.com/modelscope/eval-scope) together with [Open-Compass](https://hub.opencompass.org.cn/home), with a high-level wrapper to support the evaluation needs of all kinds of models. We currently support evaluation workflows for **standard evaluation sets** as well as for **user-defined** evaluation sets. The **standard evaluation sets** include:

-- MMLU

-> MMLU (Massive Multitask Language Understanding) aims to measure the knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings, which makes the benchmark more challenging and closer to how we evaluate humans. The benchmark covers 57 subjects across STEM, the humanities, the social sciences, and more. Its difficulty ranges from elementary to advanced professional level, and it tests both world knowledge and problem-solving ability. Subjects range from traditional areas such as mathematics and history to more specialized ones such as law and ethics. The granularity and breadth of the subjects make the benchmark ideal for identifying a model's blind spots.
->
-> MMLU is an **English multiple-choice QA benchmark of 57 tasks** (a diversity benchmark), covering elementary mathematics, US history, computer science, law and more, with difficulty ranging from high-school to expert level; it is one of the mainstream LLM evaluation datasets.

-- CEVAL

-> C-EVAL is the first comprehensive Chinese evaluation suite, designed to assess the advanced knowledge and reasoning abilities of foundation models in a Chinese context. C-EVAL consists of multiple-choice questions at four difficulty levels: middle school, high school, college, and professional. The questions span 52 different subject areas, from the humanities to science and engineering. C-EVAL also comes with C-EVAL HARD, a very challenging subset of C-EVAL topics that requires advanced reasoning to solve.

-- GSM8K

-> GSM8K (Grade School Math 8K) is a dataset of 8.5K high-quality, linguistically diverse grade-school math word problems. The dataset was created to support question answering on basic math problems that require multi-step reasoning.
->
-> GSM8K is a high-quality English grade-school math test set containing 7.5K training samples and 1K test samples. The problems typically take 2-8 steps to solve, effectively evaluating mathematical and logical ability.

-- ARC

-> The AI2 Reasoning Challenge (**arc**) dataset is a multiple-choice question-answering dataset containing questions from grade 3 to grade 9 science exams. It is split into two partitions, Easy and Challenge, where the latter contains harder questions that require reasoning. Most questions have 4 answer options; fewer than 1% have 3 or 5. ARC also comes with a supporting knowledge base of 14.3 million unstructured text sentences.
+```text
+'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
+```

-- BBH
+For detailed descriptions of these datasets, see: https://hub.opencompass.org.cn/home

-> BBH (BIG-Bench Hard) is a dataset of 23 challenging tasks selected from the BIG-Bench evaluation suite.
->
-> BIG-Bench is a diverse test suite designed to evaluate the capabilities of language models, containing a wide range of tasks believed to be beyond the abilities of current language models. In the original BIG-Bench paper, the researchers found that the state-of-the-art language models of the time could beat the average human rater with few-shot prompting on only 65% of the tasks.
->
-> The researchers therefore selected from BIG-Bench the 23 particularly hard tasks on which language models failed to outperform humans and built the BBH dataset. These 23 tasks are considered representative problems on which language models still struggle. On BBH, the researchers evaluated how much chain-of-thought prompting improves language-model performance.
->
-> Overall, the BBH dataset contains the 23 most challenging tasks from BIG-Bench and is intended to probe the limits of language models on complex multi-step reasoning. Experiments on BBH let researchers see how much prompting strategies such as chain-of-thought help model performance.
+> The dataset files are downloaded automatically the first time an evaluation runs: https://www.modelscope.cn/datasets/swift/evalscope_resource/files
+> If the download fails, you can download the files manually and place them at a local path; see the eval log output for details.
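If the automatic download keeps failing, a manual fetch along the following lines may work. This is only a sketch: it assumes the resource repository can be cloned over git, and the destination directory is an arbitrary example, so take the path the evaluation actually expects from the eval log output.

```shell
# Hypothetical manual download of the evaluation resource files from ModelScope.
# The destination directory below is an example only; check the eval logs for the expected local path.
git lfs install
git clone https://www.modelscope.cn/datasets/swift/evalscope_resource.git ./evalscope_resource
```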

## Environment Preparation

@@ -64,16 +43,21 @@ pip install -e '.[eval]'
```shell
# Original model (takes about half an hour on a single A100)
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm
+    --eval_dataset ARC_e --infer_backend vllm

# After LoRA fine-tuning
CUDA_VISIBLE_DEVICES=0 swift eval --ckpt_dir qwen2-7b-instruct/vx-xxx/checkpoint-xxx \
-    --eval_dataset ceval mmlu arc gsm8k --infer_backend vllm \
+    --eval_dataset ARC_e --infer_backend vllm \
--merge_lora true \
```

The list of evaluation parameters can be found [here](./命令行参数.md#eval参数).

Note: the evaluation results are stored under {--eval_output_dir}/{--name}/{timestamp}. If you do not change the storage configuration, the default path is:
```text
<current directory (`pwd`)>/eval_outputs/default/20240628_190000/xxx
```


### Evaluating with a Deployed Model

@@ -83,7 +67,7 @@ CUDA_VISIBLE_DEVICES=0 swift deploy --model_type qwen2-7b-instruct

# Evaluate via the API
# For a non-swift deployment, additionally pass `--eval_is_chat_model true --model_type qwen2-7b-instruct`
-swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ceval mmlu arc gsm8k
+swift eval --eval_url http://127.0.0.1:8000/v1 --eval_dataset ARC_e

# The same applies to the LoRA fine-tuned model
```
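For an endpoint that was not deployed by swift, the comment above says to pass `--eval_is_chat_model` and `--model_type` explicitly; a minimal sketch (the URL and model type are placeholders) might look like:

```shell
# Hypothetical evaluation of an externally deployed OpenAI-compatible endpoint.
# The URL and model_type below are placeholders for your own deployment.
swift eval --eval_url http://127.0.0.1:8000/v1 \
    --eval_is_chat_model true --model_type qwen2-7b-instruct \
    --eval_dataset ARC_e
```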
@@ -144,13 +128,13 @@ default.jsonl
{
"name": "custom_general_qa", # 评测项名称,可以随意指定
"pattern": "general_qa", # 该评测集的pattern
"dataset": "eval_example/custom_general_qa", # 该评测集的目录
"dataset": "eval_example/custom_general_qa", # 该评测集的目录,强烈建议使用绝对路径防止读取失败
"subset_list": ["default"] # 需要评测的子数据集,即上面的`default_x`文件名
},
{
"name": "custom_ceval",
"pattern": "ceval",
"dataset": "eval_example/custom_ceval",
"dataset": "eval_example/custom_ceval", # 该评测集的目录,强烈建议使用绝对路径防止读取失败
"subset_list": ["default"]
}
]
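With a configuration like the one above saved locally, a run might look like the following sketch; the JSON path and the model type are placeholders, not values from this PR:

```shell
# Hypothetical invocation using the custom evaluation config defined above.
# Replace the config path and model_type with your own values.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --custom_eval_config /absolute/path/to/custom_eval_config.json
```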
14 changes: 10 additions & 4 deletions docs/source/LLM/命令行参数.md
@@ -323,10 +323,13 @@ export parameters inherit from the infer parameters; in addition, the following parameters are added:

The eval parameters inherit from the infer parameters; in addition, the following parameters are added: (Note: the generation_config parameters from infer will not take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope).)

-- `--eval_dataset`: the official datasets to evaluate, default `['ceval', 'gsm8k', 'arc']`. Possible values are: 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If you only need to evaluate custom datasets, set this parameter to `no`.
-- `--eval_few_shot`: the few-shot number for each sub-dataset of every evaluation set, default `None`, i.e. the dataset's default configuration is used.
-- `--eval_limit`: the number of samples taken from each sub-dataset of every evaluation set, default `None`, meaning full evaluation.
-- `--name`: used to distinguish the result storage paths of evaluations with the same configuration; defaults to the current time.
+- `--eval_dataset`: the official datasets to evaluate. The default is empty, which means all datasets. Note that this parameter is ignored when custom_eval_config is specified.
+```text
+Currently supported datasets include: 'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
+```
+- `--eval_few_shot`: the few-shot number for each sub-dataset of every evaluation set, default `None`, i.e. the dataset's default configuration is used. **This parameter is deprecated for now.**
+- `--eval_limit`: the number of samples taken from each sub-dataset of every evaluation set, default `None`, meaning full evaluation. You can pass an integer (the number of samples per dataset) or a string such as `[10:20]` (a slice).
+- `--name`: used to distinguish the result storage paths of evaluations with the same configuration, as `{eval_output_dir}/{name}`; the default is `eval_outputs/defaults`, inside which a timestamp-named folder holds each evaluation result.
- `--eval_url`: a standard OpenAI-style model endpoint, for example `http://127.0.0.1:8000/v1`. It needs to be set when evaluating a deployed model and is usually not required. Default is `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
@@ -335,6 +338,9 @@ The eval parameters inherit from the infer parameters; in addition, the following parameters are added: (Note:
- `--eval_is_chat_model`: if `eval_url` is not empty, this value must be passed to indicate whether the model is a `chat` model; False means a `base` model. Default is `None`.
- `--custom_eval_config`: evaluate with custom datasets; must be the path of a locally existing file, see [Custom Evaluation Set](./LLM评测文档.md#自定义评测集) for the file format. Default is `None`.
- `--eval_use_cache`: whether to use already generated evaluation caches, so that evaluations that have been run are not rerun and only the evaluation results are regenerated. Default is `False`.
- `--eval_output_dir`: output path for the evaluation results; default is the `eval_outputs` folder under the current directory.
- `--eval_batch_size`: input batch size for evaluation; default is 8.
- `--deploy_timeout`: a model deployment is started before evaluation; this parameter sets the wait timeout for that deployment. Default is 60, i.e. one minute.


## app-ui Parameters
15 changes: 10 additions & 5 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -323,10 +323,13 @@ export parameters inherit from infer parameters, with the following added parameters:

The eval parameters inherit from the infer parameters, and additionally include the following parameters (note: the generation_config parameters from infer will not take effect; generation is controlled by [evalscope](https://github.com/modelscope/eval-scope)). A combined usage sketch follows the parameter list below.

-- `--eval_dataset`: The official datasets for evaluation, with a default value of `['ceval', 'gsm8k', 'arc']`. Possible values include: 'arc', 'gsm8k', 'mmlu', 'cmmlu', 'ceval', 'bbh', 'general_qa'. If only custom datasets need to be evaluated, this parameter can be set to `no`.
-- `--eval_few_shot`: The few-shot number of sub-datasets for each evaluation set, with a default value of `None`, meaning the default configuration of the dataset is used.
-- `--eval_limit`: The sampling quantity for each sub-dataset of the evaluation set, with a default value of `None` indicating full-scale evaluation.
-- `--name`: Used to differentiate the result storage path for evaluating the same configuration, with the current time as the default.
+- `--eval_dataset`: The official evaluation datasets; the default is `None`, which means all datasets. Note that this argument is ignored if `custom_eval_config` is specified.
+```text
+Currently supported datasets include: 'obqa', 'AX_b', 'siqa', 'nq', 'mbpp', 'winogrande', 'mmlu', 'BoolQ', 'cluewsc', 'ocnli', 'lambada', 'CMRC', 'ceval', 'csl', 'cmnli', 'bbh', 'ReCoRD', 'math', 'humaneval', 'eprstmt', 'WSC', 'storycloze', 'MultiRC', 'RTE', 'chid', 'gsm8k', 'AX_g', 'bustm', 'afqmc', 'piqa', 'lcsts', 'strategyqa', 'Xsum', 'agieval', 'ocnli_fc', 'C3', 'tnews', 'race', 'triviaqa', 'CB', 'WiC', 'hellaswag', 'summedits', 'GaokaoBench', 'ARC_e', 'COPA', 'ARC_c', 'DRCD'
+```
+- `--eval_few_shot`: The few-shot number of sub-datasets for each evaluation set, with a default value of `None`, meaning the default configuration of the dataset is used. **This parameter is currently deprecated.**
+- `--eval_limit`: The number of samples taken from each sub-dataset of the evaluation set, with a default value of `None` indicating full-scale evaluation. You can pass an integer (the number of samples from each eval dataset) or a string such as `[10:20]` (a slice).
+- `--name`: Used to differentiate the result storage paths of evaluations with the same configuration, as `{eval_output_dir}/{name}`; the default is `eval_outputs/defaults`, inside which a timestamp-named folder holds each evaluation result.
- `--eval_url`: The standard model invocation interface for OpenAI, for example, `http://127.0.0.1:8000/v1`. This needs to be set when evaluating in a deployed manner, usually not needed. Default is `None`.
```shell
swift eval --eval_url http://127.0.0.1:8000/v1 --eval_is_chat_model true --model_type gpt4 --eval_token xxx
```
@@ -335,7 +338,9 @@ The eval parameters inherit from the infer parameters, and additionally include
- `--eval_is_chat_model`: If `eval_url` is not empty, this value needs to be passed to determine if it is a "chat" model. False represents a "base" model. Default is `None`.
- `--custom_eval_config`: Used for evaluating with custom datasets, and needs to be a locally existing file path. For details on file format, refer to [Custom Evaluation Set](./LLM-eval.md#Custom-Evaluation-Set). Default is `None`.
- `--eval_use_cache`: Whether to use already generated evaluation cache, so that previously evaluated results won't be rerun but only the evaluation results regenerated. Default is `False`.

- `--eval_output_dir`: Output path for evaluation results, default is `eval_outputs` in the current folder.
- `--eval_batch_size`: Input batch size for evaluation, default is 8.
- `--deploy_timeout`: The timeout duration for waiting for model deployment before evaluation, default is 60, which means one minute.
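As a rough illustration of how these parameters combine (as mentioned above the list), the sketch below runs a limited evaluation and stores the results under a custom directory. The model type, dataset choice, limit value, and paths are illustrative placeholders, not defaults from this PR:

```shell
# Hypothetical combined run: evaluate two datasets, sample 100 items from each,
# and write results under /tmp/my_evals/arc_baseline/<timestamp>/.
CUDA_VISIBLE_DEVICES=0 swift eval --model_type qwen2-7b-instruct \
    --eval_dataset ARC_e ARC_c \
    --eval_limit 100 \
    --eval_batch_size 8 \
    --eval_output_dir /tmp/my_evals \
    --name arc_baseline \
    --infer_backend vllm
```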

## app-ui Parameters
