support ORPO algorithm (modelscope#854)
hjh0119 committed May 7, 2024
1 parent e5bb385 commit eb940b1
Showing 32 changed files with 732 additions and 25 deletions.
8 changes: 5 additions & 3 deletions README.md
@@ -39,6 +39,7 @@ To facilitate use by users unfamiliar with deep learning, we provide a Gradio we
Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.07: Supports **ORPO** training! See the [documentation](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/ORPO.md) to start training!
- 2024.04.29: Supports inference and fine-tuning of InternVL-Chat-V1.5 model. For best practice, you can refer to [here](https://github.com/modelscope/swift/tree/main/docs/source_en/Multi-Modal/internvl-best-practice.md).
- 🔥2024.04.26: Support **LISA** and **unsloth** training! Specify `--lisa_activated_layers=2` to use LISA (reducing the memory cost to 30 percent!), or specify `--tuner_backend unsloth` to use unsloth to train a huge model (full or LoRA) with less memory (30 percent or less) and faster speed (5x)!
- 🔥2024.04.26: Support the fine-tuning and inference of Qwen1.5-110B and Qwen1.5-110B-Chat model, use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh) to start training!
@@ -103,7 +104,7 @@ Additionally, we are expanding capabilities for other modalities. Currently, we

- 2024.01.04: Update [Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md) for convenient viewing of training speed and memory usage of different models.
- 🔥2023.12.29: Support web-ui for sft training and inference, use `swift web-ui` after installing ms-swift to start.
- 🔥2023.12.29: Support DPO RLHF (Reinforcement Learning from Human Feedback) and three datasets for this task: AI-ModelScope/stack-exchange-paired, AI-ModelScope/hh-rlhf and AI-ModelScope/hh_rlhf_cn. See [documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83%E6%96%87%E6%A1%A3.md) to start training!
- 🔥2023.12.29: Support DPO RLHF (Reinforcement Learning from Human Feedback) and three datasets for this task: AI-ModelScope/stack-exchange-paired, AI-ModelScope/hh-rlhf and AI-ModelScope/hh_rlhf_cn. See [documentation](https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/DPO.md) to start training!
- 🔥2023.12.28: Support SCEdit! This tuner can significantly reduce memory usage in U-Net and support low-memory controllable image generation (replacing ControlNet), read the section below to learn more.
- 2023.12.23: Support [codegeex2-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/codegeex2_6b).
- 2023.12.19: Support [phi2-3b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/phi2_3b).
@@ -209,7 +210,7 @@ You can refer to the following scripts to customize your own training script.
|------------------|-------------------------------------------------------------------------------|
| Pretraining | Text Generation |
| Fine-tuning | Single-turn/Multi-turn<br>Agent Training/Self-cognition<br>Multi-modal Vision/Multi-modal Speech|
| Human Alignment | DPO |
| Human Alignment | DPO<br>ORPO |
| Text-to-Image | DreamBooth, etc. |
| Text-to-Video | - |

@@ -570,7 +571,8 @@ make docs
| [LLM Evaluation](docs/source_en/LLM/LLM-eval.md) |
| [LLM Quantization](docs/source_en/LLM/LLM-quantization.md) |
| [LLM Deployment](docs/source_en/LLM/VLLM-inference-acceleration-and-deployment.md) |
| [DPO Human Alignment Training](docs/source_en/LLM/RLHF.md) |
| [DPO Human Alignment Training](docs/source_en/LLM/DPO.md) |
| [ORPO Human Alignment Training](docs/source_en/LLM/ORPO.md) |
| [AnimateDiff Training](docs/source_en/AIGC/AnimateDiff-train-infer.md) |

### Reference Documentation
7 changes: 4 additions & 3 deletions README_CN.md
@@ -40,6 +40,7 @@ SWIFT supports training, inference, and more for nearly **200 LLMs and MLLMs** (multimodal large models)
Additionally, we are expanding capabilities for other modalities; currently we support full-parameter training and LoRA training for AnimateDiff.

## 🎉 News
- 2024.05.07: Supports **ORPO** training! Use `swift orpo` to get started; see the best practice [here](https://github.com/modelscope/swift/tree/main/docs/source/LLM/ORPO算法最佳实践.md).
- 2024.04.29: Supports inference and fine-tuning of InternVL-Chat-V1.5; see the best practice [here](https://github.com/modelscope/swift/tree/main/docs/source/Multi-Modal/internvl最佳实践.md).
- 🔥2024.04.26: Supports **LISA** and **unsloth** training! Specify `--lisa_activated_layers=2` to enable LISA (reducing memory usage to 30% of full-parameter training), or specify `--tuner_backend unsloth` to use unsloth to train a huge model with less memory (30% or less) and faster speed (5x)!
- 🔥2024.04.26: Supports inference and fine-tuning of the Qwen1.5-110B and Qwen1.5-110B-Chat models; use [this script](https://github.com/modelscope/swift/blob/main/examples/pytorch/llm/scripts/qwen1half_110b_chat/lora_ddp_ds/sft.sh) to start training!
@@ -104,7 +105,7 @@ SWIFT supports training, inference, and more for nearly **200 LLMs and MLLMs** (multimodal large models)
- 🔥2024.01.04: Supports **VLLM deployment**, compatible with the **OpenAI API** style; see [VLLM Inference Acceleration and Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM推理加速与部署.md#部署) for details.
- 2024.01.04: Updated the [Benchmark](https://github.com/modelscope/swift/blob/main/docs/source/LLM/Benchmark.md) for convenient viewing of the training speed and memory usage of different models.
- 🔥 2023.12.29: Supports web-ui for SFT training and inference; after installing ms-swift, run `swift web-ui` to start.
- 🔥 2023.12.29: Supports DPO RLHF (Reinforcement Learning from Human Feedback) and three datasets for this task: AI-ModelScope/stack-exchange-paired, AI-ModelScope/hh-rlhf, and AI-ModelScope/hh_rlhf_cn. See the [documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83%E6%96%87%E6%A1%A3.md) to start training!
- 🔥 2023.12.29: Supports DPO RLHF (Reinforcement Learning from Human Feedback) and three datasets for this task: AI-ModelScope/stack-exchange-paired, AI-ModelScope/hh-rlhf, and AI-ModelScope/hh_rlhf_cn. See the [documentation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/DPO%E8%AE%AD%E7%BB%83%E6%96%87%E6%A1%A3.md) to start training!
- 🔥 2023.12.28: Supports SCEdit! This tuner can significantly reduce memory usage in U-Net and supports low-memory controllable image generation (replacing ControlNet); read the section below to learn more.
- 2023.12.23: Supports [codegeex2-6b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/codegeex2_6b).
- 2023.12.19: Supports [phi2-3b](https://github.com/modelscope/swift/tree/main/examples/pytorch/llm/scripts/phi2_3b).
@@ -210,7 +211,7 @@ swift web-ui
| -------- |------------------------------------|
| Pretraining | Text Generation |
| Fine-tuning | Single-turn/Multi-turn<br>Agent Training/Self-cognition<br>Multi-modal Vision/Multi-modal Speech |
| Human Alignment | DPO |
| Human Alignment | DPO<br>ORPO |
| Text-to-Image | DreamBooth, etc. |
| Text-to-Video | - |

@@ -570,7 +571,7 @@ make docs
| [LLM Evaluation](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E8%AF%84%E6%B5%8B%E6%96%87%E6%A1%A3.md) |
| [LLM Quantization](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E9%87%8F%E5%8C%96%E6%96%87%E6%A1%A3.md) |
| [LLM Deployment](https://github.com/modelscope/swift/blob/main/docs/source/LLM/VLLM%E6%8E%A8%E7%90%86%E5%8A%A0%E9%80%9F%E4%B8%8E%E9%83%A8%E7%BD%B2.md) |
| [DPO Human Alignment Training](https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E4%BA%BA%E7%B1%BB%E5%AF%B9%E9%BD%90%E8%AE%AD%E7%BB%83%E6%96%87%E6%A1%A3.md) |
| [DPO Human Alignment Training](https://github.com/modelscope/swift/blob/main/docs/source/LLM/DPO%E8%AE%AD%E7%BB%83%E6%96%87%E6%A1%A3.md) |
| [AnimateDiff Training](https://github.com/modelscope/swift/blob/main/docs/source/AIGC/AnimateDiff%E5%BE%AE%E8%B0%83%E6%8E%A8%E7%90%86%E6%96%87%E6%A1%A3.md) |


Binary file added docs/resources/orpo1.png
Binary file added docs/resources/orpo2.png
Binary file added docs/resources/orpo3.png
Binary file added docs/resources/orpo4.png
Binary file added docs/resources/orpo5.png
Binary file added docs/resources/orpo6.png
Binary file added docs/resources/orpo7.png
Binary file added docs/resources/orpo8.png
@@ -1,4 +1,4 @@
# LLM Human Alignment Training Documentation
# DPO Training Documentation
## Table of Contents
- [Environment Setup](#环境准备)
- [Human Alignment Training](#人类对齐训练)
@@ -76,6 +76,7 @@ cd examples/pytorch/llm

**Tips**:

- If you train a base model with data that contains history, you need to specify a template that supports multi-turn dialogue (base models often do not). For this case we set the `chatml` template by default; you can also use `--model_type` to select the template of the model being trained.
- We set `--gradient_checkpointing true` by default during training to **save GPU memory**; this slightly slows down training.
- If you are using an older GPU such as the **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, since these GPUs do not support bf16.
- If your machine has high-performance GPUs such as the A100 and you are using a Qwen-series model, we recommend installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (training with flash-attn is not supported on GPUs such as the A10, 3090, and V100). The models that support flash-attn are listed in [Supported Models](支持的模型和数据集.md#模型).
6 changes: 5 additions & 1 deletion docs/source/LLM/LLM微调文档.md
@@ -3,6 +3,7 @@
- [Environment Setup](#环境准备)
- [Fine-tuning](#微调)
- [DPO](#dpo)
- [ORPO](#orpo)
- [Merge LoRA](#merge-lora)
- [Quantization](#量化)
- [Inference](#推理)
@@ -167,7 +168,10 @@ cd examples/pytorch/llm


## DPO
If you want to use DPO for human alignment, see the [Human Alignment Fine-tuning Documentation](LLM人类对齐训练文档.md).
If you want to use DPO for human alignment, see the [DPO Training Documentation](DPO训练文档.md).

## ORPO
If you want to use ORPO for human alignment, see the [ORPO Best Practice](ORPO算法最佳实践.md).

## Merge LoRA
Tip: merging LoRA into bnb- and auto_gptq-quantized models is **temporarily** unsupported, as it would cause a significant loss of precision.
129 changes: 129 additions & 0 deletions docs/source/LLM/ORPO算法最佳实践.md
@@ -0,0 +1,129 @@
# ORPO Algorithm Best Practice
[ORPO](https://arxiv.org/abs/2403.07691) training uses the same data format as DPO: on top of the SFT data [query, response], it additionally requires a `rejected_response` field containing an answer the model should not generate.

The ORPO algorithm adds an odds-ratio (OR) negative log-likelihood loss term to the SFT training loss to reduce the probability of generating the rejected response.
The hyperparameter `beta` is the coefficient of the OR loss term: the larger `beta`, the heavier the penalty on the `rejected_response`. The default is 0.1.
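
As a reference, the objective from the [ORPO paper](https://arxiv.org/abs/2403.07691) can be written as follows (a reconstruction for orientation, with swift's `beta` playing the role of the paper's λ, where $y_w$ is the `response` and $y_l$ the `rejected_response`):

```math
\mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \beta \cdot \mathcal{L}_{\mathrm{OR}}, \qquad
\mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right), \qquad
\mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
```

Here σ is the sigmoid function. Since no reference model appears in the loss, ORPO (unlike DPO) does not need to keep a frozen reference copy of the model in memory during training.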

In this best practice we use the ORPO algorithm to train the [llama3-8b-instruct](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/summary) model so that it answers in Chinese.

## Table of Contents
- [Environment Setup](#environment-setup)
- [Dataset](#dataset)
- [Training](#training)
- [Inference](#inference)

## Environment Setup
```bash
# Set the global pip mirror (to speed up downloads)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
git clone https://github.com/modelscope/swift.git
cd swift
pip install -e '.[llm]'

# Align the environment (usually not needed; run the following if you hit errors, as the repo is tested with the latest environment)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```


## Dataset
We use the [Llama3-DPO-zh-en-emoji](https://modelscope.cn/datasets/shareAI/shareAI-Llama3-DPO-zh-en-emoji/summary) dataset, which contains Chinese queries together with Chinese and English answers.

Data sample:
```json
{"question": "一个完整的工程项目管理流程包括什么?",
"answer_zh": "🤔 Ahahah, 项目管理流程, 这个问题太常见了! 😂\n\n一个完整的工程项目管理流程包括以下几个阶段:\n\n1️⃣ **项目启动** 🚀:项目的开始,定义项目的目标、范围、时间表和预算。\n\n2️⃣ **项目计划** 📅:制定项目计划,包括任务分解、资源分配、进度安排和风险管理。\n\n3️⃣ **项目执行** 💪:按照计划执行项目,包括任务执行、进度控制和质量管理。\n\n4️⃣ **项目监控** 👀:对项目的进度、成本和质量进行监控和控制,确保项目不偏离轨道。\n\n5️⃣ **项目收尾** 🎉:项目完成后,进行项目总结、经验分享和文档编制。\n\n6️⃣ **项目评价** 🤔:对项目的成果和过程进行评价,总结经验和教训。\n\n这些阶段是一个完整的工程项目管理流程,但是,具体的项目可能会根据需要添加或删除某些阶段。 🤝\n\n还有一点,项目管理流程中需要注意的几个关键点是:\n\n* **沟通** 💬:项目团队之间、项目团队和客户之间的沟通是非常重要的。\n* **风险管理** ⚠️:项目中总会出现一些风险和问题,需要提前预测和准备。\n* **变更管理** 🔄:项目中可能会出现一些变更,需要及时处理和更新。\n\n总之,项目管理流程是一个复杂的过程,需要项目经理和团队成员的共同努力和合作。 💪",
"answer_en": "My fellow wizarding friend! 🧙‍♂️\n\nA complete project management process, you ask? Well, I shall conjure up the essential steps for thee! 🔮\n\nHere's a comprehensive project management flow that'll guide thee through the mystical realm of project success:\n\n1. **Initiation**: Define project scope, goals, and deliverables. Identify stakeholders, and determine the project's feasibility. ⚡️\n2. **Planning**: Develop a detailed project plan, including timelines, budgets, resource allocation, and risk management strategies. 📅\n3. **Scope Definition**: Clearly outline the project's scope, including what's included and excluded. This will help prevent scope creep, my friend! 🔒\n4. **Work Breakdown Structure (WBS)**: Decompose the project into smaller, manageable tasks. This will aid in task assignment, scheduling, and tracking progress. 🗂️\n5. **Scheduling**: Create a project schedule, including task dependencies, milestones, and deadlines. This will help thee stay on track and avoid procrastination! ⏰\n6. **Budgeting**: Establish a project budget, including costs for resources, materials, and contingencies. This will ensure thou dost not overspend, my wise friend! 💸\n7. **Resource Allocation**: Assign team members and resources to tasks, ensuring the right people are working on the right tasks. This will optimize productivity and efficiency! 👥\n8. **Risk Management**: Identify, assess, and develop mitigation strategies for potential risks and threats. This will safeguard thy project from unforeseen dangers! ⚠️\n9. **Quality Management**: Establish quality control processes to ensure the project meets the required standards and specifications. This will guarantee a top-notch outcome, my friend! ✨\n10. **Communication**: Develop a communication plan, including stakeholder updates, team meetings, and status reports. This will keep everyone informed and aligned! 📢\n11. **Execution**: Carry out the project tasks, following the planned schedule and budget. This is where the magic happens, my friend! ✨\n12. **Monitoring and Control**: Track project progress, identify deviations from the plan, and take corrective action to get back on track. This will ensure thy project stays on course! 📊\n13. **Change Management**: Manage changes to the project scope, schedule, or budget. This will minimize disruptions and ensure the project remains viable! 🔀\n14. **Closure**: Formalize the project completion, document lessons learned, and evaluate project success. This will bring a sense of accomplishment and closure, my friend! 🎉\n\nThere thou hast it, my fellow wizard! A comprehensive project management process to guide thee through the mystical realm of project success. May thy projects be prosperous and thy stakeholders be delighted! 😊"}

```

swift ships a built-in preprocessor for this dataset that uses `answer_zh` as the `response` and `answer_en` as the `rejected_response`; simply pass `--dataset shareai-llama3-dpo-zh-en-emoji` as a training argument.
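
If you bring your own data instead, a minimal sample in the [query, response, rejected_response] format described at the top of this document might look like the following (the field values are purely illustrative):

```json
{"query": "What does a complete project management process include?", "response": "The preferred answer, which the model should learn to generate.", "rejected_response": "The dispreferred answer, whose generation probability ORPO pushes down."}
```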

## Training
```shell
# Experimental environment: A100
# DDP + MP
# Memory usage: 4*24G
# nproc_per_node is referenced below to keep the global batch size at 16
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1,2,3 \
NPROC_PER_NODE=$nproc_per_node \
swift orpo \
--model_type llama3-8b-instruct \
--beta 0.5 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
--warmup_ratio 0.03 \
--save_total_limit 2
# MP (device map)
# Memory usage: 2*24G
CUDA_VISIBLE_DEVICES=0,1 \
swift orpo \
--model_type llama3-8b-instruct \
--beta 0.5 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2

# Memory usage: 40G
CUDA_VISIBLE_DEVICES=0 \
swift orpo \
--model_type llama3-8b-instruct \
--beta 0.5 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```
**Tips**:

- If you train a base model with data that contains history, you need to specify a template that supports multi-turn dialogue (base models often do not). For this case we set the `chatml` template by default; you can also use `--model_type` to select the template of the model being trained.
- We set `--gradient_checkpointing true` by default during training to **save GPU memory**; this slightly slows down training.
- If you are using an older GPU such as the **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, since these GPUs do not support bf16.
- If your machine has high-performance GPUs such as the A100 and you are using a Qwen-series model, we recommend installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (training with flash-attn is not supported on GPUs such as the A10, 3090, and V100). The models that support flash-attn are listed in [Supported Models](支持的模型和数据集.md#模型).
- If you need to train without network access, use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`, as sketched after this list. See [Command-line Arguments](命令行参数.md) for the exact meaning of these parameters.
- If you want to push your weights to the ModelScope Hub during training, set `--push_to_hub true`.
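
Combining the offline flags from the tip above, a launch might look like this sketch (`/path/to/llama3-8b-instruct` is a placeholder for your local model directory; the remaining flags mirror the training commands above):

```bash
# Offline training: read the model from a local directory and skip the hub version check
CUDA_VISIBLE_DEVICES=0 \
swift orpo \
    --model_type llama3-8b-instruct \
    --model_id_or_path /path/to/llama3-8b-instruct \
    --check_model_is_latest false \
    --sft_type lora \
    --dataset shareai-llama3-dpo-zh-en-emoji
```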

## Inference
The inference examples below use the `swift web-ui` command.
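
If you prefer the command line, a trained LoRA checkpoint can also be loaded with `swift infer`; a sketch, where the checkpoint path is a placeholder to be replaced with the directory written by your training run:

```bash
# Hypothetical checkpoint path; use the output directory from your own run
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --ckpt_dir output/llama3-8b-instruct/vx-xxx/checkpoint-xxx
```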

### Inference before training
> 你是谁 (Who are you?)

![orpo1](../../resources/orpo1.png)

> 西湖醋鱼怎么做 (How do you cook West Lake vinegar fish?)

![orpo2](../../resources/orpo2.png)
![orpo3](../../resources/orpo3.png)
![orpo4](../../resources/orpo4.png)
![orpo5](../../resources/orpo5.png)


### Inference after training
> 你是谁 (Who are you?)

![orpo6](../../resources/orpo6.png)

> 西湖醋鱼怎么做 (How do you cook West Lake vinegar fish?)

![orpo7](../../resources/orpo7.png)
![orpo8](../../resources/orpo8.png)
4 changes: 2 additions & 2 deletions docs/source/LLM/index.md
@@ -17,13 +17,13 @@

1. [LLM Inference Documentation](LLM推理文档.md)
2. [LLM Fine-tuning Documentation](LLM微调文档.md)
3. [DPO Training Documentation](LLM人类对齐训练文档.md)
3. [DPO Training Documentation](DPO训练文档.md)
4. [Web-UI Training and Inference](https://github.com/modelscope/swift/blob/main/docs/source/GetStarted/%E7%95%8C%E9%9D%A2%E8%AE%AD%E7%BB%83%E6%8E%A8%E7%90%86.md)
5. [LLM Evaluation Documentation](LLM评测文档.md)
6. [LLM Quantization Documentation](LLM量化文档.md)
7. [VLLM Inference Acceleration and Deployment](VLLM推理加速与部署.md)
8. [LLM Experiment Documentation](LLM实验文档.md)

9. [ORPO Best Practice](ORPO算法最佳实践.md)

### 🐔 Reference Documentation
1. [Custom Models and Datasets](自定义与拓展.md)
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -25,7 +25,8 @@ Swift DOCUMENTATION
LLM/Agent微调最佳实践.md
LLM/LLM推理文档.md
LLM/LLM微调文档.md
LLM/LLM人类对齐训练文档.md
LLM/DPO训练文档.md
LLM/ORPO算法最佳实践.md
LLM/VLLM推理加速与部署.md
LLM/支持的模型和数据集.md
LLM/自定义与拓展.md
File renamed without changes.
