compatible with trl 0.9.6 (modelscope#1326)
hjh0119 committed Jul 8, 2024
1 parent 2204bc0 commit 7c76d04
Showing 9 changed files with 21 additions and 325 deletions.
25 changes: 3 additions & 22 deletions docs/source/LLM/人类偏好对齐训练文档.md
@@ -145,32 +145,11 @@ swift rlhf \
--save_total_limit 2
```

Training with data in the $(x,y_w,y_l)$ format
```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type dpo \
--loss_type kto_pair \
--model_type llama3-8b-instruct \
--beta 0.1 \
--desirable_weight 1.0 \
--undesirable_weight 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```

## CPO
[Paper (arXiv)](https://arxiv.org/abs/2401.08417)
Hyperparameters
- beta: coefficient on the implicit reward, default is 0.1
- cpo_alpha: coefficient on the NLL loss, default is 1.0 (see the sketch below)
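For intuition, the sketch below (a rough Python illustration with assumed tensor names, not the trl implementation) shows how the two hyperparameters enter the CPO objective: beta scales the reference-free preference term, and cpo_alpha scales the NLL behavior-cloning regularizer on the preferred response.

```python
import torch.nn.functional as F

def cpo_loss_sketch(logp_chosen, logp_rejected, nll_chosen, beta=0.1, cpo_alpha=1.0):
    """Rough sketch of the CPO objective: a reference-free preference term
    plus an NLL (behavior-cloning) regularizer on the preferred response."""
    # logp_chosen / logp_rejected: summed log-probs of the preferred / rejected
    # responses under the policy; nll_chosen: NLL of the preferred response.
    preference = -F.logsigmoid(beta * (logp_chosen - logp_rejected))
    return (preference + cpo_alpha * nll_chosen).mean()
```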

Training script
```bash
@@ -221,6 +200,7 @@ swift rlhf \
Hyperparameters
- beta: coefficient on the implicit reward, default is 2.0
- simpo_gamma: reward margin term, default is 1.0
- cpo_alpha: mixes the CPO NLL loss into SimPO to improve training stability, default is 1.0; set to 0.0 to use the original SimPO algorithm (see the sketch below)
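For intuition, the sketch below (assumed tensor names; a paraphrase of the SimPO paper's objective rather than the trl code) shows how beta, simpo_gamma, and cpo_alpha interact: the implicit reward is the length-normalized log-probability scaled by beta, simpo_gamma is the target margin between the two rewards, and cpo_alpha mixes in an NLL term on the preferred response.

```python
import torch.nn.functional as F

def simpo_loss_sketch(logp_chosen, logp_rejected, len_chosen, len_rejected,
                      nll_chosen, beta=2.0, simpo_gamma=1.0, cpo_alpha=1.0):
    """Rough sketch of the SimPO objective with the optional CPO NLL mix-in."""
    # Length-normalized implicit rewards; no reference model is needed.
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Preference term with the target reward margin simpo_gamma.
    preference = -F.logsigmoid(r_chosen - r_rejected - simpo_gamma)
    # cpo_alpha = 0.0 recovers the original SimPO loss.
    return (preference + cpo_alpha * nll_chosen).mean()
```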

```bash
CUDA_VISIBLE_DEVICES=0 \
@@ -229,6 +209,7 @@ swift rlhf \
--model_type llama3-8b-instruct \
--beta 2.0 \
--simpo_gamma 1.0 \
--cpo_alpha 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
7 changes: 4 additions & 3 deletions docs/source/LLM/命令行参数.md
@@ -234,9 +234,10 @@ RLHF parameters inherit the sft parameters; in addition, the following parameters are added:
- `--label_smoothing`: whether to use DPO smoothing; default is 0, usually set between 0 and 0.5.
- `--loss_type`: loss type, default is 'sigmoid'.
- `--sft_beta`: whether to add the sft loss in DPO, default is 0.1, supported range $[0, 1)$; the final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
- `simpo_gamma`: the reward margin term in the SimPO algorithm; the paper recommends 0.5-1.5, default is 1.0
- `desirable_weight`: the loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default is 1.0
- `undesirable_weight`: the loss weight $\lambda_U$ for undesirable responses in the KTO paper, default is 1.0. With $n_D$ and $n_U$ denoting the numbers of desirable and undesirable examples in the dataset, the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$
- `--simpo_gamma`: the reward margin term in the SimPO algorithm; the paper recommends 0.5-1.5, default is 1.0
- `--cpo_alpha`: the coefficient of the NLL loss in the CPO loss, default is 1.0; in SimPO this NLL loss is mixed in to improve training stability
- `--desirable_weight`: the loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default is 1.0
- `--undesirable_weight`: the loss weight $\lambda_U$ for undesirable responses in the KTO paper, default is 1.0. With $n_D$ and $n_U$ denoting the numbers of desirable and undesirable examples in the dataset, the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$ (a quick check is sketched below)
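As a quick illustration of the recommended KTO constraint above, the snippet below (a hedged sketch; the dataset counts and weights are made-up examples) checks that the chosen weights keep $\frac{\lambda_D n_D}{\lambda_U n_U}$ inside $[1, \frac{4}{3}]$.

```python
def kto_weight_ratio(desirable_weight, undesirable_weight, n_desirable, n_undesirable):
    """Return lambda_D * n_D / (lambda_U * n_U); the KTO paper suggests keeping it in [1, 4/3]."""
    return (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)

# Hypothetical dataset with three times as many undesirable as desirable examples.
ratio = kto_weight_ratio(desirable_weight=3.6, undesirable_weight=1.0,
                         n_desirable=1_000, n_undesirable=3_000)
assert 1.0 <= ratio <= 4 / 3, f"ratio {ratio:.2f} is outside the recommended range"
```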

## merge-lora infer Parameters

7 changes: 4 additions & 3 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -235,9 +235,10 @@ RLHF parameters are an extension of the sft parameters, with the addition of the
- `--label_smoothing`: Whether to use DPO smoothing, the default value is 0, normally set between 0 and 0.5.
- `--loss_type`: Type of loss, default value is 'sigmoid'.
- `--sft_beta`: Whether to include sft loss in DPO, default is 0.1, supporting the range $[0, 1)$ . The final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
- `simpo_gamma`: The reward margin term in the SimPO algorithm, the paper recommends setting it to 0.5-1.5, the default is 1.0.
- `desirable_weight`: The loss weight for desirable responses $\lambda_D$ in the KTO algorithm, default is 1.0.
- `undesirable_weight`: The loss weight for undesirable responses $\lambda_U$ in the KTO paper, default is 1.0. Let $n_D$ and $n_U$ represent the number of desirable and undesirable examples in the dataset, respectively. The paper recommends controlling $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$.
- `--simpo_gamma`: The reward margin term in the SimPO algorithm, the paper recommends setting it to 0.5-1.5, the default is 1.0.
- `--cpo_alpha`: The coefficient for the NLL loss in the CPO loss, with a default value of 1.0. In SimPO, a mixed NLL loss is employed to enhance training stability.
- `--desirable_weight`: The loss weight for desirable responses $\lambda_D$ in the KTO algorithm, default is 1.0.
- `--undesirable_weight`: The loss weight for undesirable responses $\lambda_U$ in the KTO paper, default is 1.0. Let $n_D$ and $n_U$ represent the number of desirable and undesirable examples in the dataset, respectively. The paper recommends controlling $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$.

## merge-lora infer Parameters

@@ -144,33 +144,11 @@ swift rlhf \
--save_total_limit 2
```

Training script using $(x,y_w,y_l)$ format data

```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type dpo \
--loss_type kto_pair \
--model_type llama3-8b-instruct \
--beta 0.1 \
--desirable_weight 1.0 \
--undesirable_weight 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```

## CPO
[Paper (arXiv)](https://arxiv.org/abs/2401.08417)
Hyperparameters
- beta: The beta factor in the CPO loss, default is 0.1
- cpo_alpha: Controls the strength of the BC regularizer in CPO training, default is 1.0

Training script
```bash
@@ -222,6 +200,7 @@ swift rlhf \
Hyperparameters
- beta: Coefficient on the implicit reward, default is 2.0
- simpo_gamma: Reward margin term, default is 1.0
- cpo_alpha: Controls the strength of the BC regularizer in CPO training; mixing this NLL loss into SimPO improves training stability. Default is 1.0; setting it to 0.0 uses the original SimPO algorithm.

Training script
```bash
2 changes: 1 addition & 1 deletion requirements/framework.txt
@@ -18,4 +18,4 @@ tensorboard
tqdm
transformers>=4.33,<4.43
transformers_stream_generator
trl>=0.9.4
trl>=0.9.6
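If you want to confirm at runtime that the environment satisfies the new floor, a minimal check (not part of the repo; it simply mirrors the version-parsing pattern the old code used) could look like:

```python
import trl
from packaging import version

assert version.parse(trl.__version__) >= version.parse('0.9.6'), (
    f'this code path assumes trl>=0.9.6, found {trl.__version__}')
```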
17 changes: 7 additions & 10 deletions swift/llm/utils/argument.py
@@ -1435,10 +1435,11 @@ class RLHFArguments(SftArguments):
max_prompt_length: int = 1024
beta: Optional[float] = None
label_smoothing: float = 0.0
loss_type: Literal['sigmoid', 'hinge', 'ipo', 'kto_pair', 'robust', 'bco_pair', 'sppo_hard', 'nca_pair', 'simpo',
'kto', 'bco'] = None
loss_type: Optional[str] = None
sft_beta: float = 0.1
# SimPO
simpo_gamma: float = 1.0 # reward margin hyperparameter in SimPO
cpo_alpha: float = 1.0
# KTO
desirable_weight: float = 1.0
undesirable_weight: float = 1.0
@@ -1448,8 +1449,7 @@ def __post_init__(self) -> None:
# without reference model
self.ref_model_free = self.rlhf_type in ['orpo', 'simpo', 'cpo']
if self.rlhf_type == 'simpo':
self.loss_type = 'simpo' # compatibility with trl > 0.9.5
self.gamma = self.simpo_gamma # compatibility with trl <= 0.9.4
self.loss_type = 'simpo'
self.set_default_beta()
self.set_default_loss_type()
self.set_default_config()
@@ -1472,10 +1472,6 @@ def set_default_config(self):
'cpo': 'trl.trainer.cpo_config.CPOConfig',
'dpo': 'trl.trainer.dpo_config.DPOConfig'
}
import trl
if version.parse(trl.__version__) <= version.parse('0.9.4'):
CONFIG_MAPPING['simpo'] = 'trl.trainer.dpo_config.DPOConfig'

if self.rlhf_type in CONFIG_MAPPING:
config_path = CONFIG_MAPPING[self.rlhf_type]
module_path, config_name = config_path.rsplit('.', 1)
@@ -1494,8 +1490,9 @@ def set_default_config(self):

def check_loss_type(self):
supported_loss_types = {
'dpo': ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'bco_pair', 'sppo_hard', 'nca_pair', 'robust'],
'cpo': ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'simpo'],
'dpo':
['sigmoid', 'hinge', 'ipo', 'bco_pair', 'sppo_hard', 'nca_pair', 'robust', 'aot', 'aot_pair', 'exo_pair'],
'cpo': ['sigmoid', 'hinge', 'ipo', 'simpo'],
'kto': ['kto', 'bco']
}
if self.rlhf_type in supported_loss_types:
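To make the simplification above concrete: with trl>=0.9.6, rlhf_type 'simpo' is handled by trl's CPO configuration with loss_type='simpo', so the old DPOConfig fallback for trl<=0.9.4 and swift's separate SimPOTrainer are no longer needed. The sketch below shows how that configuration might be used directly with trl's CPOTrainer (the model id and toy dataset are placeholders, not swift's actual wiring).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = 'your-base-model'  # placeholder: substitute a real causal LM id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference data in the prompt/chosen/rejected format CPOTrainer expects.
train_dataset = Dataset.from_dict({
    'prompt': ['What is 2 + 2?'],
    'chosen': ['4'],
    'rejected': ['5'],
})

config = CPOConfig(
    output_dir='simpo-output',
    loss_type='simpo',   # selects the SimPO preference loss inside CPOTrainer
    beta=2.0,
    simpo_gamma=1.0,     # target reward margin
    cpo_alpha=1.0,       # NLL mix-in; 0.0 gives the original SimPO objective
)
trainer = CPOTrainer(model=model, args=config, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```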
2 changes: 0 additions & 2 deletions swift/trainers/__init__.py
@@ -7,7 +7,6 @@
from .arguments import Seq2SeqTrainingArguments, TrainingArguments
from .dpo_trainer import DPOTrainer
from .orpo_trainer import ORPOTrainer
from .simpo_trainer import SimPOTrainer
from .rlhf_trainers import RLHFTrainerFactory
from .trainers import Seq2SeqTrainer, Trainer
from .utils import EvaluationStrategy, FSDPOption, HPSearchBackend, HubStrategy, \
@@ -18,7 +17,6 @@
'arguments': ['Seq2SeqTrainingArguments', 'TrainingArguments'],
'dpo_trainer': ['DPOTrainer'],
'orpo_trainer': ['ORPOTrainer'],
'simpo_trainer': ['SimPOTrainer'],
'rlhf_trainers': ['RLHFTrainerFactory'],
'trainers': ['Seq2SeqTrainer', 'Trainer'],
'utils': [
6 changes: 0 additions & 6 deletions swift/trainers/rlhf_trainers.py
@@ -38,12 +38,6 @@ def get_training_args(args: RLHFArguments):
@staticmethod
def get_trainer(rlhf_type):
module_path, class_name = RLHFTrainerFactory.TRAINERS_MAPPING[rlhf_type].rsplit('.', 1)
if rlhf_type == 'simpo':
import trl
from packaging import version
if version.parse(trl.__version__) <= version.parse('0.9.4'):
module_path = 'swift.trainers.simpo_trainer'
class_name = 'SimPOTrainer'
module = importlib.import_module(module_path)
trainer_class = getattr(module, class_name)
return trainer_class
