compatible with trl 0.9.6 (modelscope#1326)
hjh0119 committed Jul 8, 2024
1 parent 2204bc0 commit 7c76d04
Showing 9 changed files with 21 additions and 325 deletions.
25 changes: 3 additions & 22 deletions docs/source/LLM/人类偏好对齐训练文档.md
@@ -145,32 +145,11 @@ swift rlhf \
--save_total_limit 2
```

Training with data in the $(x,y_w,y_l)$ format
```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type dpo \
--loss_type kto_pair \
--model_type llama3-8b-instruct \
--beta 0.1 \
--desirable_weight 1.0 \
--undesirable_weight 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```

## CPO
[Paper (arXiv)](https://arxiv.org/abs/2401.08417)
Hyperparameters
- beta: coefficient on the implicit reward, default is 0.1
- cpo_alpha: coefficient on the NLL loss, default is 1.0 (see the sketch below)
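For intuition, the sketch below (a rough Python illustration with assumed tensor names, not the trl implementation) shows how the two hyperparameters enter the CPO objective: beta scales the reference-free preference term, and cpo_alpha scales the NLL behavior-cloning regularizer on the preferred response.

```python
import torch.nn.functional as F

def cpo_loss_sketch(logp_chosen, logp_rejected, nll_chosen, beta=0.1, cpo_alpha=1.0):
    """Rough sketch of the CPO objective: a reference-free preference term
    plus an NLL (behavior-cloning) regularizer on the preferred response."""
    # logp_chosen / logp_rejected: summed log-probs of the preferred / rejected
    # responses under the policy; nll_chosen: NLL of the preferred response.
    preference = -F.logsigmoid(beta * (logp_chosen - logp_rejected))
    return (preference + cpo_alpha * nll_chosen).mean()
```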

Training script
```bash
@@ -221,6 +200,7 @@ swift rlhf \
Hyperparameters
- beta: coefficient on the implicit reward, default is 2.0
- simpo_gamma: reward margin term, default is 1.0
- cpo_alpha: mixes the CPO NLL loss into SimPO to improve training stability, default is 1.0; set to 0.0 to use the original SimPO algorithm (see the sketch below)
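For intuition, the sketch below (assumed tensor names; a paraphrase of the SimPO paper's objective rather than the trl code) shows how beta, simpo_gamma, and cpo_alpha interact: the implicit reward is the length-normalized log-probability scaled by beta, simpo_gamma is the target margin between the two rewards, and cpo_alpha mixes in an NLL term on the preferred response.

```python
import torch.nn.functional as F

def simpo_loss_sketch(logp_chosen, logp_rejected, len_chosen, len_rejected,
                      nll_chosen, beta=2.0, simpo_gamma=1.0, cpo_alpha=1.0):
    """Rough sketch of the SimPO objective with the optional CPO NLL mix-in."""
    # Length-normalized implicit rewards; no reference model is needed.
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Preference term with the target reward margin simpo_gamma.
    preference = -F.logsigmoid(r_chosen - r_rejected - simpo_gamma)
    # cpo_alpha = 0.0 recovers the original SimPO loss.
    return (preference + cpo_alpha * nll_chosen).mean()
```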

```bash
CUDA_VISIBLE_DEVICES=0 \
@@ -229,6 +209,7 @@ swift rlhf \
--model_type llama3-8b-instruct \
--beta 2.0 \
--simpo_gamma 1.0 \
--cpo_alpha 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
7 changes: 4 additions & 3 deletions docs/source/LLM/命令行参数.md
@@ -234,9 +234,10 @@ RLHF parameters inherit the sft parameters; in addition, the following parameters are added:
- `--label_smoothing`: whether to use DPO smoothing; default is 0, usually set between 0 and 0.5.
- `--loss_type`: loss type, default is 'sigmoid'.
- `--sft_beta`: whether to add the sft loss in DPO, default is 0.1, supported range $[0, 1)$; the final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
- `simpo_gamma`: the reward margin term in the SimPO algorithm; the paper recommends 0.5-1.5, default is 1.0
- `desirable_weight`: the loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default is 1.0
- `undesirable_weight`: the loss weight $\lambda_U$ for undesirable responses in the KTO paper, default is 1.0. With $n_D$ and $n_U$ denoting the numbers of desirable and undesirable examples in the dataset, the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$
- `--simpo_gamma`: the reward margin term in the SimPO algorithm; the paper recommends 0.5-1.5, default is 1.0
- `--cpo_alpha`: the coefficient of the NLL loss in the CPO loss, default is 1.0; in SimPO this NLL loss is mixed in to improve training stability
- `--desirable_weight`: the loss weight $\lambda_D$ for desirable responses in the KTO algorithm, default is 1.0
- `--undesirable_weight`: the loss weight $\lambda_U$ for undesirable responses in the KTO paper, default is 1.0. With $n_D$ and $n_U$ denoting the numbers of desirable and undesirable examples in the dataset, the paper recommends keeping $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$ (a quick check is sketched below)
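As a quick illustration of the recommended KTO constraint above, the snippet below (a hedged sketch; the dataset counts and weights are made-up examples) checks that the chosen weights keep $\frac{\lambda_D n_D}{\lambda_U n_U}$ inside $[1, \frac{4}{3}]$.

```python
def kto_weight_ratio(desirable_weight, undesirable_weight, n_desirable, n_undesirable):
    """Return lambda_D * n_D / (lambda_U * n_U); the KTO paper suggests keeping it in [1, 4/3]."""
    return (desirable_weight * n_desirable) / (undesirable_weight * n_undesirable)

# Hypothetical dataset with three times as many undesirable as desirable examples.
ratio = kto_weight_ratio(desirable_weight=3.6, undesirable_weight=1.0,
                         n_desirable=1_000, n_undesirable=3_000)
assert 1.0 <= ratio <= 4 / 3, f"ratio {ratio:.2f} is outside the recommended range"
```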

## merge-lora infer Parameters

7 changes: 4 additions & 3 deletions docs/source_en/LLM/Command-line-parameters.md
@@ -235,9 +235,10 @@ RLHF parameters are an extension of the sft parameters, with the addition of the
- `--label_smoothing`: Whether to use DPO smoothing, the default value is 0, normally set between 0 and 0.5.
- `--loss_type`: Type of loss, default value is 'sigmoid'.
- `--sft_beta`: Whether to include sft loss in DPO, default is 0.1, supporting the range $[0, 1)$ . The final loss is `(1-sft_beta)*KL_loss + sft_beta * sft_loss`.
- `simpo_gamma`: The reward margin term in the SimPO algorithm, the paper recommends setting it to 0.5-1.5, the default is 1.0.
- `desirable_weight`: The loss weight for desirable responses $\lambda_D$ in the KTO algorithm, default is 1.0.
- `undesirable_weight`: The loss weight for undesirable responses $\lambda_U$ in the KTO paper, default is 1.0. Let $n_D$ and $n_U$ represent the number of desirable and undesirable examples in the dataset, respectively. The paper recommends controlling $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$.
- `--simpo_gamma`: The reward margin term in the SimPO algorithm, the paper recommends setting it to 0.5-1.5, the default is 1.0.
- `--cpo_alpha`: The coefficient for the NLL loss in the CPO loss, with a default value of 1.0. In SimPO, a mixed NLL loss is employed to enhance training stability.
- `--desirable_weight`: The loss weight for desirable responses $\lambda_D$ in the KTO algorithm, default is 1.0.
- `--undesirable_weight`: The loss weight for undesirable responses $\lambda_U$ in the KTO paper, default is 1.0. Let $n_D$ and $n_U$ represent the number of desirable and undesirable examples in the dataset, respectively. The paper recommends controlling $\frac{\lambda_D n_D}{\lambda_U n_U} \in [1,\frac{4}{3}]$.

## merge-lora infer Parameters

@@ -144,33 +144,11 @@ swift rlhf \
--save_total_limit 2
```

Training script using $(x,y_w,y_l)$ format data

```bash
CUDA_VISIBLE_DEVICES=0 \
swift rlhf \
--rlhf_type dpo \
--loss_type kto_pair \
--model_type llama3-8b-instruct \
--beta 0.1 \
--desirable_weight 1.0 \
--undesirable_weight 1.0 \
--sft_type lora \
--dataset shareai-llama3-dpo-zh-en-emoji \
--num_train_epochs 2 \
--lora_target_modules ALL \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 5e-5 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.03 \
--save_total_limit 2
```

## CPO
[Paper (arXiv)](https://arxiv.org/abs/2401.08417)
Hyperparameters
- beta: The beta factor in the CPO loss, default is 0.1
- cpo_alpha: Controls the strength of the BC regularizer in CPO training, default is 1.0

Training script
```bash
@@ -222,6 +200,7 @@ swift rlhf \
Hyperparameters
- beta: Coefficient on the implicit reward, default is 2.0
- simpo_gamma: Reward margin term, default is 1.0
- cpo_alpha: Controls the strength of the BC regularizer in CPO training; mixing this NLL loss into SimPO improves training stability. Default is 1.0; setting it to 0.0 uses the original SimPO algorithm.

Training script
```bash
2 changes: 1 addition & 1 deletion requirements/framework.txt
@@ -18,4 +18,4 @@ tensorboard
tqdm
transformers>=4.33,<4.43
transformers_stream_generator
trl>=0.9.4
trl>=0.9.6
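If you want to confirm at runtime that the environment satisfies the new floor, a minimal check (not part of the repo; it simply mirrors the version-parsing pattern the old code used) could look like:

```python
import trl
from packaging import version

assert version.parse(trl.__version__) >= version.parse('0.9.6'), (
    f'this code path assumes trl>=0.9.6, found {trl.__version__}')
```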
17 changes: 7 additions & 10 deletions swift/llm/utils/argument.py
@@ -1435,10 +1435,11 @@ class RLHFArguments(SftArguments):
max_prompt_length: int = 1024
beta: Optional[float] = None
label_smoothing: float = 0.0
loss_type: Literal['sigmoid', 'hinge', 'ipo', 'kto_pair', 'robust', 'bco_pair', 'sppo_hard', 'nca_pair', 'simpo',
'kto', 'bco'] = None
loss_type: Optional[str] = None
sft_beta: float = 0.1
# SimPO
simpo_gamma: float = 1.0 # reward margin hyperparameter in SimPO
cpo_alpha: float = 1.0
# KTO
desirable_weight: float = 1.0
undesirable_weight: float = 1.0
@@ -1448,8 +1449,7 @@ def __post_init__(self) -> None:
# without reference model
self.ref_model_free = self.rlhf_type in ['orpo', 'simpo', 'cpo']
if self.rlhf_type == 'simpo':
self.loss_type = 'simpo' # compatibility with trl > 0.9.5
self.gamma = self.simpo_gamma # compatibility with trl <= 0.9.4
self.loss_type = 'simpo'
self.set_default_beta()
self.set_default_loss_type()
self.set_default_config()
@@ -1472,10 +1472,6 @@ def set_default_config(self):
'cpo': 'trl.trainer.cpo_config.CPOConfig',
'dpo': 'trl.trainer.dpo_config.DPOConfig'
}
import trl
if version.parse(trl.__version__) <= version.parse('0.9.4'):
CONFIG_MAPPING['simpo'] = 'trl.trainer.dpo_config.DPOConfig'

if self.rlhf_type in CONFIG_MAPPING:
config_path = CONFIG_MAPPING[self.rlhf_type]
module_path, config_name = config_path.rsplit('.', 1)
@@ -1494,8 +1490,9 @@ def set_default_config(self):

def check_loss_type(self):
supported_loss_types = {
'dpo': ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'bco_pair', 'sppo_hard', 'nca_pair', 'robust'],
'cpo': ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'simpo'],
'dpo':
['sigmoid', 'hinge', 'ipo', 'bco_pair', 'sppo_hard', 'nca_pair', 'robust', 'aot', 'aot_pair', 'exo_pair'],
'cpo': ['sigmoid', 'hinge', 'ipo', 'simpo'],
'kto': ['kto', 'bco']
}
if self.rlhf_type in supported_loss_types:
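To make the simplification above concrete: with trl>=0.9.6, rlhf_type 'simpo' is handled by trl's CPO configuration with loss_type='simpo', so the old DPOConfig fallback for trl<=0.9.4 and swift's separate SimPOTrainer are no longer needed. The sketch below shows how that configuration might be used directly with trl's CPOTrainer (the model id and toy dataset are placeholders, not swift's actual wiring).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = 'your-base-model'  # placeholder: substitute a real causal LM id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Toy preference data in the prompt/chosen/rejected format CPOTrainer expects.
train_dataset = Dataset.from_dict({
    'prompt': ['What is 2 + 2?'],
    'chosen': ['4'],
    'rejected': ['5'],
})

config = CPOConfig(
    output_dir='simpo-output',
    loss_type='simpo',   # selects the SimPO preference loss inside CPOTrainer
    beta=2.0,
    simpo_gamma=1.0,     # target reward margin
    cpo_alpha=1.0,       # NLL mix-in; 0.0 gives the original SimPO objective
)
trainer = CPOTrainer(model=model, args=config, train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```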
2 changes: 0 additions & 2 deletions swift/trainers/__init__.py
@@ -7,7 +7,6 @@
from .arguments import Seq2SeqTrainingArguments, TrainingArguments
from .dpo_trainer import DPOTrainer
from .orpo_trainer import ORPOTrainer
from .simpo_trainer import SimPOTrainer
from .rlhf_trainers import RLHFTrainerFactory
from .trainers import Seq2SeqTrainer, Trainer
from .utils import EvaluationStrategy, FSDPOption, HPSearchBackend, HubStrategy, \
@@ -18,7 +17,6 @@
'arguments': ['Seq2SeqTrainingArguments', 'TrainingArguments'],
'dpo_trainer': ['DPOTrainer'],
'orpo_trainer': ['ORPOTrainer'],
'simpo_trainer': ['SimPOTrainer'],
'rlhf_trainers': ['RLHFTrainerFactory'],
'trainers': ['Seq2SeqTrainer', 'Trainer'],
'utils': [
6 changes: 0 additions & 6 deletions swift/trainers/rlhf_trainers.py
@@ -38,12 +38,6 @@ def get_training_args(args: RLHFArguments):
@staticmethod
def get_trainer(rlhf_type):
module_path, class_name = RLHFTrainerFactory.TRAINERS_MAPPING[rlhf_type].rsplit('.', 1)
if rlhf_type == 'simpo':
import trl
from packaging import version
if version.parse(trl.__version__) <= version.parse('0.9.4'):
module_path = 'swift.trainers.simpo_trainer'
class_name = 'SimPOTrainer'
module = importlib.import_module(module_path)
trainer_class = getattr(module, class_name)
return trainer_class
