Commit

1. Fix the bug that occurred when enabling quantization during RM training, caused by a compliance requirement of the transformers library

2. Lower the minimum torch version to 1.13.1 to avoid an unnecessary installation
3. Add some useful notes to the README
4. Add a fine-tuning example that enables quantization and uses a local base model
codemayq committed May 3, 2023
1 parent 6ce2e3f commit 6d160c3
Showing 5 changed files with 70 additions and 2 deletions.
22 changes: 22 additions & 0 deletions README.md
@@ -93,6 +93,17 @@ cd ChatGLM-Efficient-Tuning
pip install -r requirements.txt
```

If you want to enable quantization for LoRA or Freeze fine-tuning on Windows, you need to additionally install the bitsandbytes library.
Since bitsandbytes does not officially support Windows yet, we use a pre-built wheel that currently only supports CUDA 11.6 and CUDA 11.7:
```
pip install https://github.com/acpopescu/bitsandbytes/releases/download/v0.37.2-win.1/bitsandbytes-0.37.2-py3-none-any.whl
```

For Linux users, just install it directly:
```
pip install bitsandbytes
```
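
After installation, a quick sanity check helps confirm that bitsandbytes imports correctly and that a CUDA device is visible before enabling INT8 quantization. This is an optional snippet added here for convenience; it is not required by the training scripts.

```python
# Optional sanity check: verify bitsandbytes and CUDA before enabling INT8 training.
import torch
import bitsandbytes as bnb  # raises ImportError if the wheel was not installed correctly

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
```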

### Fine-tuning with a Single GPU

```bash
@@ -140,6 +151,7 @@ CUDA_VISIBLE_DEVICES=0 python src/train_rm.py \
--fp16
```

> The current default version uses the difference between the scores at the EOS tokens of the accepted response and the rejected response as the training reward.
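
A minimal sketch of such a pairwise objective is shown below, assuming the value head produces a scalar score at the EOS position of each response; this is an illustrative snippet, not the repository's exact implementation.

```python
# Illustrative pairwise reward-model loss computed from EOS-position scores.
# Assumes accept_scores / reject_scores hold the value-head outputs at the EOS token.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(accept_scores: torch.Tensor, reject_scores: torch.Tensor) -> torch.Tensor:
    # Push the accepted response's score above the rejected response's score.
    return -F.logsigmoid(accept_scores - reject_scores).mean()

# Dummy example with a batch of two preference pairs.
loss = pairwise_rm_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.5, 0.7]))
print(loss.item())
```
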
### Training with RLHF

```bash
@@ -224,6 +236,16 @@ model.eval()
| Freeze (l=3) | 4 | FP16 | 24GB | 8ex/s |
| Freeze (l=3) | 4 | INT8 | 12GB | 8ex/s |

| RM method | Batch size | Mode | GRAM | Speed |
|-----------------|------------| ---- |------|-------|
| LoRA (r=8) + rm | 1 | INT8 | 11GB | - |
| LoRA (r=8) + rm | 4 | FP16 | 22GB | - |

| RLHF method | Batch size | Mode | GRAM | Speed |
|------------------|------------| ---- |------|-------|
| LoRA (r=8) + ppo | 1 | INT8 | 12GB | - |
| LoRA (r=8) + ppo | 4 | FP16 | 23GB | - |

> Note: `r` is the LoRA rank, `p` is the number of prefix tokens, `l` is the number of trainable layers, and `ex/s` is the number of examples processed per second during training. `gradient_accumulation_steps` is set to `1`. All results are measured on a single Tesla V100 (32G) GPU; they are approximate values and may vary on different GPUs.
## Fine-tuning ChatGLM: A Case
24 changes: 24 additions & 0 deletions README_zh.md
@@ -97,6 +97,17 @@ cd ChatGLM-Efficient-Tuning
pip install -r requirements.txt
```

If you want to enable quantization for LoRA or Freeze fine-tuning on Windows, you need to additionally install the bitsandbytes library.
Since bitsandbytes does not currently support Windows directly, we use a pre-built package that only supports CUDA 11.6 and CUDA 11.7:
```
pip install https://github.com/acpopescu/bitsandbytes/releases/download/v0.37.2-win.1/bitsandbytes-0.37.2-py3-none-any.whl
```

For Linux users, just install it directly:
```
pip install bitsandbytes
```

### 单 GPU 微调训练

```bash
Expand Down Expand Up @@ -144,6 +155,8 @@ CUDA_VISIBLE_DEVICES=0 python src/train_rm.py \
--fp16
```

> The current default version uses the difference between the scores at the EOS tokens of the accepted response and the rejected response as the training reward.
### RLHF 训练

```bash
@@ -229,6 +242,17 @@ model.eval()
| Freeze (l=3) | 4 | FP16 | 24GB | 8ex/s |
| Freeze (l=3) | 4 | INT8 | 12GB | 8ex/s |

| RM method | Batch size | Mode | GRAM | Speed |
|-----------------|------------| ---- |------|-------|
| LoRA (r=8) + rm | 1 | INT8 | 11GB | - |
| LoRA (r=8) + rm | 4 | FP16 | 22GB | - |

| RLHF method | Batch size | Mode | GRAM | Speed |
|------------------|------------| ---- |------|-------|
| LoRA (r=8) + ppo | 1 | INT8 | 12GB | - |
| LoRA (r=8) + ppo | 4 | FP16 | 23GB | - |


> Note: `r` is the LoRA rank, `p` is the number of prefix tokens, `l` is the number of trainable layers, and `ex/s` is the number of examples processed per second during training. `gradient_accumulation_steps` is set to `1`. All of the above results are measured on a single Tesla V100 GPU and are for reference only.
## Fine-tuning ChatGLM: A Case
19 changes: 19 additions & 0 deletions examples/finetune_with_quant_and_local_model.sh
@@ -0,0 +1,19 @@
#!/bin/bash

CUDA_VISIBLE_DEVICES=0 python ../src/finetune.py \
--do_train \
--dataset alpaca_gpt4_zh \
--dataset_dir ../data \
--finetuning_type lora \
--output_dir path_to_checkpoint \
--overwrite_cache \
--per_device_train_batch_size 4 \
--gradient_accumulation_steps 4 \
--lr_scheduler_type cosine \
--logging_steps 10 \
--save_steps 1000 \
--learning_rate 5e-5 \
--num_train_epochs 1.0 \
--fp16 \
--quantization_bit 8 \
--model_name_or_path path_to_base_model
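
Once the script finishes, the LoRA weights saved in `path_to_checkpoint` can be loaded onto the local base model for a quick interactive test. The sketch below is a minimal, hypothetical example using `peft` and ChatGLM's `chat` interface; the placeholder paths mirror the ones in the script above, and the repository's own inference scripts remain the recommended way to chat with the model.

```python
# Hypothetical quick test of the fine-tuned LoRA adapter; paths are placeholders.
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

base_model_path = "path_to_base_model"   # the same local base model used for training
adapter_path = "path_to_checkpoint"      # the LoRA checkpoint produced by the script above

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True).half().cuda()
model = PeftModel.from_pretrained(model, adapter_path)
model.eval()

# ChatGLM exposes a chat() helper through its remote code.
response, _ = model.chat(tokenizer, "你好", history=[])
print(response)
```
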
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-torch>=2.0.0
+torch>=1.13.1
protobuf
cpm_kernels
sentencepiece
5 changes: 4 additions & 1 deletion src/utils/common.py
@@ -224,7 +224,10 @@ def load_pretrained(
if stage == "ppo": # load reward model
model.pretrained_model.load_adapter(model_args.reward_model, "reward", is_trainable=False)
load_valuehead_params(model, model_args.reward_model)

# Set the parameter _is_int8_training_enabled for the AutoModelForCausalLMWithValueHead model
# To meet the compliance requirements of the transformers library
if quantization == "hf" and model_args.quantization_bit == 8:
model._is_int8_training_enabled = True
print_trainable_params(model)

return model, tokenizer
