DPO training with internlm-xcomposer2_5-7b-chat fails with a data error #1831

RBBB2010 · 2024-08-27T07:44:18Z

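First of all, thank you very much for the convenience SWIFT provides!

I am running DPO training on a model that I fine-tuned with LoRA SFT and then merged. I built my dataset strictly according to the RLHF data format, but collate_fn raises the following error: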

Map: 100%|██████████| 86/86 [00:20<00:00, 4.28 examples/s]
Train: 0%| | 0/6 [00:00<?, ?it/s]

[rank1]: Original Traceback (most recent call last):
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

I checked my own data for problems, but the error persisted. I then switched to a public dataset (--dataset rlaif-v#1000), and collate_fn still raised the same error:
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

Looking more closely, ex[k] is None when k=prompt_labels. What is causing this?

My sh script is as follows:
swift rlhf \
--rlhf_type dpo \
--model_type internlm-xcomposer2_5-7b-chat \
--model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
--ref_model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
--dataset rlaif-v#90 \
--dtype bf16 \
--beta 0.1 \
--sft_type lora \
--init_lora_weights 'pissa' \
--use_flash_attn true \
--num_train_epochs 4 \
--gradient_checkpointing true \
--batch_size 2
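
For the None value reported under prompt_labels above, a small pre-collation check can show which examples and keys are affected before torch.tensor is ever called. This is a minimal sketch, not part of SWIFT: find_none_fields is a hypothetical helper, and the key prompt_input_ids in the toy batch is an assumed name used only for illustration (prompt_labels is the key named in the report).

```python
def find_none_fields(features):
    """Return (example_index, key) pairs whose value is None.

    `features` is a list of dicts, the same shape of input that a
    collate function receives for one batch.
    """
    return [
        (i, key)
        for i, example in enumerate(features)
        for key, value in example.items()
        if value is None
    ]


# Toy batch: "prompt_input_ids" is a hypothetical key name;
# "prompt_labels" is the key reported as None in this issue.
features = [
    {"prompt_input_ids": [1, 2, 3], "prompt_labels": None},
    {"prompt_input_ids": [4, 5], "prompt_labels": [-100, 5]},
]

print(find_none_fields(features))  # [(0, 'prompt_labels')]
```

Running such a check on a handful of encoded examples would distinguish a dataset problem (only some rows affected) from a template or encoding problem (every row affected).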

hjh0119 · 2024-08-28T06:18:07Z

fixed in #1838

RBBB2010 · 2024-08-28T09:43:20Z

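Hello, thank you for your help with the data error.
I have now run into a new problem during DPO fine-tuning. I am using the script below on 2*A00 80GB GPUs, but fine-tuning runs out of memory (OOM) as soon as it starts. I have already followed the official documentation to use device map (removing NPROC_PER_NODE) and have also tried DeepSpeed; both still exhaust GPU memory. Is there any way to resolve this?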

CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=29500 \
swift rlhf \
--rlhf_type dpo \
--model_type internlm-xcomposer2_5-7b-chat \
--model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
--ref_model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
--output_dir /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/dpo \
--dataset /swift/data/dpo_demo.json \
--dtype bf16 \
--beta 0.1 \
--sft_beta 0.1 \
--sft_type lora \
--init_lora_weights 'pissa' \
--lora_rank 128 \
--lora_alpha 256 \
--lora_dropout_p 0.1 \
--lora_target_modules DEFAULT \
--use_flash_attn true \
--num_train_epochs 3 \
--gradient_checkpointing true \
--batch_size 1 \
--learning_rate 1e-6 \
--gradient_accumulation_steps 16 \
--warmup_ratio 0.01 \
--save_total_limit 20 \
--max_length 10240 \
--save_steps 20 \
--eval_steps 20 \
--model_kwargs '{"hd_num": 16}'

tastelikefeet · 2024-08-28T09:45:06Z

Try lowering --model_kwargs '{"hd_num": 16}'.

RBBB2010 · 2024-08-29T02:10:43Z

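Yes, lowering hd_num does solve the problem, but my images are fairly high-resolution, so lowering hd_num may hurt training quality.
Is there any other approach you could suggest? The GPU memory usage is indeed somewhat higher than what I calculated.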
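
As a rough reference point for the calculation mentioned above, the static part of the footprint (weights plus LoRA optimizer state) can be sketched as below. Everything here is an assumption rather than a measurement: the base model is taken as a nominal 7e9 parameters, the policy and reference models are assumed to be loaded as two separate bf16 copies, LoRA weights, gradients, and both Adam moments are assumed to be fp32, and the adapted layer shapes and the count of 32 layers are taken from the size-mismatch log posted later in this thread. Any other modules selected by --lora_target_modules DEFAULT, and all activation memory, are not counted.

```python
GIB = 1024 ** 3

# (in_features, out_features) of the adapted linear layers in one decoder
# layer, as they appear in the size-mismatch log in this thread:
# wqkv, wo, w1, w3, w2.
LAYER_SHAPES = [
    (4096, 6144),
    (4096, 4096),
    (4096, 14336),
    (4096, 14336),
    (14336, 4096),
]


def lora_param_count(rank, layer_shapes, num_layers):
    # Each adapted Linear(in, out) adds A: [rank, in] and B: [out, rank].
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in layer_shapes)
    return num_layers * per_layer


def estimate_static_gib(base_params, rank, num_layers=32, load_ref_model=True):
    bf16_bytes, fp32_bytes = 2, 4
    copies = 2 if load_ref_model else 1  # policy model + reference model
    weight_bytes = base_params * bf16_bytes * copies
    lora = lora_param_count(rank, LAYER_SHAPES, num_layers)
    # Assumed fp32: LoRA weights + gradients + two Adam moment buffers.
    lora_bytes = lora * fp32_bytes * 4
    return (weight_bytes + lora_bytes) / GIB


print(lora_param_count(128, LAYER_SHAPES, 32))  # 301989888
print(round(estimate_static_gib(7e9, 128), 2))
```

Under these assumptions the static part comes to roughly 30 GiB in total across devices, so a gap between such an estimate and observed usage would plausibly come from activations, which grow with --max_length and hd_num and are not modeled here.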

RBBB2010 · 2024-08-29T17:58:38Z

Hello, after pulling the latest updated branch, DPO training fails with the error below once it reaches a certain point. What is causing this?
Train: 6%|█████████▋ | 40/618 [44:27<10:42:31, 66.70s/it]
File "/home/star/miniconda3/envs/zailiu_dpo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
size mismatch for base_model.model.model.layers.0.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).

The output continues at length, ending with:
size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
size mismatch for base_model.model.model.layers.31.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).

RBBB2010 · 2024-08-30T02:25:00Z

After rolling back to the previous version and manually applying the changes from #1838, training runs normally.

RBBB2010 closed this as completed Aug 27, 2024

RBBB2010 reopened this Aug 27, 2024

hjh0119 self-assigned this Aug 27, 2024

hjh0119 added the bug label Aug 27, 2024

hjh0119 mentioned this issue Aug 28, 2024

fix internlm-xcomposer rlhf #1838

Merged

4 tasks

hjh0119 closed this as completed Aug 28, 2024
