Data error when running DPO training with internlm-xcomposer2_5-7b-chat #1831

Closed
RBBB2010 opened this issue Aug 27, 2024 · 6 comments
Labels: bug (Something isn't working)

RBBB2010 commented Aug 27, 2024

First of all, thank you very much for the convenience that SWIFT brings!

I am running DPO training on a model that I LoRA-SFT'd and merged myself. My dataset follows the RLHF data format exactly, but collate_fn raises the following error:

Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:20<00:00, 4.28 examples/s]
Train: 0%| | 0/6 [00:00<?, ?it/s]

[rank1]: Original Traceback (most recent call last):
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

I checked my own data for problems, but the same error kept appearing. I then switched to the public dataset via --dataset rlaif-v#1000, and collate_fn still raised the same error:
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in <listcomp>
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)

Looking more closely, ex[k] turns out to be NoneType when k = prompt_labels. What could be causing this?
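
For context, the failure mode is easy to reproduce outside SWIFT: building a tensor from a None value raises exactly this TypeError on the torch build shown in the traceback. Below is a minimal illustrative sketch; the feature dicts and the guard are made up for illustration only and are not the fix that was later applied in SWIFT.

import torch

# Two fake DPO features; the second one has prompt_labels = None, which is
# what was observed above for some examples.
features = [
    {"prompt_input_ids": [1, 2, 3], "prompt_labels": [1, 2, 3]},
    {"prompt_input_ids": [4, 5, 6], "prompt_labels": None},
]

k, dtype = "prompt_labels", torch.int64
try:
    # Same construction as swift/trainers/utils.py line 208 in the traceback.
    to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
except TypeError as err:
    print(err)  # e.g. "an integer is required (got type NoneType)"

# Illustrative (unofficial) guard: drop examples whose field is None before
# padding, so the collator only ever sees well-formed integer lists.
clean = [ex for ex in features if ex.get(k) is not None]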

My shell script is as follows:
swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --dataset rlaif-v#90 \
    --dtype bf16 \
    --beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --use_flash_attn true \
    --num_train_epochs 4 \
    --gradient_checkpointing true \
    --batch_size 2

RBBB2010 reopened this on Aug 27, 2024
hjh0119 self-assigned this on Aug 27, 2024
hjh0119 added the bug label on Aug 27, 2024
hjh0119 (Collaborator) commented Aug 28, 2024

fixed in #1838

hjh0119 closed this as completed on Aug 28, 2024
RBBB2010 (Author) commented

Hello, thank you for the help with the data error.
I have now hit a new problem during DPO fine-tuning. I use the script below on 2*A00 80GB GPUs, but training OOMs as soon as it starts. Following the official docs I tried device_map (removing NPROC_PER_NODE) and also tried DeepSpeed, but both run out of GPU memory. Is there any way to work around this?

CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=29500 \
swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/internlm-xcomposer2_5-7b-chat/v8/checkpoint-638-merged \
    --output_dir /InternLM_Xcomposer2d5/output/swift_finetune_lora_2/dpo \
    --dataset /swift/data/dpo_demo.json \
    --dtype bf16 \
    --beta 0.1 \
    --sft_beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --lora_rank 128 \
    --lora_alpha 256 \
    --lora_dropout_p 0.1 \
    --lora_target_modules DEFAULT \
    --use_flash_attn true \
    --num_train_epochs 3 \
    --gradient_checkpointing true \
    --batch_size 1 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 16 \
    --warmup_ratio 0.01 \
    --save_total_limit 20 \
    --max_length 10240 \
    --save_steps 20 \
    --eval_steps 20 \
    --model_kwargs '{"hd_num": 16}'
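
As a rough sanity check on why the footprint is larger than a naive single-model estimate, here is a hedged back-of-envelope sketch. The parameter counts are assumptions (a ~7B-parameter language backbone plus rank-128 LoRA adapters on the default target modules); the real number additionally includes the vision encoder, activations at max_length 10240 with hd_num 16, DPO's combined chosen/rejected forward pass, and CUDA/framework overhead.

# Hedged back-of-envelope GPU memory estimate for LoRA DPO on a ~7B model.
# All counts are assumptions for illustration; they ignore the vision tower,
# activations, and allocator overhead, which is why real usage is higher.
GiB = 1024 ** 3

base_params = 7e9          # assumed ~7B backbone parameters
bf16 = 2                   # bytes per parameter in bf16

policy = base_params * bf16 / GiB      # frozen base weights of the trained policy
reference = base_params * bf16 / GiB   # frozen DPO reference model (also resident)

lora_params = 3e8          # rough count for rank-128 adapters on all default modules
# LoRA weights (bf16) + fp32 gradients + AdamW first/second moments (fp32 each)
lora = lora_params * (2 + 4 + 4 + 4) / GiB

print(f"policy ~{policy:.1f} GiB, reference ~{reference:.1f} GiB, LoRA+optimizer ~{lora:.1f} GiB")
# On top of this, DPO trainers typically run chosen and rejected responses
# through the model together, so activation memory scales with roughly
# 2 x batch_size x max_length, even with gradient checkpointing enabled.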

tastelikefeet (Collaborator) commented

Try lowering --model_kwargs '{"hd_num": 16}'.

RBBB2010 (Author) commented

Yes, lowering hd_num does fix the problem, but my images are fairly high-resolution, so lowering hd_num may hurt training quality.
Is there any other option? The actual GPU memory usage is indeed somewhat higher than what I calculated...

RBBB2010 (Author) commented Aug 29, 2024

Hello, after pulling the latest branch with the update, DPO training now errors out partway through. What could be causing this?
Train: 6%|█████████▋ | 40/618 [44:27<10:42:31, 66.70s/it]
File "/home/star/miniconda3/envs/zailiu_dpo/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
size mismatch for base_model.model.model.layers.0.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.0.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
size mismatch for base_model.model.model.layers.0.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.1.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).

The list continues for a long time, ending with:
size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.attention.wqkv.lora_B.initial_model.weight: copying a param with shape torch.Size([6144, 128]) from checkpoint, the shape in current model is torch.Size([6144, 256]).
size mismatch for base_model.model.model.layers.31.attention.wo.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.attention.wo.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.feed_forward.w1.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 4096]) from checkpoint, the shape in current model is torch.Size([256, 4096]).
size mismatch for base_model.model.model.layers.31.feed_forward.w3.lora_B.initial_model.weight: copying a param with shape torch.Size([14336, 128]) from checkpoint, the shape in current model is torch.Size([14336, 256]).
size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_A.initial_model.weight: copying a param with shape torch.Size([128, 14336]) from checkpoint, the shape in current model is torch.Size([256, 14336]).
size mismatch for base_model.model.model.layers.31.feed_forward.w2.lora_B.initial_model.weight: copying a param with shape torch.Size([4096, 128]) from checkpoint, the shape in current model is torch.Size([4096, 256]).
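
For what it's worth, the shapes in these messages encode the LoRA rank: in PEFT, lora_A.weight has shape (r, in_features) and lora_B.weight has shape (out_features, r), so [128, 4096] in the checkpoint versus [256, 4096] in the rebuilt model means the saved adapter and the model being reloaded disagree on the effective rank (128 vs 256). A small diagnostic sketch, with a placeholder checkpoint path, to compare what was saved against what the command requested:

import json
from pathlib import Path

# Placeholder path (assumed): point this at the checkpoint directory the
# trainer tries to reload, i.e. the one containing adapter_config.json.
ckpt_dir = Path("/InternLM_Xcomposer2d5/output/swift_finetune_lora_2/dpo/checkpoint-xx")

cfg = json.loads((ckpt_dir / "adapter_config.json").read_text())
print("saved r           =", cfg.get("r"))
print("saved lora_alpha  =", cfg.get("lora_alpha"))
print("saved init method =", cfg.get("init_lora_weights"))

# The training command above used --lora_rank 128 --lora_alpha 256 with
# --init_lora_weights 'pissa'. Since lora_A.weight has shape (r, in_features),
# an adapter saved with r=128 loaded into a model rebuilt with an effective
# rank of 256 produces exactly the [128, ...] vs [256, ...] mismatches above.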

RBBB2010 (Author) commented

After rolling back to the previous version and manually applying the change from #1838, training runs normally.
