DPO training with internlm-xcomposer2_5-7b-chat fails with a data error #1831
Fixed in #1838.
Hello, and thank you for your help with the data error. CUDA_VISIBLE_DEVICES=0,1
Try lowering this value: --model_kwargs '{"hd_num": 16}'
Yes, lowering hd_num does solve the problem, but my images are fairly high-resolution to begin with, so lowering hd_num may hurt training quality.
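Hello, after pulling the latest updated branch, DPO training runs up to a certain point and then raises an error. What causes this? The output goes on for a long time after that, all the way until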
After reverting to the previous version and manually applying the changes from #1838, training runs normally.
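First of all, many thanks to SWIFT for making this so convenient!
I am running DPO training on a model that I fine-tuned with LoRA SFT and then merged. I built my own dataset strictly following the RLHF data format, but collate_fn raises the following error: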
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 86/86 [00:20<00:00, 4.28 examples/s]
Train: 0%| | 0/6 [00:00<?, ?it/s]
[rank1]: Original Traceback (most recent call last):
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank1]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank1]: File "/home/star/miniconda3/envs/dpo/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)
I checked my own data for problems, but the error persists. I then switched to a public dataset (--dataset rlaif-v#1000 \), and collate_fn still raises the same error:
[rank1]: return self.collate_fn(data)
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in new_call
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: File "/home/star/swift/swift/trainers/utils.py", line 208, in
[rank1]: to_pad = [torch.tensor(ex[k], dtype=dtype) for ex in features]
[rank1]: TypeError: an integer is required (got type NoneType)
On closer inspection, ex[k] is None when k=prompt_labels. What causes this?
My sh script is as follows:
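To narrow down which fields are None before the batch reaches torch.tensor, a small diagnostic like the one below may help. It is only a sketch: find_none_fields is a hypothetical helper, not part of SWIFT, and the prompt_input_ids key and the sample values are made up for illustration. Only prompt_labels comes from the traceback above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def find_none_fields(features: List[Dict[str, Any]]) -> Dict[str, List[int]]:
    """Map each key to the indices of the examples whose value for it is None."""
    missing: Dict[str, List[int]] = defaultdict(list)
    for idx, example in enumerate(features):
        for key, value in example.items():
            if value is None:
                missing[key].append(idx)
    return dict(missing)


# Toy batch shaped like the `features` list in the traceback.
# The keys other than prompt_labels and all values are assumptions.
batch = [
    {"prompt_input_ids": [1, 2, 3], "prompt_labels": None},
    {"prompt_input_ids": [4, 5], "prompt_labels": [-100, -100]},
]
print(find_none_fields(batch))  # {'prompt_labels': [0]}
```

Calling this on the features list just before the failing line in swift/trainers/utils.py would show whether prompt_labels is None for every example or only for some, which should help tell a template problem from a data problem.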
swift rlhf \
    --rlhf_type dpo \
    --model_type internlm-xcomposer2_5-7b-chat \
    --model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --ref_model_id_or_path /InternLM_Xcomposer2d5/output/internlm-xcomposer2_5-7b-chat/v1/checkpoint-638-merged \
    --dataset rlaif-v#90 \
    --dtype bf16 \
    --beta 0.1 \
    --sft_type lora \
    --init_lora_weights 'pissa' \
    --use_flash_attn true \
    --num_train_epochs 4 \
    --gradient_checkpointing true \
    --batch_size 2 \