
Training stops for KTO after model loads into memory. #1938

Open
Aunali321 opened this issue Sep 4, 2024 · 5 comments

Comments

@Aunali321

Describe the bug
What the bug is and how to reproduce it, ideally with screenshots.

The process stops after loading the model into memory and processing the dataset. I also tried another dataset that worked before (15-25 days ago), but it is not working now; this same configuration worked 15-25 days ago.
I also tried using trl==0.9.6, but I had the same issues. I also tried switching servers between different vendors and using H100s instead of A100s.

Training arguments:

USE_HF=1 \
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
    --rlhf_type kto \
    --model_type llama3-70b-instruct \
    --model_id_or_path ~/models/llama3-70b-instruct \
    --beta 0.1 \
    --desirable_weight 1.0 \
    --undesirable_weight 1.0 \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype AUTO \
    --output_dir output \
    --dataset ~/rlhf/stage3/small_refusals_kto.jsonl \
    --num_train_epochs 1 \
    --max_length 8192 \
    --check_dataset_strategy warning \
    --lora_rank 32 \
    --lora_alpha 64 \
    --lora_dropout_p 0.00 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.0 \
    --learning_rate 2e-4 \
    --gradient_accumulation_steps 2 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true

Logs:
Had to use Pastebin because of the GitHub issue body limit.
Pastebin.

Your hardware and system info
Write your system info here, e.g. CUDA version, OS, GPU model, and torch version.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

GPUs: 2xA100 from Massed Compute

Additional context
Add any other context about the problem here.

@tastelikefeet
Collaborator

This looks like a sudden process death; I do think there is a memory problem. Could you please observe the GPU memory usage while running this training?
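
For reference, a minimal way to watch per-GPU memory while the training runs (a sketch, assuming nvidia-smi is available on the machine):

# log per-GPU memory use every second until the process dies
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 1 | tee gpu_mem.log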

@Aunali321
Author

I had the same config with the same dataset and model, and it worked before. But I will check.

@Aunali321
Author

Here is my GPU usage at the crash point:
GPU 1: 71971 MiB
GPU 2: 71975 MiB

After the crash, both drop to 1 MiB.
Before the crash, memory kept filling up while the model was loading.

@Aunali321
Author

It also happens when the model is quantized to 8-bit.
GPU 1: 35301 MiB
GPU 2: 43507 MiB

There is ~77 GB of free memory.
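
(For reference, a hypothetical sketch of how an 8-bit bitsandbytes run would differ from the full command above, assuming ms-swift's --quantization_bit flag; this is not my exact command, and the flag name may vary by version:)

swift rlhf \
    --rlhf_type kto \
    --quantization_bit 8 \
    ... (all remaining arguments unchanged from the command above)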

@Aunali321
Author

Aunali321 commented Sep 6, 2024

Okay, so the training works as expected on Azure servers, but I see these issues on TensorDock and Massed Compute. All the servers had 2x A100 80GB.
