
Training stops for KTO after model loads into memory. #1938

Open
Aunali321 opened this issue Sep 4, 2024 · 5 comments

Comments

@Aunali321

Describe the bug
What the bug is and how to reproduce it, ideally with screenshots.

The process stops after loading the model into memory and processing the dataset. I also tried another dataset that worked before (15-25 days ago), but it is not working now; this same configuration worked 15-25 days ago.
I also tried using trl==0.9.6, but I had the same issues. I also tried switching servers between different vendors and using H100s instead of A100s.

Training arguments:

USE_HF=1 \
CUDA_VISIBLE_DEVICES=0,1 \
swift rlhf \
    --rlhf_type kto \
    --model_type llama3-70b-instruct \
    --model_id_or_path ~/models/llama3-70b-instruct \
    --beta 0.1 \
    --desirable_weight 1.0 \
    --undesirable_weight 1.0 \
    --model_revision master \
    --sft_type lora \
    --tuner_backend peft \
    --template_type AUTO \
    --dtype AUTO \
    --output_dir output \
    --dataset ~/rlhf/stage3/small_refusals_kto.jsonl \
    --num_train_epochs 1 \
    --max_length 8192 \
    --check_dataset_strategy warning \
    --lora_rank 32 \
    --lora_alpha 64 \
    --lora_dropout_p 0.00 \
    --lora_target_modules DEFAULT \
    --gradient_checkpointing true \
    --batch_size 1 \
    --weight_decay 0.0 \
    --learning_rate 2e-4 \
    --gradient_accumulation_steps 2 \
    --max_grad_norm 0.5 \
    --warmup_ratio 0.03 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 10 \
    --use_flash_attn true

Logs:
Had to use Pastebin because of the GitHub issue body limit.
Pastebin.

Your hardware and system info
Write your system info here, e.g. CUDA version, OS, GPU model, and torch version.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

GPUs: 2xA100 from Massed Compute

Additional context
Add any other context about the problem here.

@tastelikefeet
Collaborator

This looks like a sudden process death; I do think there is a memory problem. Could you please observe the GPU memory usage while running this training?
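
For reference, a minimal way to watch per-GPU memory while the training runs (a sketch, assuming nvidia-smi is available on the machine):

# log per-GPU memory use every second until the process dies
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total --format=csv -l 1 | tee gpu_mem.log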

@Aunali321
Author

I had the same config with the same dataset and model, and it worked before. But I will check.

@Aunali321
Author

Here is my GPU usage at the crash point:
GPU 1: 71971 MiB
GPU 2: 71975 MiB

After the crash, both drop to 1 MiB.
Before the crash, memory kept filling up while the model was loading.

@Aunali321
Author

It also happens when the model is quantized to 8-bit.
GPU 1: 35301 MiB
GPU 2: 43507 MiB

There is ~77 GB of free memory.
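
(For reference, a hypothetical sketch of how an 8-bit bitsandbytes run would differ from the full command above, assuming ms-swift's --quantization_bit flag; this is not my exact command, and the flag name may vary by version:)

swift rlhf \
    --rlhf_type kto \
    --quantization_bit 8 \
    ... (all remaining arguments unchanged from the command above)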

@Aunali321
Author

Aunali321 commented Sep 6, 2024

Okay, so the training works as expected on Azure servers, but I see these issues on TensorDock and Massed Compute. All the servers had 2x A100 80GB.
