Full-parameter fine-tuning fails to train #842

Open
orderer0001 opened this issue May 22, 2024 · 1 comment

Comments

@orderer0001

(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
    --model_name_or_path /data/guihunmodel8.8B \
    --dataset_path /data/projects/lmflow/case_report_data \
    --output_model_path /data/projects/lmflow/guihun_fintune_model
[2024-05-22 15:23:02,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:05,346] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-22 15:23:05,346] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/lmflow_train/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /data/guihunmodel8.8B --trust_remote_code 0 --dataset_path /data/projects/lmflow/case_report_data --output_dir /data/projects/lmflow/guihun_fintune_model --overwrite_output_dir --conversation_template llama2 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-05-22 15:23:07,178] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-05-22 15:23:08,889] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-05-22 15:23:08,889] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-05-22 15:23:08,889] [INFO] [launch.py:163:main] dist_world_size=3
[2024-05-22 15:23:08,889] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2024-05-22 15:23:12,326] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,878] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-05-22 15:23:15,313] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,313] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,317] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,318] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,368] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
[WARNING|logging.py:314] 2024-05-22 15:23:18,032 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,186 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-22 15:23:20,000] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 8.03B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.06s/it]
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu121/cpu_adam...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/lmflow_train/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -march=native -fopenmp -D__AVX512__ -D__DISABLE_CUDA__ -c /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286750555038452 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286848306655884 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.370280504226685 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-05-22 15:36:23,345] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806929
[2024-05-22 15:36:23,707] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806930
[2024-05-22 15:36:28,465] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806931
[2024-05-22 15:36:33,281] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/lmflow_train/bin/python', '-u', 'examples/finetune.py', '--local_rank=2', '--model_name_or_path', '/data/guihunmodel8.8B', '--trust_remote_code', '0', '--dataset_path', '/data/projects/lmflow/case_report_data', '--output_dir', '/data/projects/lmflow/guihun_fintune_model', '--overwrite_output_dir', '--conversation_template', 'llama2', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9

@wheresmyhair
Collaborator

Thanks for your interest in LMFlow! It seems that your system-installed CUDA and the CUDA version your torch build was compiled against do not match.
You may refer to: microsoft/DeepSpeed#3613
Feel free to leave a comment if you need further help.
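A quick way to confirm the mismatch (a minimal sketch, not part of the original reply; it assumes torch is installed and nvcc is on PATH):

```python
# Minimal sketch: compare the CUDA version torch was built against with the
# system CUDA toolkit. DeepSpeed JIT-compiles ops such as cpu_adam against the
# system toolkit, so a version mismatch with torch can break those builds.
import subprocess

import torch

print("torch CUDA:", torch.version.cuda)  # e.g. "12.1" for a cu121 wheel
print("CUDA available:", torch.cuda.is_available())

# System toolkit version as reported by nvcc (requires the CUDA toolkit on PATH).
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```

DeepSpeed's own `ds_report` command prints similar torch/CUDA compatibility information for each op and is often the fastest way to spot the mismatch.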
