Full-parameter fine-tuning fails to train #842

Open
orderer0001 opened this issue May 22, 2024 · 1 comment

Comments

@orderer0001

(lmflow_train) root@duxact:/data/projects/lmflow/LMFlow# ./scripts/run_finetune.sh \
    --model_name_or_path /data/guihunmodel8.8B \
    --dataset_path /data/projects/lmflow/case_report_data \
    --output_model_path /data/projects/lmflow/guihun_fintune_model
[2024-05-22 15:23:02,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:05,346] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-05-22 15:23:05,346] [INFO] [runner.py:555:main] cmd = /root/anaconda3/envs/lmflow_train/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMl19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path /data/guihunmodel8.8B --trust_remote_code 0 --dataset_path /data/projects/lmflow/case_report_data --output_dir /data/projects/lmflow/guihun_fintune_model --overwrite_output_dir --conversation_template llama2 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-05-22 15:23:07,178] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_P2P_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=1
[2024-05-22 15:23:08,889] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2]}
[2024-05-22 15:23:08,889] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=3, node_rank=0
[2024-05-22 15:23:08,889] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2]})
[2024-05-22 15:23:08,889] [INFO] [launch.py:163:main] dist_world_size=3
[2024-05-22 15:23:08,889] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2
[2024-05-22 15:23:12,326] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,845] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-05-22 15:23:12,878] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-05-22 15:23:15,313] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,313] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,317] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,318] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-22 15:23:15,368] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-22 15:23:15,368] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
05/22/2024 15:23:16 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
[WARNING|logging.py:314] 2024-05-22 15:23:18,032 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,186 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[WARNING|logging.py:314] 2024-05-22 15:23:18,236 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[2024-05-22 15:23:20,000] [INFO] [partition_parameters.py:326:__exit__] finished initializing model with 8.03B parameters
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.00s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 5/5 [00:15<00:00, 3.06s/it]
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py39_cu121/cpu_adam...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /root/.cache/torch_extensions/py39_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -I/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/includes -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/TH -isystem /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/include/THC -isystem /root/anaconda3/envs/lmflow_train/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -march=native -fopenmp -D__AVX512__ -D__DISABLE_CUDA__ -c /root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o
[2/2] c++ cpu_adam.o -shared -fopenmp -L/root/anaconda3/envs/lmflow_train/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286750555038452 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.286848306655884 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 19.370280504226685 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
[2024-05-22 15:36:23,345] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806929
[2024-05-22 15:36:23,707] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806930
[2024-05-22 15:36:28,465] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 806931
[2024-05-22 15:36:33,281] [ERROR] [launch.py:321:sigkill_handler] ['/root/anaconda3/envs/lmflow_train/bin/python', '-u', 'examples/finetune.py', '--local_rank=2', '--model_name_or_path', '/data/guihunmodel8.8B', '--trust_remote_code', '0', '--dataset_path', '/data/projects/lmflow/case_report_data', '--output_dir', '/data/projects/lmflow/guihun_fintune_model', '--overwrite_output_dir', '--conversation_template', 'llama2', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = -9

@wheresmyhair
Collaborator

Thanks for your interest in LMFlow! It seems that your system-installed CUDA and the CUDA version your torch build was compiled against do not match.
You may refer to: microsoft/DeepSpeed#3613
Feel free to leave a comment if you need further help.
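A quick way to confirm the mismatch (a minimal sketch, not part of the original reply; it assumes torch is installed and nvcc is on PATH):

```python
# Minimal sketch: compare the CUDA version torch was built against with the
# system CUDA toolkit. DeepSpeed JIT-compiles ops such as cpu_adam against the
# system toolkit, so a version mismatch with torch can break those builds.
import subprocess

import torch

print("torch CUDA:", torch.version.cuda)  # e.g. "12.1" for a cu121 wheel
print("CUDA available:", torch.cuda.is_available())

# System toolkit version as reported by nvcc (requires the CUDA toolkit on PATH).
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
```

DeepSpeed's own `ds_report` command prints similar torch/CUDA compatibility information for each op and is often the fastest way to spot the mismatch.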
