multi-gpu full para train error #855

Open

tankeui opened this issue Jun 12, 2024 · 2 comments

tankeui commented Jun 12, 2024

[2024-06-12 19:36:07,800] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:09,648] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-06-12 19:36:09,648] [INFO] [runner.py:568:main] cmd = anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None LMFlow/examples/finetune.py --model_name_or_path huggingface/hub/Meta-Llama-3-70B --trust_remote_code 0 --dataset_path LMFlow/data/alpaca/train_conversation --output_dir output_models/finetune --overwrite_output_dir --conversation_template llama3 --num_train_epochs 0.01 --learning_rate 2e-5 --disable_group_texts 1 --block_size 256 --per_device_train_batch_size 1 --deepspeed LMFlow/configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2024-06-12 19:36:11,661] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:12,366] [INFO] [launch.py:138:main] 0 TORCH_NCCL_BLOCKING_WAIT=1
[2024-06-12 19:36:12,366] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-06-12 19:36:12,366] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-06-12 19:36:12,366] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-06-12 19:36:12,366] [INFO] [launch.py:163:main] dist_world_size=2
[2024-06-12 19:36:12,366] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-06-12 19:36:12,419] [INFO] [launch.py:253:main] process 40472 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:12,466] [INFO] [launch.py:253:main] process 40473 spawned with command: ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1']
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-06-12 19:36:17,298] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-06-12 19:36:20,965] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-06-12 19:36:20,965] [INFO] [comm.py:637:init_distributed] cdb=None
[rank1]: Traceback (most recent call last):
[rank1]: File "LMFlow/examples/finetune.py", line 61, in
[rank1]: main()
[rank1]: File "LMFlow/examples/finetune.py", line 44, in main
[rank1]: model_args, data_args, pipeline_args = parser.parse_args_into_dataclasses()
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/hf_argparser.py", line 339, in parse_args_into_dataclasses
[rank1]: obj = dtype(**inputs)
[rank1]: File "", line 135, in init
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 1641, in post_init
[rank1]: and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2149, in device
[rank1]: return self._setup_devices
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/utils/generic.py", line 59, in get
[rank1]: cached = self.fget(obj)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/training_args.py", line 2077, in _setup_devices
[rank1]: self.distributed_state = PartialState(timeout=timedelta(seconds=self.ddp_timeout))
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 280, in init
[rank1]: self.set_device()
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/accelerate/state.py", line 790, in set_device
[rank1]: torch.cuda.set_device(self.device)
[rank1]: File "anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/cuda/init.py", line 399, in set_device
[rank1]: torch._C._cuda_setDevice(device)
[rank1]: RuntimeError: CUDA error: invalid device ordinal
[rank1]: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

06/12/2024 19:36:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=None' instead.
warnings.warn(
[2024-06-12 19:36:22,477] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40472
[2024-06-12 19:36:22,531] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 40473
[2024-06-12 19:36:22,531] [ERROR] [launch.py:322:sigkill_handler] ['anaconda3/envs/lmflow/bin/python', '-u', 'LMFlow/examples/finetune.py', '--local_rank=1', '--model_name_or_path', 'huggingface/hub/Meta-Llama-3-70B', '--trust_remote_code', '0', '--dataset_path', 'LMFlow/data/alpaca/train_conversation', '--output_dir', 'output_models/finetune', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '256', '--per_device_train_batch_size', '1', '--deepspeed', 'LMFlow/configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1

```bash
#!/bin/bash
# Please run this script under ${project_id} in project directory of
# https://github.com/shizhediao/llm-ft
#
# COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

export TORCH_SHOW_CPP_STACKTRACES=1
export TORCH_NCCL_BLOCKING_WAIT=1
export CUDA_LAUNCH_BLOCKING=1
export TORCH_USE_CUDA_DSA=1

# Parses arguments
model_name_or_path=huggingface/hub/Meta-Llama-3-70B
dataset_path=LMFlow/data/alpaca/train_conversation
output_dir=output_models/finetune
deepspeed_args="--num_gpus=2 --master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 256 \
    --per_device_train_batch_size 1 \
    --deepspeed LMFlow/configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
  | tee ${log_dir}/train.log \
  2> ${log_dir}/train.err
```

How can I fix this problem?
wheresmyhair (Collaborator)

It seems like a CUDA device mismatch issue.

[rank1]: RuntimeError: CUDA error: invalid device ordinal

I guess you've set CUDA_VISIBLE_DEVICES somewhere else accidentally, which leads to a mismatch. Maybe look at:
https://stackoverflow.com/questions/64334033/how-to-solve-runtimeerror-cuda-error-invalid-device-ordinal
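As a quick sanity check (generic commands, not LMFlow-specific), you can compare what each process is allowed to see against the number of ranks the launcher spawns:

```bash
# Each local rank must map to a visible GPU ordinal, so two ranks need at
# least two visible devices; "invalid device ordinal" means rank >= count.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"  # what the shell restricts
nvidia-smi -L                                                # physical GPUs and their ordinals
python -c "import torch; print(torch.cuda.device_count())"   # devices PyTorch can see
```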
Or, try changing:

deepspeed_args="--num_gpus=2 --master_port=11000" 

to

deepspeed_args="--include localhost:x,x --master_port=11000"

tankeui (Author) commented Jun 12, 2024

Thanks, it solved my problem.
