[BUG] Mapping the dataset fails when num_proc is set to 2 or 4 #840

Open
nicosouth opened this issue May 22, 2024 · 8 comments

Comments

@nicosouth

Running tokenizer on dataset (num_proc=2): 0%| | 0/666 [00:00<?, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 61, in <module>
[rank0]: main()
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 57, in main
[rank0]: tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/pipeline/finetuner.py", line 237, in tune
[rank0]: tokenized_dataset = model.tokenize(dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/models/hf_decoder_model.py", line 622, in tokenize
[rank0]: tokenized_datasets = raw_datasets.map(
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/datasets/dataset.py", line 371, in map
[rank0]: mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3189, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in <listcomp>
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
[rank0]: raise self._value
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
[rank0]: put(task)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
[rank0]: self._send_bytes(_ForkingPickler.dumps(obj))
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
[rank0]: cls(buf, protocol, *args, **kwds).dump(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
[rank0]: StockPickler.dump(self, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 487, in dump
[rank0]: self.save(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 886, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
[rank0]: StockPickler.save_dict(pickler, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 971, in save_dict
[rank0]: self._batch_setitems(obj.items())
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 997, in _batch_setitems
[rank0]: save(v)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
[rank0]: pickler.save_reduce(_create_function, (obj.__code__,
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 692, in save_reduce
[rank0]: save(args)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1226, in save_cell
[rank0]: f = obj.cell_contents
[rank0]: ValueError: Cell is empty
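
For readers unfamiliar with the final error: the traceback ends in dill's save_cell reading obj.cell_contents while pickling a function for the worker processes. In CPython, that attribute raises "ValueError: Cell is empty" when a closure variable has been captured but not yet assigned. The sketch below shows this using only the standard library. The function names are hypothetical, and this is neither LMFlow's code nor a confirmed reproduction of this bug; it only illustrates what an empty closure cell is.

```python
def capture_before_assignment():
    """Inspect a closure cell before the variable it holds has been assigned."""

    def inner(text):
        # `helper` is a free variable here, so Python stores it in a closure cell.
        return helper(text)

    # `helper` has not been assigned yet: the cell exists but is still empty.
    cell = inner.__closure__[0]
    try:
        cell.cell_contents
        message = None
    except ValueError as exc:
        message = str(exc)

    def helper(text):  # the cell is filled only from this point on
        return text.split()

    return message, inner


message, tokenize = capture_before_assignment()
print(message)            # the error text raised while the cell was empty
print(tokenize("a b c"))  # works once `helper` has been assigned
```

Anything that pickles such a function while a cell is still empty, as dill does when datasets sends the map function to worker processes, can run into the same ValueError.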

@wheresmyhair
Collaborator

Thanks for your interest in LMFlow! Could you please provide your .sh script? Also, what kind of dataset are you using?

@nicosouth
Author

OK, this is my script. I only added "--preprocessing_num_workers 4".

"""""""""
model_name_or_path=/home/llm/model/Qwen1.5-1.8B
dataset_path=/home/llm/data/text_test/
output_dir=/home/llm/model/output_models/finetune
conversation_template=empty
trust_remote_code=True

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

deepspeed --include="localhost:5" --master_port=11999 \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --conversation_template ${conversation_template} \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero0.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 4 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
"""""""""

I use the ShuSheng dataset, converted into the format required by LMFlow.

Thank you!

@wheresmyhair
Collaborator

> I use the ShuSheng dataset, converted into the format required by LMFlow.

What's the type of that dataset: text_only, text2text, or conversation?

@nicosouth
Author

> > I use the ShuSheng dataset, converted into the format required by LMFlow.
>
> What's the type of that dataset: text_only, text2text, or conversation?

It's text_only.

@wheresmyhair
Collaborator

> > > I use the ShuSheng dataset, converted into the format required by LMFlow.
> >
> > What's the type of that dataset: text_only, text2text, or conversation?
>
> It's text_only.

We have reproduced this bug and are working on a fix. For now, please fine-tune with --preprocessing_num_workers 1. Sorry for the inconvenience 🙏 If you have any other questions, please feel free to leave a comment.

@wheresmyhair added the "pending" label (Something isn't working) on May 22, 2024

@nicosouth
Author

Thank you for your contributions.

@wheresmyhair
Collaborator

> Thank you for your contributions.

FYI: We've located the bug, and the dev team needs to do a small-scale refactoring to fix it. We will do this ASAP. Sorry for the inconvenience 🙏

@wheresmyhair
Collaborator

> Thank you for your contributions.

FYI: The bug is fixed; please see #845 🤗

@wheresmyhair removed the "pending" label (Something isn't working) on May 31, 2024