[BUG] Setting num_proc to 2 or 4 when mapping the dataset raises an error. #840

nicosouth · 2024-05-22T03:26:39Z

Running tokenizer on dataset (num_proc=2): 0%| | 0/666 [00:00<?, ? examples/s]
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 61, in
[rank0]: main()
[rank0]: File "/data/mnt/LMFlow-20240514/examples/finetune.py", line 57, in main
[rank0]: tuned_model = finetuner.tune(model=model, dataset=dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/pipeline/finetuner.py", line 237, in tune
[rank0]: tokenized_dataset = model.tokenize(dataset)
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/models/hf_decoder_model.py", line 622, in tokenize
[rank0]: tokenized_datasets = raw_datasets.map(
[rank0]: File "/data/mnt/LMFlow-20240514/src/lmflow/datasets/dataset.py", line 371, in map
[rank0]: mapped_backend_dataset = self.backend_dataset.map(*args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3189, in map
[rank0]: for rank, done, content in iflatmap_unordered(
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in iflatmap_unordered
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 1394, in
[rank0]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 771, in get
[rank0]: raise self._value
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/pool.py", line 537, in _handle_tasks
[rank0]: put(task)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/connection.py", line 214, in send
[rank0]: self._send_bytes(_ForkingPickler.dumps(obj))
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/multiprocess/reduction.py", line 54, in dumps
[rank0]: cls(buf, protocol, *args, **kwds).dump(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 498, in dump
[rank0]: StockPickler.dump(self, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 487, in dump
[rank0]: self.save(obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 886, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
[rank0]: StockPickler.save_dict(pickler, obj)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 971, in save_dict
[rank0]: self._batch_setitems(obj.items())
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 997, in _batch_setitems
[rank0]: save(v)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1493, in save_function
[rank0]: pickler.save_reduce(_create_function, (obj.code,
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 692, in save_reduce
[rank0]: save(args)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 901, in save_tuple
[rank0]: save(element)
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/pickle.py", line 560, in save
[rank0]: f(self, obj) # Call unbound method with explicit self
[rank0]: File "/data/llmpt/anaconda3/envs/lmflow240514/lib/python3.9/site-packages/dill/_dill.py", line 1226, in save_cell
[rank0]: f = obj.cell_contents
[rank0]: ValueError: Cell is empty
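
For readers unfamiliar with the last frame: `dill` reads `obj.cell_contents` while pickling a closure, and CPython raises `ValueError: Cell is empty` when the captured variable has not been assigned yet. The sketch below reproduces only that final error in plain Python. The names `build_tokenize_fn`, `tokenize`, and `normalize` are hypothetical, and this is not LMFlow's code; the thread does not say which closure in LMFlow triggered it.

```python
def build_tokenize_fn():
    """Return an inner function plus the error seen when its closure cell is read too early."""

    def tokenize(text):
        # `normalize` is a free variable; it is bound in the enclosing scope further down.
        return normalize(text)

    # `normalize` has not been assigned yet, so the closure cell exists but is empty.
    cell = tokenize.__closure__[0]
    try:
        cell.cell_contents
        error_message = None
    except ValueError as exc:
        error_message = str(exc)

    def normalize(text):
        return text.strip().lower()

    # From here on the cell is filled and `tokenize` works normally.
    return tokenize, error_message


tokenize, error_message = build_tokenize_fn()
print(error_message)
print(tokenize("  Hello "))
```

Anything that serializes `tokenize` before `normalize` is bound would hit the same empty cell.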

wheresmyhair · 2024-05-22T09:02:47Z

Thanks for your interest in LMFlow! Could you please provide your .sh script? Also, what kind of dataset are you using?

nicosouth · 2024-05-22T09:43:52Z

OK, this is my script. I only added "--preprocessing_num_workers 4".

"""""""""
model_name_or_path=/home/llm/model/Qwen1.5-1.8B
dataset_path=/home/llm/data/text_test/
output_dir=/home/llm/model/output_models/finetune
conversation_template=empty
trust_remote_code=True

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

deepspeed --include="localhost:5" --master_port=11999 \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} \
    --conversation_template ${conversation_template} \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero0.json \
    --bf16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    --preprocessing_num_workers 4 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
"""""""""

I use the ShuSheng dataset, with the data converted into the format required by LMFlow.

Thank you!

wheresmyhair · 2024-05-22T11:07:45Z

> i use the ShuSheng dataset and convert data into the format required by lmflow.

What is the type of that dataset: text_only, text2text, or conversation?

nicosouth · 2024-05-22T11:28:32Z

> > i use the ShuSheng dataset and convert data into the format required by lmflow.
>
> What's the type of that dataset, is it text_only, text2text, or conversation?

It's text_only.

wheresmyhair · 2024-05-22T13:47:52Z

> > > i use the ShuSheng dataset and convert data into the format required by lmflow.
> >
> > What's the type of that dataset, is it text_only, text2text, or conversation?
>
> it's text_only.

We have reproduced this bug and are working on a fix. For now, please fine-tune with --preprocessing_num_workers 1. Sorry for the inconvenience 🙏 If you have any other questions, please feel free to leave a comment.
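
The traceback fails inside the pool's task dispatch (`_ForkingPickler.dumps`), that is, while the mapping function is being serialized for worker processes. A rough sketch of why a single worker presumably sidesteps that step is below. It uses the standard-library `pickle` as a stand-in for the `dill`-based pickler in the traceback, and `run_in_process`, `can_send_to_worker`, and `make_local_fn` are hypothetical names, not LMFlow or `datasets` APIs.

```python
import pickle


def run_in_process(fn, items):
    """Apply fn directly in the current process; nothing needs to be pickled."""
    return [fn(x) for x in items]


def can_send_to_worker(fn):
    """Return True if fn survives the pickling step needed to ship a task to a worker."""
    try:
        pickle.dumps(fn)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def make_local_fn():
    # A function defined inside another function; stdlib pickle cannot serialize it.
    def add_one(x):
        return x + 1

    return add_one


fn = make_local_fn()
results = run_in_process(fn, [1, 2, 3])  # works without any serialization
sendable = can_send_to_worker(fn)        # fails at the serialization step
print(results, sendable)
```

The function itself is fine in both cases; only the step that ships it to other processes fails, which is consistent with the workaround of using one preprocessing worker.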

nicosouth · 2024-05-24T03:00:16Z

thank you for your contributions

wheresmyhair · 2024-05-30T03:18:59Z

> thank you for your contributions

FYI: We've located the bug, and the dev team needs to do a small-scale refactoring to fix it. We will do so ASAP. Sorry for the inconvenience 🙏

wheresmyhair · 2024-05-31T02:10:48Z

> thank you for your contributions

FYI: Bug fixed, please see #845 🤗

wheresmyhair added the pending Something isn't working label May 22, 2024

wheresmyhair removed the pending Something isn't working label May 31, 2024
