Cannot train: kohya_ss on branch sd3-flux.1 on latest version causes training to stop and freeze #2795

bolli20000 · 2024-09-07T14:09:17Z

The training process always stops or freezes at epoch 1, so I cannot train. Does anybody have an idea? Please help.

Here is the log:

[Dataset 0]
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
To create a public link, set share=True in launch().
Loading config...
Start training Dreambooth...
Validating lr scheduler arguments...
Validating optimizer arguments...
Validating D:/Bilder/Project_AI/Train/model existence and writability... SUCCESS
Validating C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors existence... SUCCESS
Validating D:/Bilder/Project_AI/Train/BolliHotShots existence... SUCCESS
Folder 1_BHS Man: 1 repeats found
Folder 1_BHS Man: 51 images found
Folder 1_BHS Man: 51 * 1 = 51 steps
Regulatization factor: 1
Total steps: 51
Train batch size: 1
Gradient accumulation steps: 1
max_train_steps (51 / 1 / 1 * 200 * 1) = 10200
lr_warmup_steps = 0
Saving training config to D:/Bilder/Project_AI/Train/model\BolliHotsShots_Latest_Flux_20240907-155922.json...
Executing command: D:\train_flux\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 D:/train_flux/kohya_ss/sd-scripts/flux_train.py --config_file D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml
ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(
ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
abstract("xformers_flash::flash_fwd")
ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
abstract("xformers_flash::flash_bwd")
ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
_register_pytree_node(
Loading settings from D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml... train_util.py:4190
roject_AI/Train/model/config_dreambooth-20240907-155922 train_util.py:4209
Using DreamBooth method. flux_train.py:101
train_util.py:1803
image size from name of cache files train_util.py:1741
████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 5614.57it/s]
image size from cache files: 0/51 train_util.py:1748
directory D:\Bilder\Project_AI\Train\BolliHotShots\1_BHS Man contains 51 image files train_util.py:1750
train images with repeating. train_util.py:1844
train_util.py:1847
regularization images / 正則化画像が見つかりませんでした train_util.py:1852
config_util.py:570

                           [Subset 0 of Dataset 0]
                             image_dir: "D:\Bilder\Project_AI\Train\BolliHotShots\1_BHS Man"
                             image_count: 51
                             num_repeats: 1
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             caption_separator: ,
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             alpha_mask: False,
                             is_reg: False
                             class_tokens: BHS Man
                             caption_extension: .txt


INFO [Dataset 0] config_util.py:576
INFO loading image sizes. train_util.py:876

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 1826.90it/s]
INFO make buckets train_util.py:882
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / train_util.py:899
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / 各bucketの画像枚数（繰り返し回数を含む） train_util.py:928
INFO bucket 0: resolution (512, 768), count: 1 train_util.py:933
INFO bucket 1: resolution (768, 1152), count: 1 train_util.py:933
INFO bucket 2: resolution (768, 1216), count: 1 train_util.py:933
INFO bucket 3: resolution (768, 1280), count: 2 train_util.py:933
INFO bucket 4: resolution (832, 1088), count: 3 train_util.py:933
INFO bucket 5: resolution (832, 1152), count: 2 train_util.py:933
INFO bucket 6: resolution (832, 1216), count: 7 train_util.py:933
INFO bucket 7: resolution (896, 1024), count: 3 train_util.py:933
INFO bucket 8: resolution (896, 1088), count: 4 train_util.py:933
INFO bucket 9: resolution (896, 1152), count: 5 train_util.py:933
INFO bucket 10: resolution (960, 960), count: 1 train_util.py:933
INFO bucket 11: resolution (960, 1088), count: 1 train_util.py:933
INFO bucket 12: resolution (1024, 960), count: 2 train_util.py:933
INFO bucket 13: resolution (1024, 1024), count: 1 train_util.py:933
INFO bucket 14: resolution (1088, 832), count: 2 train_util.py:933
INFO bucket 15: resolution (1088, 896), count: 2 train_util.py:933
INFO bucket 16: resolution (1216, 832), count: 13 train_util.py:933
INFO mean ar error (without repeats): 0.01883627399857184 train_util.py:938
INFO prepare accelerator flux_train.py:171
accelerator device: cuda
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/vae/ae.safetensors flux_utils.py:66
INFO Loaded AE: flux_utils.py:69
2024-09-07 15:59:32 INFO [Dataset 0] train_util.py:2324
INFO caching latents with caching strategy. train_util.py:984
INFO checking cache validity... train_util.py:994
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1038
0%| | 0/51 [00:00<?, ?it/s]
D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py:79: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
h_ = nn.functional.scaled_dot_product_attention(q, k, v)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:14<00:00, 3.54it/s]
D:\train_flux\kohya_ss\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
2024-09-07 15:59:47 INFO Building CLIP flux_utils.py:74
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/clip_l.safetensors flux_utils.py:167
INFO Loaded CLIP: flux_utils.py:170
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/t5xxl_fp16.safetensors flux_utils.py:215
INFO Loaded T5xxl: flux_utils.py:218
INFO Building Flux model dev flux_utils.py:45
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors flux_utils.py:52
INFO Loaded Flux: flux_utils.py:55
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0 flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use AdamW optimizer | {} train_util.py:4541
running training / 学習開始
num examples / サンプル数: 51
num batches per epoch / 1epochのバッチ数: 51
num epochs / epoch数: 200
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10200
steps: 0%| | 0/10200 [00:00<?, ?it/s]
epoch 1/200
2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
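
The step count reported in the log above (`max_train_steps (51 / 1 / 1 * 200 * 1) = 10200`) can be reproduced with a few lines of Python. This is only a sketch that mirrors the printed formula; the function and parameter names are made up for illustration and are not taken from the kohya_ss source.

```python
import math


def max_train_steps(steps_per_epoch: int, batch_size: int,
                    grad_accum_steps: int, epochs: int,
                    reg_factor: int) -> int:
    """Mirror the formula printed in the log:
    steps_per_epoch / batch_size / grad_accum_steps * epochs * reg_factor."""
    return int(math.ceil(
        steps_per_epoch / batch_size / grad_accum_steps * epochs * reg_factor
    ))


# Values taken from the log: 51 images * 1 repeat, batch size 1,
# gradient accumulation 1, 200 epochs, regularization factor 1.
steps = max_train_steps(51, 1, 1, 200, 1)
print(steps)
```

With 10200 optimization steps planned, the progress bar staying at 0/10200 means that not even the first step completed.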


bolli20000 · 2024-09-07T14:16:10Z

After waiting a few minutes, training ends with the following traceback:

2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Traceback (most recent call last):
File "D:\train_flux\kohya_ss\sd-scripts\flux_train.py", line 905, in <module>
train(args)
File "D:\train_flux\kohya_ss\sd-scripts\flux_train.py", line 736, in train
accelerator.backward(loss)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 2159, in backward
loss.backward(**kwargs)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\_tensor.py", line 521, in backward
torch.autograd.backward(
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\autograd\__init__.py", line 289, in backward
_engine_run_backward(
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\autograd\graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 1116, in unpack_hook
frame.recompute_fn(*args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 1400, in recompute_fn
fn(*args, **kwargs)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 720, in forward
attn = attention(q, k, v, pe=pe, attn_mask=attn_mask)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 444, in attention
q, k = apply_rope(q, k, pe)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 464, in apply_rope
xk = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Of the allocated memory 52.22 GiB is allocated by PyTorch, and 1.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
steps: 0%| | 0/10200 [14:31<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\train_flux\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\train_flux\kohya_ss\venv\Scripts\python.exe', 'D:/train_flux/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml']' returned non-zero exit status 1.
16:15:09-408784 INFO Training has ended.

bolli20000 · 2024-09-07T14:18:00Z

So in the end it is a torch.OutOfMemoryError: CUDA out of memory?
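
A rough back-of-the-envelope check suggests why the run ends in an out-of-memory error. The sketch below uses the trainable parameter count from the log (11901408320) and the reported GPU capacity (23.99 GiB). The byte sizes are assumptions, not values read from the config: 2 bytes per value for weights, gradients, and each of the two AdamW moment buffers. Activations are ignored, so this is a lower bound at best.

```python
GIB = 1024 ** 3


def estimate_full_finetune_gib(num_params: int,
                               weight_bytes: int = 2,
                               grad_bytes: int = 2,
                               optimizer_bytes: int = 4) -> dict:
    """Estimate memory (GiB) for weights, gradients and optimizer state.

    Assumed defaults: bf16 weights and gradients (2 bytes each) and two
    AdamW moment buffers at 2 bytes each (4 bytes per parameter).
    """
    weights = num_params * weight_bytes / GIB
    gradients = num_params * grad_bytes / GIB
    optimizer = num_params * optimizer_bytes / GIB
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


TRAINABLE_PARAMS = 11_901_408_320  # "number of trainable parameters" in the log
GPU_CAPACITY_GIB = 23.99           # "total capacity" in the OOM message

estimate = estimate_full_finetune_gib(TRAINABLE_PARAMS)
for name, gib in estimate.items():
    print(f"{name:>10}: {gib:6.2f} GiB")
print("fits on GPU:", estimate["total"] <= GPU_CAPACITY_GIB)
```

Under these assumptions the weights alone come to roughly 22.2 GiB, close to the card's 23.99 GiB before gradients and optimizer state are counted. The 52.22 GiB reported as allocated on a 23.99 GiB card suggests the driver may have been spilling into shared system memory, which would be consistent with the long apparent freeze before the error.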

PsiClone99 · 2024-09-09T09:44:07Z

Are you sure you are training a LoRA and not Dreambooth? Or did you by chance save a config file in the Dreambooth tab during your last run and load it in the LoRA tab? Please start a new instance, set the settings manually, and try again.
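
One way to check which mode was actually launched is to look at the script named in the "Executing command" line of the log. The sketch below assumes the sd-scripts naming in which `flux_train.py` is the full fine-tune (Dreambooth) script and `flux_train_network.py` is the LoRA script. That mapping and the helper name `detect_training_kind` are assumptions for illustration, and the command string is abbreviated from the log.

```python
import re
from typing import Optional

# Assumed mapping from sd-scripts file names to training modes.
SCRIPT_KINDS = {
    "flux_train.py": "full fine-tune (Dreambooth)",
    "flux_train_network.py": "LoRA / network training",
}


def detect_training_kind(command: str) -> Optional[str]:
    """Return the training mode implied by the first .py script in a command."""
    match = re.search(r"([A-Za-z0-9_]+\.py)(?=\s|$)", command)
    if match is None:
        return None
    return SCRIPT_KINDS.get(match.group(1))


# Abbreviated from the "Executing command" line in the log.
command = (
    "accelerate launch --mixed_precision bf16 "
    "D:/train_flux/kohya_ss/sd-scripts/flux_train.py "
    "--config_file D:/Bilder/Project_AI/Train/model/"
    "config_dreambooth-20240907-155922.toml"
)
print(detect_training_kind(command))
```

For the command in this issue the helper reports the full fine-tune path, which matches the "Start training Dreambooth..." and "Using DreamBooth method." lines in the log.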
