Cannot train: kohya_ss on branch sd3-flux.1 on latest version causes training to stop and freeze #2795

Open
bolli20000 opened this issue Sep 7, 2024 · 3 comments

Comments

@bolli20000

Training always stops or freezes at epoch 1, so I cannot train at all. Does anybody have an idea? Please help.

Here is the log:

To create a public link, set share=True in launch().
15:58:52-915530 INFO Loading config...
15:59:22-223452 INFO Start training Dreambooth...
15:59:22-224452 INFO Validating lr scheduler arguments...
15:59:22-225451 INFO Validating optimizer arguments...
15:59:22-226452 INFO Validating D:/Bilder/Project_AI/Train/model existence and writability... SUCCESS
15:59:22-240151 INFO Validating C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors existence... SUCCESS
15:59:22-241152 INFO Validating D:/Bilder/Project_AI/Train/BolliHotShots existence... SUCCESS
15:59:22-242152 INFO Folder 1_BHS Man: 1 repeats found
15:59:22-243152 INFO Folder 1_BHS Man: 51 images found
15:59:22-243152 INFO Folder 1_BHS Man: 51 * 1 = 51 steps
15:59:22-244151 INFO Regulatization factor: 1
15:59:22-244151 INFO Total steps: 51
15:59:22-245152 INFO Train batch size: 1
15:59:22-245152 INFO Gradient accumulation steps: 1
15:59:22-246151 INFO Epoch: 200
15:59:22-247151 INFO max_train_steps (51 / 1 / 1 * 200 * 1) = 10200
15:59:22-249151 INFO lr_warmup_steps = 0
15:59:22-252151 INFO Saving training config to D:/Bilder/Project_AI/Train/model\BolliHotsShots_Latest_Flux_20240907-155922.json...
15:59:22-253153 INFO Executing command: D:\train_flux\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1
--num_cpu_threads_per_process 2 D:/train_flux/kohya_ss/sd-scripts/flux_train.py --config_file D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml
D:\train_flux\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
D:\train_flux\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
D:\train_flux\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
D:\train_flux\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-09-07 15:59:31 INFO Loading settings from D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml... train_util.py:4190
INFO D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922 train_util.py:4209
2024-09-07 15:59:31 INFO Using DreamBooth method. flux_train.py:101
INFO prepare images. train_util.py:1803
INFO get image size from name of cache files train_util.py:1741
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 5614.57it/s]
INFO set image size from cache files: 0/51 train_util.py:1748
INFO found directory D:\Bilder\Project_AI\Train\BolliHotShots\1_BHS Man contains 51 image files train_util.py:1750
INFO 51 train images with repeating. train_util.py:1844
INFO 0 reg images. train_util.py:1847
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1852
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True

                           [Subset 0 of Dataset 0]
                             image_dir: "D:\Bilder\Project_AI\Train\BolliHotShots\1_BHS Man"
                             image_count: 51
                             num_repeats: 1
                             shuffle_caption: False
                             keep_tokens: 0
                             keep_tokens_separator:
                             caption_separator: ,
                             secondary_separator: None
                             enable_wildcard: False
                             caption_dropout_rate: 0
                             caption_dropout_every_n_epoches: 0
                             caption_tag_dropout_rate: 0.0
                             caption_prefix: None
                             caption_suffix: None
                             color_aug: False
                             flip_aug: False
                             face_crop_aug_range: None
                             random_crop: False
                             token_warmup_min: 1,
                             token_warmup_step: 0,
                             alpha_mask: False,
                             is_reg: False
                             class_tokens: BHS Man
                             caption_extension: .txt


                INFO     [Dataset 0]                                                                                                                                                       config_util.py:576
                INFO     loading image sizes.                                                                                                                                               train_util.py:876

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 1826.90it/s]
INFO make buckets train_util.py:882
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / train_util.py:899
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:928
INFO bucket 0: resolution (512, 768), count: 1 train_util.py:933
INFO bucket 1: resolution (768, 1152), count: 1 train_util.py:933
INFO bucket 2: resolution (768, 1216), count: 1 train_util.py:933
INFO bucket 3: resolution (768, 1280), count: 2 train_util.py:933
INFO bucket 4: resolution (832, 1088), count: 3 train_util.py:933
INFO bucket 5: resolution (832, 1152), count: 2 train_util.py:933
INFO bucket 6: resolution (832, 1216), count: 7 train_util.py:933
INFO bucket 7: resolution (896, 1024), count: 3 train_util.py:933
INFO bucket 8: resolution (896, 1088), count: 4 train_util.py:933
INFO bucket 9: resolution (896, 1152), count: 5 train_util.py:933
INFO bucket 10: resolution (960, 960), count: 1 train_util.py:933
INFO bucket 11: resolution (960, 1088), count: 1 train_util.py:933
INFO bucket 12: resolution (1024, 960), count: 2 train_util.py:933
INFO bucket 13: resolution (1024, 1024), count: 1 train_util.py:933
INFO bucket 14: resolution (1088, 832), count: 2 train_util.py:933
INFO bucket 15: resolution (1088, 896), count: 2 train_util.py:933
INFO bucket 16: resolution (1216, 832), count: 13 train_util.py:933
INFO mean ar error (without repeats): 0.01883627399857184 train_util.py:938
INFO prepare accelerator flux_train.py:171
accelerator device: cuda
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/vae/ae.safetensors flux_utils.py:66
INFO Loaded AE: flux_utils.py:69
2024-09-07 15:59:32 INFO [Dataset 0] train_util.py:2324
INFO caching latents with caching strategy. train_util.py:984
INFO checking cache validity... train_util.py:994
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1038
0%| | 0/51 [00:00<?, ?it/s]D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py:79: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
h_ = nn.functional.scaled_dot_product_attention(q, k, v)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:14<00:00, 3.54it/s]
D:\train_flux\kohya_ss\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
2024-09-07 15:59:47 INFO Building CLIP flux_utils.py:74
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/clip_l.safetensors flux_utils.py:167
INFO Loaded CLIP: flux_utils.py:170
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/t5xxl_fp16.safetensors flux_utils.py:215
INFO Loaded T5xxl: flux_utils.py:218
INFO Building Flux model dev flux_utils.py:45
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors flux_utils.py:52
INFO Loaded Flux: flux_utils.py:55
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0 flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use AdamW optimizer | {} train_util.py:4541
running training / 学習開始
num examples / サンプル数: 51
num batches per epoch / 1epochのバッチ数: 51
num epochs / epoch数: 200
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10200
steps: 0%| | 0/10200 [00:00<?, ?it/s]
epoch 1/200
2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
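
For reference, the step count in the log above, `max_train_steps (51 / 1 / 1 * 200 * 1) = 10200`, can be reproduced with a short sketch. The function name and the rounding-up step are assumptions made for illustration; only the input values and the result come from the log.

```python
import math


def max_train_steps(images, repeats, batch_size, grad_accum, epochs, reg_factor=1):
    """Mirror the expression printed in the log: images * repeats / batch / accum * epochs * reg."""
    steps = images * repeats / batch_size / grad_accum * epochs * reg_factor
    # Rounding up is an assumption; it makes no difference for the values in this log.
    return math.ceil(steps)


# Values taken from the log: 51 images, 1 repeat, batch size 1,
# gradient accumulation 1, 200 epochs, regularization factor 1.
total = max_train_steps(images=51, repeats=1, batch_size=1, grad_accum=1, epochs=200)
print(total)
```

So the run is set up for 10200 optimizer steps, and the progress bar never gets past step 0.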

@bolli20000
Author

After waiting for some minutes, training ends with the following traceback:

2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
Traceback (most recent call last):
File "D:\train_flux\kohya_ss\sd-scripts\flux_train.py", line 905, in <module>
train(args)
File "D:\train_flux\kohya_ss\sd-scripts\flux_train.py", line 736, in train
accelerator.backward(loss)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py", line 2159, in backward
loss.backward(**kwargs)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\_tensor.py", line 521, in backward
torch.autograd.backward(
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\autograd\__init__.py", line 289, in backward
_engine_run_backward(
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\autograd\graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 1116, in unpack_hook
frame.recompute_fn(*args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py", line 1400, in recompute_fn
fn(*args, **kwargs)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 720, in forward
attn = attention(q, k, v, pe=pe, attn_mask=attn_mask)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 444, in attention
q, k = apply_rope(q, k, pe)
File "D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py", line 464, in apply_rope
xk_ = xk.float().reshape(*xk.shape[:-1], -1, 1, 2)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 52.00 MiB. GPU 0 has a total capacity of 23.99 GiB of which 0 bytes is free. Of the allocated memory 52.22 GiB is allocated by PyTorch, and 1.61 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
steps: 0%| | 0/10200 [14:31<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Users\sabot\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
File "D:\train_flux\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "D:\train_flux\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\train_flux\kohya_ss\venv\Scripts\python.exe', 'D:/train_flux/kohya_ss/sd-scripts/flux_train.py', '--config_file', 'D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml']' returned non-zero exit status 1.
16:15:09-408784 INFO Training has ended.

@bolli20000
Author

So in the end it fails with torch.OutOfMemoryError: CUDA out of memory?
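
A rough back-of-the-envelope estimate may help explain the out-of-memory error. The sketch below uses the parameter count from the log (`number of trainable parameters: 11901408320`) and the GPU capacity from the error message (23.99 GiB). It assumes 2 bytes per value (bf16) for weights, gradients, and the two moment buffers that AdamW keeps per parameter; the real run may use other precisions, and activations are ignored entirely, so treat the result as a loose lower bound rather than a measurement. The function name is hypothetical.

```python
GIB = 1024 ** 3


def training_memory_gib(num_params, weight_bytes=2, grad_bytes=2, optim_state_bytes=2):
    """Rough lower bound in GiB: weights + gradients + two AdamW moment buffers."""
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    # AdamW tracks two running averages per parameter (first and second moment).
    adamw_states = 2 * num_params * optim_state_bytes
    return (weights + grads + adamw_states) / GIB


params = 11_901_408_320   # trainable parameters reported in the log
gpu_capacity_gib = 23.99  # total capacity reported in the error message

weights_only_gib = params * 2 / GIB  # bf16 weights alone (assumed precision)
estimate_gib = training_memory_gib(params)

print(f"weights only: ~{weights_only_gib:.1f} GiB")
print(f"weights + grads + AdamW states: ~{estimate_gib:.1f} GiB")
print(f"fits on the GPU: {estimate_gib < gpu_capacity_gib}")
```

Under these assumptions the weights alone already take most of the card, so training all parameters with plain AdamW would not be expected to fit in 24 GB without some form of offloading or a much lighter method such as LoRA.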

@PsiClone99

Are you sure you are training a LoRA and not Dreambooth? Or did you by chance save a config file in the Dreambooth tab during your last run and then load it in the LoRA tab? Please start a new instance, set the settings manually, and try again.
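
One way to check the question above is to look at which script the GUI actually launched. The sketch below pulls the script name out of the "Executing command" line from the log. The helper names are hypothetical, and the rule that LoRA training runs through a script ending in `_network.py` is an assumption about sd-scripts naming that should be verified against the installed version.

```python
import re


def training_script(command):
    """Return the file name of the first .py script found in a launch command."""
    match = re.search(r"[\w.-]+\.py\b", command)
    return match.group(0) if match else None


def looks_like_lora(script_name):
    # Assumption: LoRA training scripts end in "_network.py";
    # verify against the sd-scripts checkout in use.
    return script_name is not None and script_name.endswith("_network.py")


# Shortened copy of the command from the log (launcher flags omitted).
command = (
    "D:\\train_flux\\kohya_ss\\venv\\Scripts\\accelerate.EXE launch "
    "D:/train_flux/kohya_ss/sd-scripts/flux_train.py "
    "--config_file D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml"
)

script = training_script(command)
print(script, "-> LoRA script?", looks_like_lora(script))
```

In this log the launched script is `flux_train.py`, and the run reports about 11.9 billion trainable parameters, which points to a full fine-tune rather than a LoRA.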
