Cannot train: kohya_ss on branch sd3-flux.1 on latest version causes training to stop and freeze #2795
Comments
After waiting a few minutes, training ends with the following traceback: 2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
At the end, a torch.OutOfMemoryError: CUDA out of memory?
Are you sure you are training LoRA and not Dreambooth? Or did you by chance save a config file in the Dreambooth tab during your last run and then load it in the LoRA tab? Please start a new instance, set the settings manually, and try again.
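A rough back-of-the-envelope check on the out-of-memory question: the log reports 11901408320 trainable parameters and the DreamBooth method, which points to full fine-tuning rather than LoRA. The sketch below estimates the memory needed just for weights, gradients, and AdamW state. The byte sizes are assumptions (bf16 weights and gradients, two fp32 AdamW moment buffers per parameter), and the function name is made up for illustration. The real footprint depends on the optimizer and precision settings and does not include activations.

```python
# Rough VRAM estimate for full fine-tuning; byte sizes are assumptions, not measured values.
TRAINABLE_PARAMS = 11_901_408_320  # from the log: "number of trainable parameters"


def estimate_gib(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Return (weights, gradients, optimizer state, total) in GiB.

    Defaults assume bf16 weights and gradients (2 bytes each) and
    two fp32 AdamW moment buffers (2 * 4 bytes) per parameter.
    """
    gib = 1024 ** 3
    weights = num_params * weight_bytes / gib
    grads = num_params * grad_bytes / gib
    optim = num_params * optimizer_bytes / gib
    return weights, grads, optim, weights + grads + optim


weights, grads, optim, total = estimate_gib(TRAINABLE_PARAMS)
print(f"weights: {weights:.1f} GiB, gradients: {grads:.1f} GiB, "
      f"optimizer state: {optim:.1f} GiB, total: {total:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 22 GiB, before gradients, optimizer state, or activations, so an out-of-memory error on a consumer GPU would not be surprising.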
The training process always stops or freezes at epoch 1, so I cannot train. Does anybody have an idea? Please help.
Here is the log:
To create a public link, set share=True in launch().
15:58:52-915530 INFO Loading config...
15:59:22-223452 INFO Start training Dreambooth...
15:59:22-224452 INFO Validating lr scheduler arguments...
15:59:22-225451 INFO Validating optimizer arguments...
15:59:22-226452 INFO Validating D:/Bilder/Project_AI/Train/model existence and writability... SUCCESS
15:59:22-240151 INFO Validating C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors existence... SUCCESS
15:59:22-241152 INFO Validating D:/Bilder/Project_AI/Train/BolliHotShots existence... SUCCESS
15:59:22-242152 INFO Folder 1_BHS Man: 1 repeats found
15:59:22-243152 INFO Folder 1_BHS Man: 51 images found
15:59:22-243152 INFO Folder 1_BHS Man: 51 * 1 = 51 steps
15:59:22-244151 INFO Regulatization factor: 1
15:59:22-244151 INFO Total steps: 51
15:59:22-245152 INFO Train batch size: 1
15:59:22-245152 INFO Gradient accumulation steps: 1
15:59:22-246151 INFO Epoch: 200
15:59:22-247151 INFO max_train_steps (51 / 1 / 1 * 200 * 1) = 10200
15:59:22-249151 INFO lr_warmup_steps = 0
15:59:22-252151 INFO Saving training config to D:/Bilder/Project_AI/Train/model\BolliHotsShots_Latest_Flux_20240907-155922.json...
15:59:22-253153 INFO Executing command: D:\train_flux\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1
--num_cpu_threads_per_process 2 D:/train_flux/kohya_ss/sd-scripts/flux_train.py --config_file D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml
D:\train_flux\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
D:\train_flux\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
D:\train_flux\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
D:\train_flux\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
2024-09-07 15:59:31 INFO Loading settings from D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922.toml... train_util.py:4190
INFO D:/Bilder/Project_AI/Train/model/config_dreambooth-20240907-155922 train_util.py:4209
2024-09-07 15:59:31 INFO Using DreamBooth method. flux_train.py:101
INFO prepare images. train_util.py:1803
INFO get image size from name of cache files train_util.py:1741
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 5614.57it/s]
INFO set image size from cache files: 0/51 train_util.py:1748
INFO found directory D:\Bilder\Project_AI\Train\BolliHotShots\1_BHS Man contains 51 image files train_util.py:1750
INFO 51 train images with repeating. train_util.py:1844
INFO 0 reg images. train_util.py:1847
WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1852
INFO [Dataset 0] config_util.py:570
batch_size: 1
resolution: (1024, 1024)
enable_bucket: True
network_multiplier: 1.0
min_bucket_reso: 256
max_bucket_reso: 2048
bucket_reso_steps: 64
bucket_no_upscale: True
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<00:00, 1826.90it/s]
INFO make buckets train_util.py:882
WARNING min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is set, because bucket reso is defined by image size automatically / train_util.py:899
bucket_no_upscaleが指定された場合は、bucketの解像度は画像サイズから自動計算されるため、min_bucket_resoとmax_bucket_resoは無視されます
INFO number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む) train_util.py:928
INFO bucket 0: resolution (512, 768), count: 1 train_util.py:933
INFO bucket 1: resolution (768, 1152), count: 1 train_util.py:933
INFO bucket 2: resolution (768, 1216), count: 1 train_util.py:933
INFO bucket 3: resolution (768, 1280), count: 2 train_util.py:933
INFO bucket 4: resolution (832, 1088), count: 3 train_util.py:933
INFO bucket 5: resolution (832, 1152), count: 2 train_util.py:933
INFO bucket 6: resolution (832, 1216), count: 7 train_util.py:933
INFO bucket 7: resolution (896, 1024), count: 3 train_util.py:933
INFO bucket 8: resolution (896, 1088), count: 4 train_util.py:933
INFO bucket 9: resolution (896, 1152), count: 5 train_util.py:933
INFO bucket 10: resolution (960, 960), count: 1 train_util.py:933
INFO bucket 11: resolution (960, 1088), count: 1 train_util.py:933
INFO bucket 12: resolution (1024, 960), count: 2 train_util.py:933
INFO bucket 13: resolution (1024, 1024), count: 1 train_util.py:933
INFO bucket 14: resolution (1088, 832), count: 2 train_util.py:933
INFO bucket 15: resolution (1088, 896), count: 2 train_util.py:933
INFO bucket 16: resolution (1216, 832), count: 13 train_util.py:933
INFO mean ar error (without repeats): 0.01883627399857184 train_util.py:938
INFO prepare accelerator flux_train.py:171
accelerator device: cuda
INFO Building AutoEncoder flux_utils.py:62
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/vae/ae.safetensors flux_utils.py:66
INFO Loaded AE: flux_utils.py:69
2024-09-07 15:59:32 INFO [Dataset 0] train_util.py:2324
INFO caching latents with caching strategy. train_util.py:984
INFO checking cache validity... train_util.py:994
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:00<?, ?it/s]
INFO caching latents... train_util.py:1038
0%| | 0/51 [00:00<?, ?it/s]
D:\train_flux\kohya_ss\sd-scripts\library\flux_models.py:79: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
h_ = nn.functional.scaled_dot_product_attention(q, k, v)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:14<00:00, 3.54it/s]
D:\train_flux\kohya_ss\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
2024-09-07 15:59:47 INFO Building CLIP flux_utils.py:74
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/clip_l.safetensors flux_utils.py:167
INFO Loaded CLIP: flux_utils.py:170
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/clip/t5xxl_fp16.safetensors flux_utils.py:215
INFO Loaded T5xxl: flux_utils.py:218
INFO Building Flux model dev flux_utils.py:45
INFO Loading state dict from C:/comfy/ComfyUI_windows_portable/ComfyUI/models/diffusion_models/flux1-dev.safetensors flux_utils.py:52
INFO Loaded Flux: flux_utils.py:55
FLUX: Gradient checkpointing enabled. CPU offload: False
INFO enable block swap: double_blocks_to_swap=0, single_blocks_to_swap=0 flux_train.py:272
number of trainable parameters: 11901408320
prepare optimizer, data loader etc.
INFO use AdamW optimizer | {} train_util.py:4541
running training / 学習開始
num examples / サンプル数: 51
num batches per epoch / 1epochのバッチ数: 51
num epochs / epoch数: 200
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10200
steps: 0%| | 0/10200 [00:00<?, ?it/s]
epoch 1/200
2024-09-07 16:00:33 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
D:\train_flux\kohya_ss\venv\lib\site-packages\torch\utils\checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
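For reference, the step count printed near the top of the log, max_train_steps (51 / 1 / 1 * 200 * 1) = 10200, can be reproduced with a few lines. This is a sketch that assumes the factors are, in order, total image steps, train batch size, gradient accumulation steps, epochs, and regularization factor, matching the order in which the log prints those values. The function name is made up for illustration, and rounding up is an assumption for cases where the result is not a whole number.

```python
import math


def max_train_steps(total_steps, batch_size, grad_accum_steps, epochs, reg_factor):
    """Reproduce the arithmetic shown in the log line
    'max_train_steps (51 / 1 / 1 * 200 * 1) = 10200'.

    Rounding up with ceil is an assumption for non-integer results.
    """
    return math.ceil(total_steps / batch_size / grad_accum_steps * epochs * reg_factor)


# Values from the log: 51 images * 1 repeat, batch size 1,
# gradient accumulation 1, 200 epochs, regularization factor 1.
steps = max_train_steps(51, 1, 1, 200, 1)
print(steps)
```

With 200 epochs over 51 images, the run was configured for 10200 optimization steps, and the progress bar in the log stops at step 0 of that total.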