Returned non-zero exit status 3221225477 - can't train lora #2800

Open
Telllinex opened this issue Sep 10, 2024 · 12 comments
@Telllinex

Telllinex commented Sep 10, 2024

Hello, here are the console logs:

12:21:36-202423 INFO     Kohya_ss GUI version: v24.2.0

12:21:37-517593 INFO     Submodule initialized and updated.
12:21:37-526597 INFO     nVidia toolkit detected
12:21:45-574279 INFO     Torch 2.4.0+cu124
12:21:45-634881 INFO     Torch backend: nVidia CUDA 12.4 cuDNN 90100
12:21:45-641875 INFO     Torch detected GPU: GRID RTX6000-12 VRAM 12288 Arch (7, 5) Cores 72
12:21:45-649874 INFO     Python version is 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit
                         (AMD64)]
12:21:45-660869 INFO     Verifying modules installation status from requirements_pytorch_windows.txt...
12:21:45-673867 INFO     Verifying modules installation status from requirements_windows.txt...
12:21:45-683872 INFO     Verifying modules installation status from requirements.txt...
12:22:10-436136 INFO     headless: False
12:22:10-550726 INFO     Using shell=True when running external commands...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
12:23:15-507835 INFO     Loading config...
12:23:22-742464 INFO     Loading config...
12:23:23-339989 INFO     Loading config...
12:23:34-182031 INFO     Loading config...
12:24:27-117548 INFO     Applying preset flux1D - adamw8bit fp8...
12:24:27-136546 INFO     Loading config...
12:25:56-303163 INFO     Save...
12:26:03-950586 INFO     Save...
12:27:47-307918 INFO     Start training LoRA Flux1 ...
12:27:47-309922 INFO     Validating lr scheduler arguments...
12:27:47-311917 INFO     Validating optimizer arguments...
12:27:47-313916 INFO     Validating lora type is Flux1 if flux1 checkbox is checked...
12:27:47-315922 INFO     Validating ./test/logs-saruman existence and writability... SUCCESS
12:27:47-317917 INFO     Validating D:/dataset/na4tal7n/Output existence and writability... SUCCESS
12:27:47-320919 INFO     Validating D:/flux1-dev.safetensors existence... SUCCESS
12:27:47-323919 INFO     Validating D:/dataset/na4tal7n/images existence... SUCCESS
12:27:47-326915 INFO     Folder 2_na4tal7n: 2 repeats found
12:27:47-328925 INFO     Folder 2_na4tal7n: 19 images found
12:27:47-331918 INFO     Folder 2_na4tal7n: 19 * 2 = 38 steps
12:27:47-334914 INFO     Error: 'sample' does not contain an underscore, skipping...
12:27:47-337914 INFO     Regulatization factor: 1
12:27:47-340911 INFO     Total steps: 38
12:27:47-342912 INFO     Train batch size: 1
12:27:47-345919 INFO     Gradient accumulation steps: 1
12:27:47-347911 INFO     Epoch: 1
12:27:47-350915 INFO     Max train steps: 1000
12:27:47-353913 INFO     stop_text_encoder_training = 0
12:27:47-356914 INFO     lr_warmup_steps = 0
12:27:47-360912 WARNING  train_blocks is currently set to 'all'. split_mode is enabled, forcing train_blocks to
                         'single'.
12:27:47-363915 INFO     Saving training config to
                         D:/dataset/na4tal7n/Output\Flux.my-super-duper-model-name-goes-here-v1.0_20240910-122747.json..
                         .
12:27:47-366914 INFO     Executing command: D:\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no
                         --dynamo_mode default --mixed_precision fp16 --num_processes 1 --num_machines 1
                         --num_cpu_threads_per_process 2 D:/kohya_ss/sd-scripts/flux_train_network.py --config_file
                         D:/dataset/na4tal7n/Output/config_lora-20240910-122747.toml
D:\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
D:\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
D:\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
D:\kohya_ss\venv\lib\site-packages\diffusers\utils\outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2024-09-10 12:28:06 INFO     Loading settings from                                                    train_util.py:4190
                             D:/dataset/na4tal7n/Output/config_lora-20240910-122747.toml...
                    INFO     D:/dataset/na4tal7n/Output/config_lora-20240910-122747                   train_util.py:4209
2024-09-10 12:28:06 INFO     t5xxl_max_token_length: 512                                       flux_train_network.py:155
D:\kohya_ss\venv\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2024-09-10 12:28:07 INFO     Using DreamBooth method.                                               train_network.py:291
                    WARNING  ignore directory without repeats: sample                              config_util.py:589
                    INFO     prepare images.                                                          train_util.py:1803
                    INFO     get image size from name of cache files                                  train_util.py:1741
100%|████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 3801.18it/s]
                    INFO     set image size from cache files: 0/19                                    train_util.py:1748
                    INFO     found directory D:\dataset\na4tal7n\images\2_na4tal7n contains 19 image  train_util.py:1750
                             files
                    INFO     38 train images with repeating.                                          train_util.py:1844
                    INFO     0 reg images.                                                            train_util.py:1847
                    WARNING  no regularization images                                              train_util.py:1852
                    INFO     [Dataset 0]                                                              config_util.py:570
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: True
                               network_multiplier: 1.0
                               min_bucket_reso: 256
                               max_bucket_reso: 2048
                               bucket_reso_steps: 64
                               bucket_no_upscale: True

                               [Subset 0 of Dataset 0]
                                 image_dir: "D:\dataset\na4tal7n\images\2_na4tal7n"
                                 image_count: 19
                                 num_repeats: 2
                                 shuffle_caption: False
                                 keep_tokens: 0
                                 keep_tokens_separator:
                                 caption_separator: ,
                                 secondary_separator: None
                                 enable_wildcard: False
                                 caption_dropout_rate: 0.0
                                 caption_dropout_every_n_epoches: 0
                                 caption_tag_dropout_rate: 0.0
                                 caption_prefix: None
                                 caption_suffix: None
                                 color_aug: False
                                 flip_aug: False
                                 face_crop_aug_range: None
                                 random_crop: False
                                 token_warmup_min: 1,
                                 token_warmup_step: 0,
                                 alpha_mask: False,
                                 is_reg: False
                                 class_tokens: na4tal7n
                                 caption_extension: .txt


                    INFO     [Dataset 0]                                                              config_util.py:576
                    INFO     loading image sizes.                                                      train_util.py:876
100%|█████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 175.94it/s]
2024-09-10 12:28:08 INFO     make buckets                                                              train_util.py:882
                    WARNING  min_bucket_reso and max_bucket_reso are ignored if bucket_no_upscale is   train_util.py:899
                             set, because bucket reso is defined by image size automatically
                    INFO     number of images per bucket (including repeats)                          train_util.py:928
                    INFO     bucket 0: resolution (384, 512), count: 2                                 train_util.py:933
                    INFO     bucket 1: resolution (384, 576), count: 8                                 train_util.py:933
                    INFO     bucket 2: resolution (576, 384), count: 28                                train_util.py:933
                    INFO     mean ar error (without repeats): 0.0                                      train_util.py:938
                    INFO     preparing accelerator                                                  train_network.py:345
D:\kohya_ss\venv\lib\site-packages\accelerate\accelerator.py:488: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler(**kwargs)
accelerator device: cuda
                    INFO     Building Flux model dev                                                    flux_utils.py:45
                    INFO     Loading state dict from D:/flux1-dev.safetensors                           flux_utils.py:52
                    INFO     Loaded Flux: <All keys matched successfully>                               flux_utils.py:55
                    INFO     prepare split model                                               flux_train_network.py:110
                    INFO     load state dict for lower                                         flux_train_network.py:117
2024-09-10 12:28:53 INFO     load state dict for upper                                         flux_train_network.py:122
2024-09-10 12:28:54 INFO     prepare upper model                                               flux_train_network.py:125
Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\user\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in <module>
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
    args.func(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "D:\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\kohya_ss\\venv\\Scripts\\python.exe', 'D:/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'D:/dataset/na4tal7n/Output/config_lora-20240910-122747.toml']' returned non-zero exit status 3221225477.
12:30:38-453385 INFO     Training has ended.

Maybe I can provide more detailed logs to identify what's wrong? This GPU does not support bf16, so I disabled it and switched to fp16, but the problem seems to be elsewhere. I have 16 GB RAM, a 32 GB pagefile, and 12 GB VRAM.
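
(For reference: 3221225477 is the unsigned form of the Windows status code 0xC0000005, STATUS_ACCESS_VIOLATION, so the trainer subprocess crashed at the native level rather than raising a Python exception. A quick check of the conversion:)

```python
# Decode the exit status reported by accelerate's subprocess call.
# 3221225477 is the unsigned 32-bit view of the Windows NTSTATUS code
# 0xC0000005 (STATUS_ACCESS_VIOLATION): the child process died in native
# code, not with a Python traceback of its own.
exit_status = 3221225477
print(hex(exit_status))  # 0xc0000005
```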

@iamrohitanshu
Contributor

I have a 3060 12GB. I trained a LoRA just the day before yesterday; the same config doesn't work today after I updated yesterday.
I was on commit 41afd2662ff11e70630066bfbc6101a3272420ab before updating.

@Telllinex
Author

> I have a 3060 12GB. I trained a LoRA just the day before yesterday; the same config doesn't work today after I updated yesterday. I was on commit 41afd2662ff11e70630066bfbc6101a3272420ab before updating.

So you have the same error? Have you tried going back to 41afd26?

@iamrohitanshu
Contributor

iamrohitanshu commented Sep 11, 2024

Yes, the same "returned non-zero exit status 3221225477" error.
Honestly, I didn't want to go back because of a previous miserable experience, and I wasn't in a hurry and hoped it would be resolved soon.
But I still decided to check out 41afd26.
On startup it rolled gradio back from 4.43.0 to 4.41.0, and it's working now. The Flux LoRA has started training.

@Telllinex
Author

> Yes, the same "returned non-zero exit status 3221225477" error. Honestly, I didn't want to go back because of a previous miserable experience, and I wasn't in a hurry and hoped it would be resolved soon. But I still decided to check out 41afd26. On startup it rolled gradio back from 4.43.0 to 4.41.0, and it's working now. The Flux LoRA has started training.

Thanks, I'll try it.
I would probably still get bad results though; I never understood all those parameters :)

@iamrohitanshu
Contributor

iamrohitanshu commented Sep 11, 2024

@Telllinex This might help:
https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters

Also, why did you close this issue as completed? The issue is still there; we just reverted to a previous commit, right?

@Telllinex
Author

> @Telllinex This might help: https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters
>
> Also, why did you close this issue as completed? The issue is still there; we just reverted to a previous commit, right?

I just assumed no one cared about it because the latest commit works for everyone else, but OK, I'll reopen :)

@Telllinex Telllinex reopened this Sep 11, 2024
@Telllinex
Author

Telllinex commented Sep 11, 2024

> @Telllinex This might help: https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters
>
> Also, why did you close this issue as completed? The issue is still there; we just reverted to a previous commit, right?

I tried again after switching to 41afd26 and still get the same error. Maybe I need to increase the pagefile size?
Also, could you by any chance share your config for the 3060 with 12 GB VRAM? Thanks a lot for the help!

@iamrohitanshu
Contributor

I don't know about this particular error, but 16 GB of RAM seems too low to train a Flux LoRA; it takes up a lot of memory.
Also, try restarting your PC. I was getting OOM errors again and again for no apparent reason, and after restarting the PC they went away, so try that as well.
Anyway, in the Parameters tab:

- choose LoRA Type "Flux1"
- set the Resolution to 512x512
- check "Split Mode"
- set Train Blocks to "single"
- keep the Network Rank low, such as 16 or 8

In the Advanced tab, under Additional Parameters, add "--network_train_unet_only".
These are the settings that keep VRAM usage minimal as far as I know (a rough sketch of the resulting parameters follows this list).
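
(A minimal, illustrative sketch of the low-VRAM parameter set described above. The key names are taken from the log earlier in this issue (split_mode, train_blocks) and from the standard sd-scripts names for rank and UNet-only training; the exact keys the GUI writes into its generated TOML may differ, so treat this as an assumption, not the verified config.)

```python
# Illustrative only: a low-VRAM Flux LoRA parameter set mirroring the GUI
# choices listed above. Not a verified dump of the GUI's config file.
low_vram_flux_lora = {
    "max_resolution": "512,512",                      # Resolution 512x512
    "split_mode": True,                               # "Split Mode" checkbox
    "train_blocks": "single",                         # forced to 'single' when split_mode is on
    "network_dim": 16,                                # Network Rank kept low (16 or 8)
    "additional_parameters": "--network_train_unet_only",
}
```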

@Telllinex
Author

Telllinex commented Sep 11, 2024

> I don't know about this particular error, but 16 GB of RAM seems too low to train a Flux LoRA; it takes up a lot of memory. Also, try restarting your PC. I was getting OOM errors again and again for no apparent reason, and after restarting the PC they went away, so try that as well. Anyway, in the Parameters tab:
>
> - choose LoRA Type "Flux1"
> - set the Resolution to 512x512
> - check "Split Mode"
> - set Train Blocks to "single"
> - keep the Network Rank low, such as 16 or 8
>
> In the Advanced tab, under Additional Parameters, add "--network_train_unet_only". These are the settings that keep VRAM usage minimal as far as I know.

Thank you so much! After adding your parameters it finally started training, using 8.4 GB VRAM.
(screenshot attached)

Previously I tried training in OneTrainer, but the LoRA didn't work at all despite using the right parameters :)

@iamrohitanshu
Contributor

See, I get around 10-something s/it with a batch size of 1 on my 3060, so your lack of RAM is definitely hurting your speed.

Make the batch size 2; each step will be a bit slower (higher s/it), but the total number of steps will be halved, so overall it'll be faster (see the back-of-envelope sketch below).
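
(A back-of-envelope illustration of that trade-off; the per-step times below are assumed round numbers for illustration, not measurements.)

```python
# Doubling the batch size halves the number of optimizer steps needed to see
# the same number of images, so the run can finish sooner even if each step
# takes longer. All timing figures here are illustrative assumptions.
max_train_steps_bs1 = 1000      # max train steps at batch size 1 (from the log above)
sec_per_it_bs1 = 10.0           # assumed ~10 s/it at batch size 1
sec_per_it_bs2 = 14.0           # assumed slower per step at batch size 2

total_bs1 = max_train_steps_bs1 * sec_per_it_bs1          # 10000 s, roughly 2.8 h
total_bs2 = (max_train_steps_bs1 // 2) * sec_per_it_bs2   #  7000 s, roughly 1.9 h
print(f"batch 1: {total_bs1 / 3600:.1f} h, batch 2: {total_bs2 / 3600:.1f} h")
```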

@Telllinex
Author

> See, I get around 10-something s/it with a batch size of 1 on my 3060, so your lack of RAM is definitely hurting your speed.
>
> Make the batch size 2; each step will be a bit slower (higher s/it), but the total number of steps will be halved, so overall it'll be faster.

I'm just happy I can train it on a VM I have for free, without loading my own PC and paying the electricity bill. As you know, it has half of an RTX 6000 GRID (Turing) with 12 GB VRAM and 16 GB RAM.
Yes, the training speed is not the same, but for generating images the speed matches a 3060 and is even better.
Also, you said it's a lack of RAM, but... halfway through training, GPU VRAM usage jumps between 8.5 and 4 GB and RAM varies from 9.1 to 14 GB, so I can't see how it's limited by RAM or VRAM. Maybe if I had your config I could test it, or maybe it's because I switched to fp16, since bf16 is not supported by Turing. But yeah, I'm checking the results and they are incredible even at 500 steps.

@iamrohitanshu
Contributor

Yes, the VRAM usage keeps going up and down; I don't know the reason.
And if your RAM is not spilling over, then that's good. I only mentioned speed because it was 12 GB and I assumed it might be a 3060; the speed difference might just be the different architectures.
Flux gives good results even around 500 steps for photorealistic concepts. All the best for your training.
