Smaller flux checkpoints? #128

Open
chris9-0 opened this issue Sep 24, 2024 · 2 comments

@chris9-0

Is it possible to run fluxgym on a smaller model? I don't use fp16 because I only have 8GB of VRAM. I tried putting my flux model into the fluxgym models directory, but I receive an error (posted in my comment below).

@chris9-0
Author

So far I am getting an error:

[2024-09-24 21:36:12] [INFO] Running D:\pinokio\api\fluxgym.git\outputs\juna\train.bat
[2024-09-24 21:36:12] [INFO]
[2024-09-24 21:36:12] [INFO] (env) (base) D:\pinokio\api\fluxgym.git>accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 sd-scripts/flux_train_network.py --pretrained_model_name_or_path "D:\pinokio\api\fluxgym.git\models\unet\flux1-dev.sft" --clip_l "D:\pinokio\api\fluxgym.git\models\clip\clip_l.safetensors" --t5xxl "D:\pinokio\api\fluxgym.git\models\clip\t5xxl_fp16.safetensors" --ae "D:\pinokio\api\fluxgym.git\models\vae\ae.sft" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --split_mode --network_args "train_blocks=single" --lr_scheduler constant_with_warmup --max_grad_norm 0.0 --learning_rate 8e-4 --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 8 --save_every_n_epochs 2 --dataset_config "D:\pinokio\api\fluxgym.git\outputs\juna\dataset.toml" --output_dir "D:\pinokio\api\fluxgym.git\outputs\juna" --output_name juna --timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1 --loss_type l2
[2024-09-24 21:36:21] [INFO] The following values were not passed to accelerate launch and had defaults used instead:
[2024-09-24 21:36:21] [INFO] --num_processes was set to a value of 1
[2024-09-24 21:36:21] [INFO] --num_machines was set to a value of 1
[2024-09-24 21:36:21] [INFO] --dynamo_backend was set to a value of 'no'
[2024-09-24 21:36:21] [INFO] To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
[2024-09-24 21:36:27] [INFO] highvram is enabled
[2024-09-24 21:36:27] [INFO] 2024-09-24 21:36:27 WARNING cache_latents_to_disk is enabled, so cache_latents is also enabled  train_util.py:3951
[2024-09-24 21:36:27] [INFO] 2024-09-24 21:36:27 INFO t5xxl_max_token_length: 512  flux_train_network.py:155
[2024-09-24 21:36:28] [INFO] D:\pinokio\api\fluxgym.git\env\lib\site-packages\transformers\tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
[2024-09-24 21:36:28] [INFO] warnings.warn(
[2024-09-24 21:36:28] [INFO] You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
[2024-09-24 21:36:28] [INFO] 2024-09-24 21:36:28 INFO Loading dataset config from D:\pinokio\api\fluxgym.git\outputs\juna\dataset.toml  train_network.py:280
[2024-09-24 21:36:28] [INFO] INFO prepare images. train_util.py:1808
[2024-09-24 21:36:28] [INFO] INFO get image size from name of cache files  train_util.py:1746
[2024-09-24 21:36:28] [INFO] 100%|██████████| 9/9 [00:00<?, ?it/s]
[2024-09-24 21:36:28] [INFO] INFO set image size from cache files: 0/9  train_util.py:1753
[2024-09-24 21:36:28] [INFO] INFO found directory D:\pinokio\api\fluxgym.git\datasets\juna contains 9 image files  train_util.py:1755
[2024-09-24 21:36:28] [INFO] INFO 90 train images with repeating. train_util.py:1849
[2024-09-24 21:36:28] [INFO] INFO 0 reg images. train_util.py:1852
[2024-09-24 21:36:28] [INFO] WARNING no regularization images found  train_util.py:1857
[2024-09-24 21:36:28] [INFO] INFO [Dataset 0] config_util.py:570
[2024-09-24 21:36:28] [INFO] batch_size: 1
[2024-09-24 21:36:28] [INFO] resolution: (512, 512)
[2024-09-24 21:36:28] [INFO] enable_bucket: False
[2024-09-24 21:36:28] [INFO] network_multiplier: 1.0
[2024-09-24 21:36:28] [INFO]
[2024-09-24 21:36:28] [INFO] [Subset 0 of Dataset 0]
[2024-09-24 21:36:28] [INFO] image_dir: "D:\pinokio\api\fluxgym.git\datasets\juna"
[2024-09-24 21:36:28] [INFO] image_count: 9
[2024-09-24 21:36:28] [INFO] num_repeats: 10
[2024-09-24 21:36:28] [INFO] shuffle_caption: False
[2024-09-24 21:36:28] [INFO] keep_tokens: 1
[2024-09-24 21:36:28] [INFO] keep_tokens_separator:
[2024-09-24 21:36:28] [INFO] caption_separator: ,
[2024-09-24 21:36:28] [INFO] secondary_separator: None
[2024-09-24 21:36:28] [INFO] enable_wildcard: False
[2024-09-24 21:36:28] [INFO] caption_dropout_rate: 0.0
[2024-09-24 21:36:28] [INFO] caption_dropout_every_n_epoches: 0
[2024-09-24 21:36:28] [INFO] caption_tag_dropout_rate: 0.0
[2024-09-24 21:36:28] [INFO] caption_prefix: None
[2024-09-24 21:36:28] [INFO] caption_suffix: None
[2024-09-24 21:36:28] [INFO] color_aug: False
[2024-09-24 21:36:28] [INFO] flip_aug: False
[2024-09-24 21:36:28] [INFO] face_crop_aug_range: None
[2024-09-24 21:36:28] [INFO] random_crop: False
[2024-09-24 21:36:28] [INFO] token_warmup_min: 1,
[2024-09-24 21:36:28] [INFO] token_warmup_step: 0,
[2024-09-24 21:36:28] [INFO] alpha_mask: False,
[2024-09-24 21:36:28] [INFO] is_reg: False
[2024-09-24 21:36:28] [INFO] class_tokens: Juna
[2024-09-24 21:36:28] [INFO] caption_extension: .txt
[2024-09-24 21:36:28] [INFO]
[2024-09-24 21:36:28] [INFO]
[2024-09-24 21:36:28] [INFO] INFO [Dataset 0] config_util.py:576
[2024-09-24 21:36:28] [INFO] INFO loading image sizes. train_util.py:881
[2024-09-24 21:36:28] [INFO] 100%|██████████| 9/9 [00:00<?, ?it/s]
[2024-09-24 21:36:28] [INFO] INFO prepare dataset train_util.py:889
[2024-09-24 21:36:28] [INFO] INFO preparing accelerator train_network.py:345
[2024-09-24 21:36:28] [INFO] accelerator device: cuda
[2024-09-24 21:36:28] [INFO] INFO Building Flux model dev flux_utils.py:45
[2024-09-24 21:36:29] [INFO] 2024-09-24 21:36:29 INFO Loading state dict from D:\pinokio\api\fluxgym.git\models\unet\flux1-dev.sft  flux_utils.py:52
[2024-09-24 21:36:29] [INFO] INFO Loaded Flux: <All keys matched successfully>  flux_utils.py:55
[2024-09-24 21:36:29] [INFO] INFO prepare split model flux_train_network.py:110
[2024-09-24 21:36:29] [INFO] INFO load state dict for lower  flux_train_network.py:117
[2024-09-24 21:36:29] [INFO] INFO load state dict for upper  flux_train_network.py:122
[2024-09-24 21:36:29] [INFO] INFO prepare upper model flux_train_network.py:125
[2024-09-24 21:37:28] [INFO] Traceback (most recent call last):
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\bin\miniconda\lib\runpy.py", line 196, in _run_module_as_main
[2024-09-24 21:37:28] [INFO] return _run_code(code, main_globals, None,
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\bin\miniconda\lib\runpy.py", line 86, in run_code
[2024-09-24 21:37:28] [INFO] exec(code, run_globals)
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\api\fluxgym.git\env\Scripts\accelerate.exe_main
.py", line 7, in
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
[2024-09-24 21:37:28] [INFO] args.func(args)
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
[2024-09-24 21:37:28] [INFO] simple_launcher(args)
[2024-09-24 21:37:28] [INFO] File "D:\pinokio\api\fluxgym.git\env\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
[2024-09-24 21:37:28] [INFO] raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
[2024-09-24 21:37:28] [INFO] subprocess.CalledProcessError: Command '['D:\pinokio\api\fluxgym.git\env\Scripts\python.exe', 'sd-scripts/flux_train_network.py', '--pretrained_model_name_or_path', 'D:\pinokio\api\fluxgym.git\models\unet\flux1-dev.sft', '--clip_l', 'D:\pinokio\api\fluxgym.git\models\clip\clip_l.safetensors', '--t5xxl', 'D:\pinokio\api\fluxgym.git\models\clip\t5xxl_fp16.safetensors', '--ae', 'D:\pinokio\api\fluxgym.git\models\vae\ae.sft', '--cache_latents_to_disk', '--save_model_as', 'safetensors', '--sdpa', '--persistent_data_loader_workers', '--max_data_loader_n_workers', '2', '--seed', '42', '--gradient_checkpointing', '--mixed_precision', 'bf16', '--save_precision', 'bf16', '--network_module', 'networks.lora_flux', '--network_dim', '4', '--optimizer_type', 'adafactor', '--optimizer_args', 'relative_step=False', 'scale_parameter=False', 'warmup_init=False', '--split_mode', '--network_args', 'train_blocks=single', '--lr_scheduler', 'constant_with_warmup', '--max_grad_norm', '0.0', '--learning_rate', '8e-4', '--cache_text_encoder_outputs', '--cache_text_encoder_outputs_to_disk', '--fp8_base', '--highvram', '--max_train_epochs', '8', '--save_every_n_epochs', '2', '--dataset_config', 'D:\pinokio\api\fluxgym.git\outputs\juna\dataset.toml', '--output_dir', 'D:\pinokio\api\fluxgym.git\outputs\juna', '--output_name', 'juna', '--timestep_sampling', 'shift', '--discrete_flow_shift', '3.1582', '--model_prediction_type', 'raw', '--guidance_scale', '1', '--loss_type', 'l2']' returned non-zero exit status 3221225477.
[2024-09-24 21:37:29] [ERROR] Command exited with code 1
[2024-09-24 21:37:29] [INFO] Runner:
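
(For reference, the non-zero exit status at the end of that traceback can be decoded: 3221225477 is the unsigned 32-bit form of the Windows NTSTATUS code 0xC0000005, STATUS_ACCESS_VIOLATION, meaning the child python.exe crashed in native code rather than raising a Python exception; on low-VRAM setups this commonly surfaces when memory runs out. A minimal check:)

    # Decode the exit code reported by subprocess.CalledProcessError above.
    status = 3221225477
    print(hex(status))           # 0xc0000005 -> STATUS_ACCESS_VIOLATION
    print(hex(status - 2**32))   # -0x3ffffffb, the signed form some tools show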

@AlekseyCalvin

In your log, the model keys are matched successfully, but accelerate errors out when the process tries to proceed with splitting the model (per the '--split_mode' argument). The real issue is probably GPU overhead anyway: before training even starts, the training scripts make the backends juggle not just the model itself (the transformer/"unet" safetensors) but simultaneously the huge T5XXL text encoder, the smaller CLIP text encoder, and the VAE, regardless of whether you are actually training any of those components.

The quantization and size of those components matter as well. The fp16 T5XXL is nearly 10GB in its own right; the fp8 version, as used by fluxgym, is nearly 5GB. And when it comes to whether the training script even gets to the point of "just" training the model's transformer/"unet" (where slow training on 8GB VRAM might become plausible), none of the Flux training scripts or frameworks so far seem to have sufficiently adaptive CPU offloading mechanisms, or other internal workarounds, to function half as reliably as the average inference framework. It's been quite maddening, actually.
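
(To put rough numbers on that overhead, here is a back-of-the-envelope sketch; the parameter counts are ballpark figures for the Flux ecosystem, not measured values, and weight size is simply parameters × bytes per parameter:)

    # Approximate weight sizes for everything the training script loads up
    # front, whether or not that component is being trained. Parameter
    # counts below are rough public ballpark figures, not exact values.
    GB = 1e9
    components = {
        "flux1-dev transformer ('unet')": 12e9,   # ~12B params
        "T5-XXL text encoder":            4.7e9,  # ~4.7B params
        "CLIP-L text encoder":            0.12e9,
        "VAE":                            0.08e9,
    }
    for label, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1)]:
        total = sum(n * bytes_per_param for n in components.values()) / GB
        print(f"{label}: ~{total:.0f} GB just to hold the weights")
    # T5-XXL alone: 4.7e9 * 2 / GB ≈ 9.4 GB in fp16 ("nearly 10GB"),
    # ≈ 4.7 GB in fp8 ("nearly 5GB"), consistent with the sizes above.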
