
opensora-plan-v1.1: all videos generated by inference look like mosaic noise #668

Open
guyuliang7 opened this issue Sep 20, 2024 · 10 comments

@guyuliang7

mindone version: latest
MindSpore version: 2.3.1
CANN version: 8.0.RC2 (also tried the provided C18 CANN package)

examples/opensora_pku/scripts/text_condition/sample_video_221.sh
examples/opensora_pku/scripts/text_condition/sample_video_65.sh

@SamitHuang
Collaborator

Hi, could you share the inference logs? Were the weights loaded correctly?

@guyuliang7
Author

> Hi, could you share the inference logs? Were the weights loaded correctly?

/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float64'> type is zero.
return self._float_to_str(self.smallest_subnormal)
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/numpy/core/getlimits.py:499: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
setattr(self, word, getattr(machar, word).flat[0])
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/numpy/core/getlimits.py:89: UserWarning: The value of the smallest subnormal for <class 'numpy.float32'> type is zero.
return self._float_to_str(self.smallest_subnormal)
Warning, cannot find compiled version of RoPE2D, using a slow version instead
Warning, cannot find compiled version of RoPE2D, using a slow version instead
Get New FA API!
[2024-09-24 10:35:31] INFO: Using jit_level: O0
[2024-09-24 10:35:31] INFO: vae init
[WARNING] CORE(234340,ffff9fc2a020,python):2024-09-24-10:35:31.667.049 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.
The config attributes {'in_channels': 3, 'out_channels': 3} were passed to CausalVAEModel, but are not expected and will be ignored. Please verify your config.json configuration file.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.0.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.0.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.11.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.11.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.2.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.3.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.3.num_batches_tracked from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.3.running_mean from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.3.running_var from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.3.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.5.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.6.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.6.num_batches_tracked from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.6.running_mean from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.6.running_var from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.6.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.8.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.9.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.9.num_batches_tracked from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.9.running_mean from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.9.running_var from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.discriminator.main.9.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.logvar from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.lin0.model.1.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.lin1.model.1.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.lin2.model.1.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.lin3.model.1.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.lin4.model.1.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice1.0.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice1.0.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice1.2.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice1.2.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice2.5.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice2.5.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice2.7.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice2.7.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.10.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.10.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.12.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.12.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.14.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice3.14.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.17.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.17.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.19.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.19.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.21.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice4.21.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.24.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.24.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.26.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.26.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.28.bias from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.net.slice5.28.weight from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.scaling_layer.scale from state_dict.
[2024-09-24 10:35:41] INFO: Deleting key loss.perceptual_loss.scaling_layer.shift from state_dict.
[2024-09-24 10:35:41] INFO: Restored from /data/nfs/gyl/work/Open-Sora-Plan-v1.1.0/vae/causalvae_488.ckpt
[2024-09-24 10:35:41] INFO: Use amp level O2 for causal 3D VAE with dtype=Float16, custom_fp32_cells: []
[2024-09-24 10:35:41] INFO: Number of prompts: 19
[2024-09-24 10:35:41] INFO: Number of generated samples for each prompt 1
[2024-09-24 10:35:41] INFO: loading annotations from ./sample_videos/prompt_list_65/dataset.csv ...
[2024-09-24 10:35:41] INFO: Num data samples: 19
[2024-09-24 10:35:45] INFO: Num batches: 19
[2024-09-24 10:35:47] INFO: Latte-65x512x512 init
The config attributes {'attention_mode': 'xformers'} were passed to LatteT2V, but are not expected and will be ignored. Please verify your config.json configuration file.
[2024-09-24 10:36:13] INFO: Restored from ckpt /data/nfs/gyl/work/Open-Sora-Plan-v1.1.0/65x512x512/LatteT2V-65x512x512.ckpt
[WARNING] ME(234340:281473362075680,MainProcess):2024-09-24-10:36:23.575.260 [mindspore/train/serialization.py:1560] For 'load_param_into_net', 2 parameters in the 'net' are not loaded, because they are not in the 'parameter_dict', please check whether the network structure is consistent when training and loading checkpoint.
[WARNING] ME(234340:281473362075680,MainProcess):2024-09-24-10:36:23.575.800 [mindspore/train/serialization.py:1564] ['temp_pos_embed', 'pos_embed.pos_embed'] are not loaded.
net param not load: ['temp_pos_embed', 'pos_embed.pos_embed'] 2
ckpt param not load: [] 0
[2024-09-24 10:36:23] INFO: Set mixed precision to O2 with dtype=fp16
[2024-09-24 10:36:23] INFO: T5 init
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
[2024-09-24 10:36:41] INFO: Load checkpoint from /data/nfs/gyl/work/opensora/t5-v1_1-xxl/model.ckpt.
[2024-09-24 10:37:34] WARNING: Checkpoint not loaded: ['shared.embedding_table']
[2024-09-24 10:37:34] INFO: Use amp level O2 for text encoder T5 with dtype=Float16
[2024-09-24 10:37:34] INFO: Key Settings:

MindSpore mode[GRAPH(0)/PYNATIVE(1)]: 0
Num of samples: 19
Num params: 1,305,797,131 (latte: 1,058,937,632, vae: 246,859,499)
Num trainable params: 0
Transformer dtype: Float16
VAE dtype: Float16
Text encoder dtype: Float16
Sampling steps 150
Sampling method: DDIM
CFG guidance scale: 7.5
Enable flash attention: True (Float16)

@guyuliang7
Author

0%| | 0/19 [00:00<?, ?it/s]start compile Ascend C operator MatMulV3. kernel name is te_matmulv3_12832c7f8fbacbaef0c1a296941370ee5565ee599a1f730337eed8987b8f777d_1
start compile Ascend C operator MatMulV3. kernel name is te_matmulv3_9387668753053eef8001047dc4129b9a32f86c0a75ccb6ae8407a5d692916f74_1 | 0/150 [00:00<?, ?it/s]
compile Ascend C operator: MatMulV3 success!
compile Ascend C operator: MatMulV3 success!
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [04:14<00:00, 1.70s/it]###

@guyuliang7
Author

I hit the same problem with opensora: inference results are normal with opensora-1.0, but with opensora-1.1 and opensora-1.2 the outputs are all mosaic.

@CaitinZhao
Collaborator

[WARNING] CORE(234340,ffff9fc2a020,python):2024-09-24-10:35:31.667.049 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.

From the log above, it looks like execution did not stay at jit_level O0 but was switched to O2. Check whether the RANK_TABLE_FILE or MINDSPORE_HCCL_CONFIG_PATH environment variable is set, and try removing it.
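The check can be done from the shell before launching inference; a minimal sketch (the variable names come from the warning above, everything else is an assumption):

```shell
# List any rank-table variables currently set; print a note if none are.
env | grep -E 'RANK_TABLE_FILE|MINDSPORE_HCCL_CONFIG_PATH' \
  || echo "no rank-table variables set"

# Unset them so single-device inference keeps jit_level O0 instead of
# being forced to O2 by the rank-table startup path.
unset RANK_TABLE_FILE
unset MINDSPORE_HCCL_CONFIG_PATH
```

Note these must be unset in the same shell (or launch script) that starts the sampling run, since child processes inherit the environment.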

@guyuliang7
Author

[WARNING] CORE(234340,ffff9fc2a020,python):2024-09-24-10:35:31.667.049 [mindspore/core/utils/ms_context.cc:531] GetJitLevel] Set jit level to O2 for rank table startup method.

> From the log above, it looks like execution did not stay at jit_level O0 but was switched to O2. Check whether the RANK_TABLE_FILE or MINDSPORE_HCCL_CONFIG_PATH environment variable is set, and try removing it.

The environment variable RANK_TABLE_FILE=/user/config/nbstart_hccl.json was indeed set, but removing it did not improve the results.

@CaitinZhao
Collaborator

Is that log line still there?

@guyuliang7
Author

> Is that log line still there?

It is gone now.

[2024-09-24 14:27:08] INFO: Using jit_level: O0
[2024-09-24 14:27:08] INFO: vae init
The config attributes {'in_channels': 3, 'out_channels': 3} were passed to CausalVAEModel, but are not expected and will be ignored. Please verify your config.json configuration file.

@guyuliang7
Author

Why were these model weights not loaded?
[2024-09-24 15:23:22] INFO: Load checkpoint from /data/nfs/gyl/work/opensora/t5-v1_1-xxl/t5-v1_1-xxl.ckpt.
[2024-09-24 15:24:17] WARNING: Checkpoint not loaded: ['shared.embedding_table']
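For context, a warning like `Checkpoint not loaded: ['shared.embedding_table']` means the network declares a parameter name that the checkpoint file does not contain, which typically happens when the checkpoint was converted with a script that used different key names. A minimal illustration with plain dicts of parameter names (the renamed key here is hypothetical):

```python
# Parameter names the network expects (illustrative subset).
net_params = {"shared.embedding_table", "encoder.block.0.layer.0.weight"}

# An outdated conversion script might have written the embedding under a
# different key, so the expected name is absent from the checkpoint.
ckpt_params = {"shared.weight", "encoder.block.0.layer.0.weight"}

# Any network parameter missing from the checkpoint is reported as not loaded.
not_loaded = sorted(net_params - ckpt_params)
print("Checkpoint not loaded:", not_loaded)  # → ['shared.embedding_table']
```

This is why re-running the current conversion script (which writes the keys the network expects) makes the warning disappear.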

@wtomin
Collaborator

wtomin commented Oct 3, 2024

Hi, I switched to the container image you provided and ran scripts/text_condition/sample_video_65.sh; the generated video is normal, so I could not reproduce your problem.

However, comparing our inference logs, I found the following differences:

  1. The default sampling_method in mindone/opensora_pku is PNDM, while your log shows you are using DDIM;
  2. The default data type for the Transformer, VAE, and T5 text encoder is BF16, while your log shows all three models running in FP16;
  3. The T5 checkpoint produced by scripts/model_conversion/convert_all.sh is named t5-v1_1-xxl.ckpt, while your log shows you are loading model.ckpt. Note that the weight conversion script has been updated; your model.ckpt was most likely produced by the old conversion script.

Taken together, the problem is probably not your environment but your code version. The commit id I used is 2a7adcf0 (Sep 24). You can check out that commit, or update directly to the latest master. Also, please re-run the weight conversion script convert_all.sh.

Please follow up if the problem persists.
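The checkpoint-name check in point 3 above can be scripted before sampling; a small sketch (the filenames follow the logs in this thread, the helper name is hypothetical):

```shell
# Sanity-check a converted T5 checkpoint directory before sampling.
# t5-v1_1-xxl.ckpt is what the updated convert_all.sh produces;
# model.ckpt suggests an older conversion script was used.
check_t5_ckpt() {
  dir="$1"
  if [ -f "$dir/t5-v1_1-xxl.ckpt" ]; then
    echo "ok: t5-v1_1-xxl.ckpt found"
  elif [ -f "$dir/model.ckpt" ]; then
    echo "stale: model.ckpt found; re-run scripts/model_conversion/convert_all.sh"
  else
    echo "missing: no T5 checkpoint in $dir"
  fi
}
```

For example, `check_t5_ckpt /data/nfs/gyl/work/opensora/t5-v1_1-xxl` would report the stale model.ckpt seen in the logs above.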
