train error #44

Open
smacaijicoder opened this issue Dec 31, 2021 · 26 comments

Comments

@smacaijicoder

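Running nnFormer_train 3d_fullres nnFormerTrainerV2_Synapse 2 0 fails with KeyError: 'img0005'.
Why is that? I preprocessed the data as described in the installation instructions, with nnFormer_plan_and_preprocess -t 2.
If I switch to nnFormer_train 3d_fullres nnFormerTrainerV2 2 0, training runs.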

@282857341
Owner

nnFormerTrainerV2_Synapse hard-codes the keys of the training and validation sets here, so that is most likely where the problem is.
nnFormerTrainerV2 has no such setting.

splits[self.fold]['train']=np.array(['img0006','img0007' ,'img0009', 'img0010', 'img0021' ,'img0023' ,'img0024','img0026' ,'img0027' ,'img0031', 'img0033' ,'img0034' \
,'img0039', 'img0040','img0005', 'img0028', 'img0030', 'img0037'])
splits[self.fold]['val']=np.array(['img0001', 'img0002', 'img0003', 'img0004', 'img0008', 'img0022','img0025', 'img0029', 'img0032', 'img0035', 'img0036', 'img0038'])

@smacaijicoder
Author

smacaijicoder commented Dec 31, 2021 via email

@282857341
Owner

It may be because imagesTr has no img0005 entry.

@smacaijicoder
Author

smacaijicoder commented Dec 31, 2021 via email

@282857341
Owner

Could you post a full screenshot of the error?

@smacaijicoder
Author

smacaijicoder commented Dec 31, 2021 via email

@smacaijicoder
Author

It seems the issue cannot display images?
The error message is as follows:
Traceback (most recent call last):
File "/home/yiqingwen/anaconda3/envs/nnFormer/bin/nnFormer_train", line 33, in
sys.exit(load_entry_point('nnformer', 'console_scripts', 'nnFormer_train')())
File "/home/yiqingwen/nnformer/nnFormer/nnformer/run/run_training.py", line 165, in main
trainer.initialize(not validation_only)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainerV2_Synapse.py", line 96, in initialize
self.dl_tr, self.dl_val = self.get_basic_generators()
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainer_synapse.py", line 401, in get_basic_generators
self.do_split()
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainerV2_Synapse.py", line 339, in do_split
self.dataset_tr[i] = self.dataset[i]
KeyError: 'img0005'

@282857341
Owner

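It looks like self.dataset has no img0005 key. Try printing all the keys of self.dataset; normally there should be 30.
After nnFormer_plan_and_preprocess runs, it generates nnFormer_preprocessed from the contents of imagesTr and labelsTr,
and self.dataset uses the names of all the .npz files in the nnFormerData_plans_v2.1_stage1 folder as its keys.

If self.dataset has fewer than 30 keys, data is missing from imagesTr and labelsTr; normally each of them should also contain 30 cases.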
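
The check described above can be sketched in a few lines. This is a minimal illustration, not code from the nnFormer repository: the helper names dataset_keys and missing_keys are hypothetical, and the demo folder is a temporary stand-in for a real nnFormerData_plans_v2.1_stage1 directory.

```python
from pathlib import Path
import tempfile


def dataset_keys(stage_dir):
    """Case identifiers derived from the .npz file names in stage_dir."""
    return sorted(p.stem for p in Path(stage_dir).glob("*.npz"))


def missing_keys(split_keys, available_keys):
    """Keys listed in a split that have no preprocessed case behind them."""
    available = set(available_keys)
    return [k for k in split_keys if k not in available]


# Demo on a throwaway folder that mimics a preprocessed stage directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("AD_001", "AD_002", "AD_067"):
        (Path(tmp) / f"{name}.npz").touch()
    keys = dataset_keys(tmp)

print(keys)
print(missing_keys(["img0005", "AD_067"], keys))
```

Any key reported by missing_keys would trigger the same KeyError when the trainer indexes self.dataset with it.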

@smacaijicoder
Author

smacaijicoder commented Dec 31, 2021

I printed self.dataset; here is one entry:
('AD_067', OrderedDict([('data_file', '/home/yiqingwen/nnformer/DATASET/nnFormer_preprocessed/Task560_AorticDissection/nnFormerData_plans_v2.1_stage1/AD_067.npz'), ('properties_file', '/home/yiqingwen/nnformer/DATASET/nnFormer_preprocessed/Task560_AorticDissection/nnFormerData_plans_v2.1_stage1/AD_067.pkl'), ('properties', OrderedDict([('original_size_of_raw_data', array([1374, 512, 512])), ('original_spacing', array([1., 1., 1.])), ('list_of_data_files', ['/home/yiqingwen/nnformer/DATASET/nnFormer_raw/nnFormer_raw_data/Task560_AorticDissection/imagesTr/AD_067_0000.nii.gz']), ('seg_file', '/home/yiqingwen/nnformer/DATASET/nnFormer_raw/nnFormer_raw_data/Task560_AorticDissection/labelsTr/AD_067.nii.gz'), ('itk_origin', (0.0, 0.0, 0.0)), ('itk_spacing', (1.0, 1.0, 1.0)), ('itk_direction', (1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)), ('crop_bbox', [[0, 1374], [0, 512], [0, 512]]), ('classes', array([-1., 0., 1., 2., 3.], dtype=float32)), ('size_after_cropping', (1374, 512, 512)), ('use_nonzero_mask_for_norm', OrderedDict([(0, False)])), ('size_after_resampling', (1374, 512, 512)), ('spacing_after_resampling', array([1., 1., 1.])), ('class_locations', {1: array([[519, 276, 288],
[582, 281, 286],
[245, 275, 305],
...,
[482, 278, 283],
[179, 278, 299],
[303, 287, 274]]), 2: array([[778, 247, 262],
[382, 239, 241],
[859, 251, 248],
...,
[519, 295, 274],
[346, 225, 224],
[680, 272, 257]]), 3: array([[198, 229, 237],
[972, 249, 221],
[945, 248, 227],
...,
[186, 253, 273],
[716, 273, 291],
[164, 230, 246]])})]))]))])
I have 68 cases. Could the error be because keys such as 'AD_067' do not match 'img0005'? If so, should I change all the keys on lines 311-314 to the 'AD_0XX' format?

@282857341
Owner

Yes. If you want to use other data, you need to replace them.
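
For a custom dataset, the hard-coded lists could be generated instead of typed by hand. The sketch below is only an assumption about how one might do that: make_split is a hypothetical helper, the identifiers AD_001 to AD_068 are invented for illustration (the thread shows only AD_067 and mentions 68 cases), and the 0.4 validation fraction simply mirrors the 12-of-30 ratio in the Synapse lists.

```python
import random


def make_split(case_ids, val_fraction=0.4, seed=0):
    """Deterministically split case identifiers into train and val lists."""
    ids = sorted(case_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_val = max(1, round(len(ids) * val_fraction))
    return {"train": sorted(ids[n_val:]), "val": sorted(ids[:n_val])}


# Hypothetical identifiers for a 68-case dataset.
case_ids = [f"AD_{i:03d}" for i in range(1, 69)]
split = make_split(case_ids)

print(len(split["train"]), len(split["val"]))
```

The two lists could then be wrapped in np.array and assigned to splits[self.fold]['train'] and splits[self.fold]['val'] in place of the img00XX keys.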

@smacaijicoder
Author

Thanks! The loading error is fixed, but a new problem has appeared. The error message is as follows:
Traceback (most recent call last):
File "/home/yiqingwen/anaconda3/envs/nnFormer/bin/nnFormer_train", line 33, in
sys.exit(load_entry_point('nnformer', 'console_scripts', 'nnFormer_train')())
File "/home/yiqingwen/nnformer/nnFormer/nnformer/run/run_training.py", line 181, in main
trainer.run_training()
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainerV2_Synapse.py", line 448, in run_training
ret = super().run_training()
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainer_synapse.py", line 319, in run_training
super(nnFormerTrainer_synapse, self).run_training()
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/network_trainer_synapse.py", line 481, in run_training
l = self.run_iteration(self.tr_gen, True)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainerV2_Synapse.py", line 248, in run_iteration
output = self.network(data)
File "/home/yiqingwen/anaconda3/envs/nnFormer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/Swin_Unet_l_gelunorm.py", line 928, in forward
skips = self.model_down(x)
File "/home/yiqingwen/anaconda3/envs/nnFormer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/Swin_Unet_l_gelunorm.py", line 781, in forward
x_out, S, H, W, x, Ws, Wh, Ww = layer(x, Ws, Wh, Ww)
File "/home/yiqingwen/anaconda3/envs/nnFormer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/Swin_Unet_l_gelunorm.py", line 431, in forward
x = blk(x, attn_mask)
File "/home/yiqingwen/anaconda3/envs/nnFormer/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/Swin_Unet_l_gelunorm.py", line 219, in forward
assert L == S * H * W, "input feature has wrong size"
AssertionError: input feature has wrong size

The values of L and S, H, W are:
L 75264
S, H, W 32 32 32
What is the problem here?

@282857341
Owner

If your input crop size matches the one used for Synapse, this problem should not occur.
The crop size can be changed here:

if task=='Task001_ACDC':
    plans['plans_per_stage'][0]['batch_size']=4
    plans['plans_per_stage'][0]['patch_size']=np.array([14,160,160])
    plans['plans_per_stage'][0]['pool_op_kernel_sizes']=[[1, 2, 2], [1, 2, 2], [2, 2, 2], [2, 2, 2]]
    plans['plans_per_stage'][0]['conv_kernel_sizes']=[[3,3,3],[3,3,3],[3,3,3],[3,3,3],[3,3,3]]
    pickle_file = open(plans_file,'wb')
    pickle.dump(plans, pickle_file)
    pickle_file.close()
    # I downsample the data four times in synapse but twice (z axis) in the ACDC
    # if you want to design a new way, you should reassign the value of pool_op_kernel_sizes
    # 2 represents downsample and 1 for not downsample, each list in the pool_op_kernel_sizes represents the stage
    # when you change the pool_op_kernel_sizes, make sure change the code in the network
    # conv_kernel_sizes is not important
elif task=='Task002_Synapse':
    plans['plans_per_stage'][1]['batch_size']=2
    plans['plans_per_stage'][1]['patch_size']=np.array([64,128,128])
    plans['plans_per_stage'][1]['pool_op_kernel_sizes']=[[2,2,2],[2,2,2],[2,2,2],[2,2,2]]
    plans['plans_per_stage'][1]['conv_kernel_sizes']=[[3,3,3],[3,3,3],[3,3,3],[3,3,3],[3,3,3]]
    pickle_file = open(plans_file,'wb')
    pickle.dump(plans, pickle_file)
    pickle_file.close()
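
The failing assertion compares the token count L with S * H * W, so the arithmetic behind it can be checked by hand. The sketch below assumes the first patch-embedding stage downsamples by (2, 4, 4), which would turn the 64 x 128 x 128 Synapse crop into the 32 x 32 x 32 grid reported above; that stride is inferred from the numbers in this thread, not quoted from the repository. The (48, 224, 224) crop is purely hypothetical, picked because it reproduces 75264; the thread does not state the actual patch size.

```python
from functools import reduce
from operator import mul


def token_grid(patch_size, embed_stride=(2, 4, 4)):
    """Token grid after patch embedding, assuming the given stride."""
    if any(p % s for p, s in zip(patch_size, embed_stride)):
        raise ValueError("patch size must be divisible by the embedding stride")
    return tuple(p // s for p, s in zip(patch_size, embed_stride))


def count(grid):
    """Number of tokens in a grid."""
    return reduce(mul, grid, 1)


synapse_grid = token_grid((64, 128, 128))  # grid the network was built for
expected_tokens = count(synapse_grid)      # value of S * H * W in the assert
observed_tokens = 75264                    # value of L reported in the thread

# Hypothetical crop that would yield the observed token count.
other_grid = token_grid((48, 224, 224))

print(synapse_grid, expected_tokens, observed_tokens == expected_tokens)
print(other_grid, count(other_grid))
```

Under these assumptions the network expects 32768 tokens while the data loader delivers 75264, which is why setting patch_size to the Synapse value makes the assertion pass.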

@smacaijicoder
Author

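Sorry to bother you again. I ran into another problem at prediction time, and after looking at it for a long time I still do not know what to change. The imagesTs and imagesTr I use are both from my own dataset, and the same data runs fine with the nnUNet code.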
Traceback (most recent call last):
File "/home/yiqingwen/anaconda3/envs/nnFormer/bin/nnFormer_predict", line 33, in
sys.exit(load_entry_point('nnformer', 'console_scripts', 'nnFormer_predict')())
File "/home/yiqingwen/nnformer/nnFormer/nnformer/inference/predict_simple.py", line 229, in main
step_size=step_size, checkpoint_name=args.chk)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/inference/predict.py", line 637, in predict_from_folder
segmentation_export_kwargs=segmentation_export_kwargs)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/inference/predict.py", line 221, in predict_cases
mixed_precision=mixed_precision)[1][None])
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainerV2_Synapse.py", line 221, in predict_preprocessed_data_return_seg_and_softmax
mixed_precision=mixed_precision)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/training/network_training/nnFormerTrainer_synapse.py", line 524, in predict_preprocessed_data_return_seg_and_softmax
mixed_precision=mixed_precision)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/neural_network.py", line 150, in predict_3D
verbose=verbose)
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/neural_network.py", line 387, in _internal_predict_3D_3Dconv_tiled
gaussian_importance_map)[0]
File "/home/yiqingwen/nnformer/nnFormer/nnformer/network_architecture/neural_network.py", line 525, in _internal_maybe_mirror_and_pred_3D
result_torch += 1 / num_results * pred
RuntimeError: The size of tensor a (4) must match the size of tensor b (14) at non-singleton dimension 1
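The second issue is training time. In my tests nnFormer is nearly 4 times faster than nnUNet, but each epoch still takes about 6 minutes. I do not know whether that is because I did not compile PyTorch from source as the nnUNet author recommends. I am using a single RTX 2080.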

@282857341
Owner

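The number of iterations per epoch is fixed, so you can cut training time by reducing the number of epochs or the number of batches per epoch. Also, set self.num_val_batches_per_epoch to 50. I had changed it to 167 for reproduction purposes. That will make it a bit faster.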

self.num_batches_per_epoch = 250
self.num_val_batches_per_epoch = 50
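
As a rough illustration of what that change buys, the snippet below counts batches per epoch before and after. The helper name batches_per_epoch is hypothetical, and the counts say nothing precise about wall-clock time, since validation batches typically skip the backward pass and are cheaper than training batches.

```python
def batches_per_epoch(num_batches_per_epoch, num_val_batches_per_epoch):
    """Total batches the network processes in one epoch (train + val)."""
    return num_batches_per_epoch + num_val_batches_per_epoch


before = batches_per_epoch(250, 167)  # validation setting used for reproduction
after = batches_per_epoch(250, 50)    # setting suggested above
saved = before - after

print(before, after, saved)
```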

As for the predict error, can you print the actual shapes at the point where it fails?

@282857341
Owner

By the way, how many classes does your data have, including background?

@smacaijicoder
Author

pred type <class 'torch.Tensor'>
pred shape torch.Size([1, 14, 64, 128, 128])
result_torch type <class 'torch.Tensor'>
result_torch shape torch.Size([1, 4, 64, 128, 128])
My data has exactly 4 classes including background, so that must be the problem. What should I change?

@282857341
Owner

self.final.append(final_patch_expanding(embed_dim*2**i,num_classes,patch_size=patch_size))

The code here originally set the number of output channels to 14; changing it to num_classes fixes it.
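
The mismatch can be caught before the accumulation step with a simple shape check. This is a sketch only: check_channel_match is a hypothetical helper that works on plain shape tuples, and the two shapes are the ones printed earlier in this thread.

```python
def check_channel_match(pred_shape, result_shape):
    """Raise if prediction and accumulator disagree on the channel axis (index 1)."""
    if pred_shape[1] != result_shape[1]:
        raise ValueError(
            f"network emits {pred_shape[1]} channels but the accumulator "
            f"expects {result_shape[1]}"
        )


pred_shape = (1, 14, 64, 128, 128)    # output of the network head
result_shape = (1, 4, 64, 128, 128)   # buffer sized from the dataset's classes

try:
    check_channel_match(pred_shape, result_shape)
    message = "shapes agree"
except ValueError as err:
    message = str(err)

print(message)
```

Once the head is built with num_classes output channels, both shapes carry 4 in position 1 and the check passes.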

@smacaijicoder
Author

RuntimeError: Error(s) in loading state_dict for swintransformer:
size mismatch for final.0.up.weight: copying a param with shape torch.Size([192, 14, 2, 4, 4]) from checkpoint, the shape in current model is torch.Size([192, 4, 2, 4, 4]).
size mismatch for final.0.up.bias: copying a param with shape torch.Size([14]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for final.1.up.weight: copying a param with shape torch.Size([384, 14, 2, 4, 4]) from checkpoint, the shape in current model is torch.Size([384, 4, 2, 4, 4]).
size mismatch for final.1.up.bias: copying a param with shape torch.Size([14]) from checkpoint, the shape in current model is torch.Size([4]).
size mismatch for final.2.up.weight: copying a param with shape torch.Size([768, 14, 2, 4, 4]) from checkpoint, the shape in current model is torch.Size([768, 4, 2, 4, 4]).
size mismatch for final.2.up.bias: copying a param with shape torch.Size([14]) from checkpoint, the shape in current model is torch.Size([4]).

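A new problem has come up. The shapes in the model saved during training appear to be wrong. Will the change to Swin_Unet_l_gelunorm.py above fix it? I suppose I will have to retrain, though.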

@282857341
Owner

Just comment out the code in the trainer that loads the pretrained weights.

checkpoint = torch.load("../Pretrained_weight/pretrain_Synapse.model", map_location='cuda')
self.network.load_state_dict(checkpoint['state_dict'])
print('I am using the pre_train weight!!')

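The problem arises mainly because the model changed between training and inference. Inference re-initializes the network, and the code that loads the pretrained weights sits inside the network initialization code. If you retrain, this problem will not occur.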
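
A common PyTorch pattern for this kind of size mismatch, offered here as an assumption rather than something the thread confirms for nnFormer, is to keep only the checkpoint entries whose shapes match the current model and load those non-strictly. The sketch below works on any objects with a .shape attribute so that it runs without PyTorch; compatible_entries and the "encoder.weight" entry are hypothetical, while the final.0.up shapes are taken from the error message above.

```python
from collections import namedtuple


def compatible_entries(checkpoint_state, model_state):
    """Split checkpoint entries into shape-compatible ones and skipped names."""
    kept, skipped = {}, []
    for name, value in checkpoint_state.items():
        target = model_state.get(name)
        if target is not None and tuple(value.shape) == tuple(target.shape):
            kept[name] = value
        else:
            skipped.append(name)
    return kept, skipped


# Stand-in for tensors: only the shape matters for this check.
Fake = namedtuple("Fake", "shape")

checkpoint = {
    "encoder.weight": Fake((96, 1, 2, 4, 4)),         # hypothetical entry
    "final.0.up.weight": Fake((192, 14, 2, 4, 4)),
    "final.0.up.bias": Fake((14,)),
}
model = {
    "encoder.weight": Fake((96, 1, 2, 4, 4)),
    "final.0.up.weight": Fake((192, 4, 2, 4, 4)),
    "final.0.up.bias": Fake((4,)),
}

kept, skipped = compatible_entries(checkpoint, model)
print(sorted(kept), skipped)
# With real tensors, the kept entries could be passed to
# network.load_state_dict(kept, strict=False).
```

Whether partially loaded Synapse weights actually help on a dataset with a different number of classes is not answered in this thread.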

@smacaijicoder
Author

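Does that mean I cannot use your pretrained model on my own dataset? When I retrained, I still got the error unless I commented out the three lines above. Or should I train my own pretrained model?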

@MOMOANNIE

(Quoting @smacaijicoder's comment above: the training traceback ending in "AssertionError: input feature has wrong size", with L = 75264 and S, H, W = 32 32 32.)

Hello, how did you solve this problem?

@smacaijicoder
Author

smacaijicoder commented Mar 6, 2023 via email

@congcongwy51

(Quoting the same "AssertionError: input feature has wrong size" traceback, with L = 75264 and S, H, W = 32 32 32, and @MOMOANNIE's follow-up question asking how it was solved.)

Did you manage to solve it?

@smacaijicoder
Author

smacaijicoder commented Mar 27, 2024 via email

@congcongwy51

splits[self.fold]['train']=np.array(['img0006','img0007' ,'img0009', 'img0010', 'img0021' ,'img0023' ,'img0024','img0026' ,'img0027' ,'img0031', 'img0033' ,'img0034'
,'img0039', 'img0040','img0005', 'img0028', 'img0030', 'img0037'])
splits[self.fold]['val']=np.array(['img0001', 'img0002', 'img0003', 'img0004', 'img0008', 'img0022','img0025', 'img0029', 'img0032', 'img0035', 'img0036', 'img0038'])

Does the data listed in this code split my training set into train and val? And there is no need to add a test set here, right?

@congcongwy51

(Quoting the owner's reply above: if the input crop size matches the Synapse one the problem should not occur, and the crop size can be changed in the Task001_ACDC / Task002_Synapse plans code shown there.)

My S, H, W are 32 32 32. How should I modify this?
