zipped ImageNet processing scripts #192

Open
gongjingcs opened this issue Apr 18, 2022 · 4 comments

@gongjingcs

Hi, can you provide the processing scripts for the zipped ImageNet format?
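
For reference (not an official script), a minimal sketch of one way to build the zipped layout from a standard train/<synset>/<image>.JPEG tree: pack each split into a single zip whose member names keep the relative paths. The archive names below (train.zip, val.zip) are assumptions and should be checked against the repo's data-preparation notes.

import os
import zipfile

# Rough sketch only: packs an ImageFolder-style split such as
# train/<synset>/<image>.JPEG into one zip whose member names keep the
# <synset>/<image>.JPEG relative paths.
def zip_split(split_dir, zip_path):
    # JPEGs are already compressed, so ZIP_STORED keeps packing fast
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for synset in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, synset)
            if not os.path.isdir(class_dir):
                continue
            for name in sorted(os.listdir(class_dir)):
                zf.write(os.path.join(class_dir, name), arcname=f"{synset}/{name}")

# Assumed usage:
# zip_split("imagenet/train", "train.zip")
# zip_split("imagenet/val", "val.zip")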

@OxInsky

OxInsky commented May 13, 2022

+1. I found that the data preparation needs more than 100 GB of memory when training Swin-Transformer, which is hard to believe. Here is some information about the problem.

Conditions:
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 0
  AUTO_RESUME: true
  BASE_LR: 0.0004375
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 4.3750000000000005e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 4.375e-07
  WEIGHT_DECAY: 0.05
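
As a side note on the config above: the odd-looking BASE_LR is what you get by linearly scaling the default learning rates by the total batch size over 512 (8 GPUs x batch 56 = 448). A quick sketch of the arithmetic; the 5e-4 / 5e-7 / 5e-6 defaults and the 512 divisor are taken from the Swin-Transformer configs as I understand them, so treat them as assumptions:

# Sketch of the linear LR scaling implied by the dump above (assumed rule).
base_lr, warmup_lr, min_lr = 5e-4, 5e-7, 5e-6   # repo defaults (assumed)
batch_per_gpu, world_size = 56, 8               # values from this run

scale = batch_per_gpu * world_size / 512.0      # 448 / 512 = 0.875
print(base_lr * scale)    # ~4.375e-04 -> BASE_LR
print(warmup_lr * scale)  # ~4.375e-07 -> WARMUP_LR
print(min_lr * scale)     # ~4.375e-06 -> MIN_LR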

global_rank 6 cached 0/1281167 takes 0.00s per block
global_rank 3 cached 0/1281167 takes 0.00s per block
global_rank 5 cached 0/1281167 takes 0.00s per block
global_rank 2 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 7 cached 0/1281167 takes 0.00s per block
global_rank 1 cached 0/1281167 takes 0.00s per block
global_rank 4 cached 0/1281167 takes 0.00s per block
global_rank 6 cached 128116/1281167 takes 52.54s per block
global_rank 5 cached 128116/1281167 takes 52.40s per block
global_rank 4 cached 128116/1281167 takes 51.70s per block
global_rank 7 cached 128116/1281167 takes 52.25s per block
global_rank 0 cached 128116/1281167 takes 52.32s per block
global_rank 2 cached 128116/1281167 takes 52.33s per block
global_rank 3 cached 128116/1281167 takes 52.48s per block
global_rank 1 cached 128116/1281167 takes 52.20s per block
global_rank 0 cached 256232/1281167 takes 25.78s per block
global_rank 7 cached 256232/1281167 takes 25.78s per block
global_rank 6 cached 256232/1281167 takes 25.78s per block
global_rank 3 cached 256232/1281167 takes 25.78s per block
global_rank 4 cached 256232/1281167 takes 25.78s per block
global_rank 5 cached 256232/1281167 takes 25.78s per block
global_rank 2 cached 256232/1281167 takes 25.78s per block
global_rank 1 cached 256232/1281167 takes 25.78s per block

The counter goes up to the number of train/val images. Is it caching the image data or just the file list? My memory runs out, so I think it is the image data (a rough sketch of what that caching looks like is at the end of this comment).

The failure looks like this:
global_rank 3 cached 640580/1281167 takes 28.99s per block
global_rank 0 cached 768696/1281167 takes 27.38s per block
global_rank 1 cached 768696/1281167 takes 27.37s per block
global_rank 2 cached 768696/1281167 takes 27.38s per block
global_rank 4 cached 768696/1281167 takes 27.38s per block
global_rank 7 cached 768696/1281167 takes 27.38s per block
global_rank 5 cached 768696/1281167 takes 27.38s per block
global_rank 6 cached 768696/1281167 takes 27.38s per block
global_rank 2 cached 896812/1281167 takes 27.98s per block
global_rank 6 cached 896812/1281167 takes 27.98s per block
global_rank 7 cached 896812/1281167 takes 27.98s per block
global_rank 4 cached 896812/1281167 takes 27.98s per block
global_rank 0 cached 896812/1281167 takes 27.99s per block
global_rank 1 cached 896812/1281167 takes 27.99s per block
Traceback (most recent call last):
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/swin/bin/python', '-u', 'main.py', '--local_rank=7', '--cfg', 'configs/swin_small_patch4_window7_224.yaml', '--output=/root/tuantuan1/model/swin/', '--zip', '--cache-mode', 'part', '--data-path', '/root/tuantuan1/data/ImageNet-Zip', '--batch-size', '56']' died with <Signals.SIGKILL: 9>.
(swin) root@a41cbab8ac5e:~/workspace/Swin-Transformer# global_rank 2 cached 1024928/1281167 takes 32.62s per block
global_rank 7 cached 1024928/1281167 takes 32.62s per block
global_rank 1 cached 1024928/1281167 takes 32.61s per block
global_rank 4 cached 1024928/1281167 takes 32.63s per block
global_rank 6 cached 1024928/1281167 takes 32.63s per block
global_rank 4 cached 1153044/1281167 takes 31.60s per block

The OOM kill shows up in dmesg; command and output:

dmesg -T | grep -E -i -B100 'killed process'
[Fri May 13 22:25:33 2022] Memory cgroup stats for /docker/a41cbab8ac5e0d3680e664e26bcc8890070bf4607a0744f00051fcc046a9ca1b: cache:345232KB rss:104203124KB rss_huge:0KB shmem:342340KB mapped_file:344784KB dirty:264KB writeback:0KB inactive_anon:62144KB active_anon:104487672KB inactive_file:1508KB active_file:40KB unevictable:0KB
[Fri May 13 22:25:33 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri May 13 22:25:33 2022] [56913] 0 56913 271 1 32768 0 1000 docker-init
[Fri May 13 22:25:33 2022] [56985] 0 56985 4540 781 81920 0 1000 bash
[Fri May 13 22:25:33 2022] [ 538] 0 538 4540 514 73728 0 1000 bash
[Fri May 13 22:25:33 2022] [ 540] 0 540 1094 162 57344 0 1000 sleep
[Fri May 13 22:25:33 2022] [50918] 0 50918 16378 984 163840 0 1000 sshd
[Fri May 13 22:25:33 2022] [29230] 0 29230 23231 1686 212992 0 1000 sshd
[Fri May 13 22:25:33 2022] [29337] 0 29337 3220 485 69632 0 1000 sftp-server
[Fri May 13 22:25:33 2022] [52248] 0 52248 23235 1731 217088 0 1000 sshd
[Fri May 13 22:25:33 2022] [52268] 0 52268 4621 887 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [21352] 0 21352 12791 2062 131072 0 1000 vim
[Fri May 13 22:25:33 2022] [25338] 0 25338 12791 2081 135168 0 1000 vim
[Fri May 13 22:25:33 2022] [17019] 0 17019 2404 654 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [17274] 0 17274 2404 631 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [ 2912] 0 2912 23199 1720 225280 0 1000 sshd
[Fri May 13 22:25:33 2022] [ 2926] 0 2926 4611 875 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [30026] 0 30026 14506230 5336525 44224512 0 1000 python
[Fri May 13 22:25:33 2022] [30027] 0 30027 14499677 5329772 44138496 0 1000 python
[Fri May 13 22:25:33 2022] [30029] 0 30029 14553669 5350753 44294144 0 1000 python
[Fri May 13 22:25:33 2022] [30031] 0 30031 14511896 5342122 44257280 0 1000 python
[Fri May 13 22:25:33 2022] [30032] 0 30032 14511890 5342005 44220416 0 1000 python
[Fri May 13 22:25:33 2022] [44777] 0 44777 1094 167 53248 0 1000 sleep
[Fri May 13 22:25:33 2022] Memory cgroup out of memory: Kill process 30029 (python) score 1198 or sacrifice child
[Fri May 13 22:25:33 2022] Killed process 30029 (python) total-vm:58214676kB, anon-rss:20876164kB, file-rss:439200kB, shmem-rss:87648kB

My memory information is as follows (swap is supposed to equal the memory, but it was limited to 0 in the container):

root@a41cbab8ac5e:# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          91         123           0          36         157
Swap:             0           0           0
root@a41cbab8ac5e:#
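
For what it's worth, the growing "cached i/total" counter and the OOM are consistent with the zip loader keeping raw JPEG bytes in host memory when --cache-mode part is used. Below is an illustrative reconstruction of that behaviour, not the repo's actual CachedImageFolder code; the per-rank sharding rule is an assumption:

import zipfile

# Illustrative sketch only: approximates the behaviour behind the
# "global_rank N cached i/total" messages above, not the real Swin code.
def cache_shard(zip_path, member_names, rank, world_size):
    cache = {}  # member name -> raw JPEG bytes, held in host RAM
    with zipfile.ZipFile(zip_path) as zf:
        for i, name in enumerate(member_names):
            # assumed "part" behaviour: each rank keeps ~1/world_size of the
            # images as raw bytes, so the ranks on one node together hold
            # roughly the whole train set regardless of batch size
            if i % world_size == rank:
                cache[name] = zf.read(name)
    return cache

If that is what is happening, the "no" value that the --cache-mode flag appears to accept would trade the host RAM for repeated reads from the zip.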

@OxInsky

OxInsky commented May 13, 2022

Please help me, thanks very much!

@OxInsky

OxInsky commented May 13, 2022

I succeeded! I increased my memory allocation to 200 GB, but that still could not support training on 8 GPUs with batch size 56; it ran out of memory. So I used 4 GPUs with batch size 48, and that works! Maybe I should allocate even more memory; I will try that later.

[2022-05-13 23:36:32 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][860/6672] eta 0:47:01 lr 0.000421 time 0.5162 (0.4855) loss 5.1306 (4.9953) grad_norm 2.3225 (2.8221) mem 7785MB
[2022-05-13 23:36:37 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][870/6672] eta 0:46:57 lr 0.000421 time 0.4454 (0.4856) loss 5.0256 (4.9957) grad_norm 3.3165 (2.8220) mem 7785MB
[2022-05-13 23:36:42 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][880/6672] eta 0:46:51 lr 0.000421 time 0.4642 (0.4855) loss 5.5207 (4.9914) grad_norm 2.4980 (2.8192) mem 7785MB
[2022-05-13 23:36:46 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][890/6672] eta 0:46:46 lr 0.000421 time 0.4609 (0.4853) loss 3.8626 (4.9899) grad_norm 2.3808 (2.8147) mem 7785MB
[2022-05-13 23:36:51 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][900/6672] eta 0:46:40 lr 0.000421 time 0.4697 (0.4851) loss 5.3830 (4.9896) grad_norm 2.3411 (2.8122) mem 7785MB
[2022-05-13 23:36:56 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][910/6672] eta 0:46:35 lr 0.000421 time 0.4822 (0.4852) loss 3.9433 (4.9898) grad_norm 2.6711 (2.8111) mem 7785MB
[2022-05-13 23:37:01 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][920/6672] eta 0:46:31 lr 0.000421 time 0.4837 (0.4852) loss 5.8288 (4.9927) grad_norm 2.7145 (2.8098) mem 7785MB
[2022-05-13 23:37:06 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][930/6672] eta 0:46:26 lr 0.000421 time 0.4757 (0.4852) loss 5.0991 (4.9949) grad_norm 2.9705 (2.8072) mem 7785MB
[2022-05-13 23:37:10 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][940/6672] eta 0:46:20 lr 0.000421 time 0.4732 (0.4850) loss 5.0595 (4.9934) grad_norm 3.3814 (2.8068) mem 7785MB
[2022-05-13 23:37:15 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][950/6672] eta 0:46:14 lr 0.000421 time 0.4738 (0.4849) loss 3.7327 (4.9873) grad_norm 2.5955 (2.8050) mem 7785MB
[2022-05-13 23:37:20 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][960/6672] eta 0:46:08 lr 0.000421 time 0.4729 (0.4847) loss 5.1887 (4.9857) grad_norm 2.2804 (2.8031) mem 7785MB
[2022-05-13 23:37:25 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][970/6672] eta 0:46:03 lr 0.000421 time 0.4771 (0.4846) loss 4.8851 (4.9864) grad_norm 3.8150 (2.8028) mem 7785MB
[2022-05-13 23:37:29 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][980/6672] eta 0:45:57 lr 0.000421 time 0.4817 (0.4845) loss 5.2875 (4.9864) grad_norm 2.5579 (2.8007) mem 7785MB
[2022-05-13 23:37:34 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][990/6672] eta 0:45:52 lr 0.000421 time 0.4870 (0.4844) loss 5.1356 (4.9867) grad_norm 2.7010 (2.8004) mem 7785MB
[2022-05-13 23:37:39 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1000/6672] eta 0:45:46 lr 0.000421 time 0.4722 (0.4843) loss 4.7810 (4.9856) grad_norm 3.4885 (2.8016) mem 7785MB
[2022-05-13 23:37:43 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1010/6672] eta 0:45:41 lr 0.000421 time 0.4764 (0.4841) loss 5.5549 (4.9877) grad_norm 2.4293 (2.7994) mem 7785MB
[2022-05-13 23:37:48 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1020/6672] eta 0:45:35 lr 0.000421 time 0.4646 (0.4840) loss 5.3800 (4.9887) grad_norm 2.8842 (2.7980) mem 7785MB
[2022-05-13 23:37:53 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1030/6672] eta 0:45:30 lr 0.000421 time 0.4699 (0.4839) loss 5.5460 (4.9855) grad_norm 2.5153 (2.7954) mem 7785MB

Memory after caching (4 GPUs with batch size 48; I think memory use has nothing to do with batch size, since what fills host RAM is the cached raw image bytes rather than the batches):

root:/workspace/Swin-Transformer# free -g
              total        used        free      shared  buff/cache   available
Mem:            251         181          24           2          45          65
Swap:             0           0           0
root:/workspace/Swin-Transformer#

@ZJLi2013

ZJLi2013 commented Jun 6, 2022

I wonder how to generate the labels for the zipped ImageNet, e.g. train_map.txt?
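
Not an authoritative answer, but a minimal sketch of one way to generate such a map file, assuming the format is one "relative/path/in/zip<TAB>class_index" line per image with class indices assigned from the sorted synset folder names (please verify against the repo's prepared archives before training with it):

import os

# Hypothetical helper: writes train_map.txt with lines of the form
# "<synset>/<image>.JPEG\t<class_index>". The index assignment (sorted
# synset folder names, as torchvision's ImageFolder does) is an assumption.
def write_map(train_dir, out_path="train_map.txt"):
    classes = sorted(d for d in os.listdir(train_dir)
                     if os.path.isdir(os.path.join(train_dir, d)))
    class_to_idx = {c: i for i, c in enumerate(classes)}
    with open(out_path, "w") as f:
        for c in classes:
            for name in sorted(os.listdir(os.path.join(train_dir, c))):
                f.write(f"{c}/{name}\t{class_to_idx[c]}\n")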
