zipped ImageNet processing scripts #192

Open
gongjingcs opened this issue Apr 18, 2022 · 4 comments

@gongjingcs

Hi, can you provide the processing scripts for the zipped ImageNet format?
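
For reference (not an official script), a minimal sketch of one way to build the zipped layout from a standard train/<synset>/<image>.JPEG tree: pack each split into a single zip whose member names keep the relative paths. The archive names below (train.zip, val.zip) are assumptions and should be checked against the repo's data-preparation notes.

import os
import zipfile

# Rough sketch only: packs an ImageFolder-style split such as
# train/<synset>/<image>.JPEG into one zip whose member names keep the
# <synset>/<image>.JPEG relative paths.
def zip_split(split_dir, zip_path):
    # JPEGs are already compressed, so ZIP_STORED keeps packing fast
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_STORED) as zf:
        for synset in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, synset)
            if not os.path.isdir(class_dir):
                continue
            for name in sorted(os.listdir(class_dir)):
                zf.write(os.path.join(class_dir, name), arcname=f"{synset}/{name}")

# Assumed usage:
# zip_split("imagenet/train", "train.zip")
# zip_split("imagenet/val", "val.zip")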

@OxInsky

OxInsky commented May 13, 2022

+1. I found that the data preparation needs more than 100 GB of memory when training Swin-Transformer, which is hard to believe. Here is some information about the problem.

Conditions:
TAG: default
TEST:
  CROP: true
  SEQUENTIAL: false
THROUGHPUT_MODE: false
TRAIN:
  ACCUMULATION_STEPS: 0
  AUTO_RESUME: true
  BASE_LR: 0.0004375
  CLIP_GRAD: 5.0
  EPOCHS: 300
  LR_SCHEDULER:
    DECAY_EPOCHS: 30
    DECAY_RATE: 0.1
    NAME: cosine
  MIN_LR: 4.3750000000000005e-06
  OPTIMIZER:
    BETAS:
    - 0.9
    - 0.999
    EPS: 1.0e-08
    MOMENTUM: 0.9
    NAME: adamw
  START_EPOCH: 0
  USE_CHECKPOINT: false
  WARMUP_EPOCHS: 20
  WARMUP_LR: 4.375e-07
  WEIGHT_DECAY: 0.05
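
As a side note on the config above: the odd-looking BASE_LR is what you get by linearly scaling the default learning rates by the total batch size over 512 (8 GPUs x batch 56 = 448). A quick sketch of the arithmetic; the 5e-4 / 5e-7 / 5e-6 defaults and the 512 divisor are taken from the Swin-Transformer configs as I understand them, so treat them as assumptions:

# Sketch of the linear LR scaling implied by the dump above (assumed rule).
base_lr, warmup_lr, min_lr = 5e-4, 5e-7, 5e-6   # repo defaults (assumed)
batch_per_gpu, world_size = 56, 8               # values from this run

scale = batch_per_gpu * world_size / 512.0      # 448 / 512 = 0.875
print(base_lr * scale)    # ~4.375e-04 -> BASE_LR
print(warmup_lr * scale)  # ~4.375e-07 -> WARMUP_LR
print(min_lr * scale)     # ~4.375e-06 -> MIN_LR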

global_rank 6 cached 0/1281167 takes 0.00s per block
global_rank 3 cached 0/1281167 takes 0.00s per block
global_rank 5 cached 0/1281167 takes 0.00s per block
global_rank 2 cached 0/1281167 takes 0.00s per block
global_rank 0 cached 0/1281167 takes 0.00s per block
global_rank 7 cached 0/1281167 takes 0.00s per block
global_rank 1 cached 0/1281167 takes 0.00s per block
global_rank 4 cached 0/1281167 takes 0.00s per block
global_rank 6 cached 128116/1281167 takes 52.54s per block
global_rank 5 cached 128116/1281167 takes 52.40s per block
global_rank 4 cached 128116/1281167 takes 51.70s per block
global_rank 7 cached 128116/1281167 takes 52.25s per block
global_rank 0 cached 128116/1281167 takes 52.32s per block
global_rank 2 cached 128116/1281167 takes 52.33s per block
global_rank 3 cached 128116/1281167 takes 52.48s per block
global_rank 1 cached 128116/1281167 takes 52.20s per block
global_rank 0 cached 256232/1281167 takes 25.78s per block
global_rank 7 cached 256232/1281167 takes 25.78s per block
global_rank 6 cached 256232/1281167 takes 25.78s per block
global_rank 3 cached 256232/1281167 takes 25.78s per block
global_rank 4 cached 256232/1281167 takes 25.78s per block
global_rank 5 cached 256232/1281167 takes 25.78s per block
global_rank 2 cached 256232/1281167 takes 25.78s per block
global_rank 1 cached 256232/1281167 takes 25.78s per block

The counter goes up to the number of train/val images. Is it caching the image data or just the file list? My memory runs out, so I think it is the image data (a rough sketch of what that caching looks like is at the end of this comment).

The failure looks like this:
global_rank 3 cached 640580/1281167 takes 28.99s per block
global_rank 0 cached 768696/1281167 takes 27.38s per block
global_rank 1 cached 768696/1281167 takes 27.37s per block
global_rank 2 cached 768696/1281167 takes 27.38s per block
global_rank 4 cached 768696/1281167 takes 27.38s per block
global_rank 7 cached 768696/1281167 takes 27.38s per block
global_rank 5 cached 768696/1281167 takes 27.38s per block
global_rank 6 cached 768696/1281167 takes 27.38s per block
global_rank 2 cached 896812/1281167 takes 27.98s per block
global_rank 6 cached 896812/1281167 takes 27.98s per block
global_rank 7 cached 896812/1281167 takes 27.98s per block
global_rank 4 cached 896812/1281167 takes 27.98s per block
global_rank 0 cached 896812/1281167 takes 27.99s per block
global_rank 1 cached 896812/1281167 takes 27.99s per block
Traceback (most recent call last):
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/anaconda3/envs/swin/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/root/anaconda3/envs/swin/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/swin/bin/python', '-u', 'main.py', '--local_rank=7', '--cfg', 'configs/swin_small_patch4_window7_224.yaml', '--output=/root/tuantuan1/model/swin/', '--zip', '--cache-mode', 'part', '--data-path', '/root/tuantuan1/data/ImageNet-Zip', '--batch-size', '56']' died with <Signals.SIGKILL: 9>.
(swin) root@a41cbab8ac5e:~/workspace/Swin-Transformer# global_rank 2 cached 1024928/1281167 takes 32.62s per block
global_rank 7 cached 1024928/1281167 takes 32.62s per block
global_rank 1 cached 1024928/1281167 takes 32.61s per block
global_rank 4 cached 1024928/1281167 takes 32.63s per block
global_rank 6 cached 1024928/1281167 takes 32.63s per block
global_rank 4 cached 1153044/1281167 takes 31.60s per block

The OOM kill shows up in dmesg; command and output:

dmesg -T | grep -E -i -B100 'killed process'
[Fri May 13 22:25:33 2022] Memory cgroup stats for /docker/a41cbab8ac5e0d3680e664e26bcc8890070bf4607a0744f00051fcc046a9ca1b: cache:345232KB rss:104203124KB rss_huge:0KB shmem:342340KB mapped_file:344784KB dirty:264KB writeback:0KB inactive_anon:62144KB active_anon:104487672KB inactive_file:1508KB active_file:40KB unevictable:0KB
[Fri May 13 22:25:33 2022] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Fri May 13 22:25:33 2022] [56913] 0 56913 271 1 32768 0 1000 docker-init
[Fri May 13 22:25:33 2022] [56985] 0 56985 4540 781 81920 0 1000 bash
[Fri May 13 22:25:33 2022] [ 538] 0 538 4540 514 73728 0 1000 bash
[Fri May 13 22:25:33 2022] [ 540] 0 540 1094 162 57344 0 1000 sleep
[Fri May 13 22:25:33 2022] [50918] 0 50918 16378 984 163840 0 1000 sshd
[Fri May 13 22:25:33 2022] [29230] 0 29230 23231 1686 212992 0 1000 sshd
[Fri May 13 22:25:33 2022] [29337] 0 29337 3220 485 69632 0 1000 sftp-server
[Fri May 13 22:25:33 2022] [52248] 0 52248 23235 1731 217088 0 1000 sshd
[Fri May 13 22:25:33 2022] [52268] 0 52268 4621 887 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [21352] 0 21352 12791 2062 131072 0 1000 vim
[Fri May 13 22:25:33 2022] [25338] 0 25338 12791 2081 135168 0 1000 vim
[Fri May 13 22:25:33 2022] [17019] 0 17019 2404 654 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [17274] 0 17274 2404 631 65536 0 1000 bash
[Fri May 13 22:25:33 2022] [ 2912] 0 2912 23199 1720 225280 0 1000 sshd
[Fri May 13 22:25:33 2022] [ 2926] 0 2926 4611 875 77824 0 1000 bash
[Fri May 13 22:25:33 2022] [30026] 0 30026 14506230 5336525 44224512 0 1000 python
[Fri May 13 22:25:33 2022] [30027] 0 30027 14499677 5329772 44138496 0 1000 python
[Fri May 13 22:25:33 2022] [30029] 0 30029 14553669 5350753 44294144 0 1000 python
[Fri May 13 22:25:33 2022] [30031] 0 30031 14511896 5342122 44257280 0 1000 python
[Fri May 13 22:25:33 2022] [30032] 0 30032 14511890 5342005 44220416 0 1000 python
[Fri May 13 22:25:33 2022] [44777] 0 44777 1094 167 53248 0 1000 sleep
[Fri May 13 22:25:33 2022] Memory cgroup out of memory: Kill process 30029 (python) score 1198 or sacrifice child
[Fri May 13 22:25:33 2022] Killed process 30029 (python) total-vm:58214676kB, anon-rss:20876164kB, file-rss:439200kB, shmem-rss:87648kB

My memory information is as follows (swap is supposed to equal the memory, but it was limited to 0 in the container):

root@a41cbab8ac5e:# free -g
              total        used        free      shared  buff/cache   available
Mem:            251          91         123           0          36         157
Swap:             0           0           0
root@a41cbab8ac5e:#
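
For what it's worth, the growing "cached i/total" counter and the OOM are consistent with the zip loader keeping raw JPEG bytes in host memory when --cache-mode part is used. Below is an illustrative reconstruction of that behaviour, not the repo's actual CachedImageFolder code; the per-rank sharding rule is an assumption:

import zipfile

# Illustrative sketch only: approximates the behaviour behind the
# "global_rank N cached i/total" messages above, not the real Swin code.
def cache_shard(zip_path, member_names, rank, world_size):
    cache = {}  # member name -> raw JPEG bytes, held in host RAM
    with zipfile.ZipFile(zip_path) as zf:
        for i, name in enumerate(member_names):
            # assumed "part" behaviour: each rank keeps ~1/world_size of the
            # images as raw bytes, so the ranks on one node together hold
            # roughly the whole train set regardless of batch size
            if i % world_size == rank:
                cache[name] = zf.read(name)
    return cache

If that is what is happening, the "no" value that the --cache-mode flag appears to accept would trade the host RAM for repeated reads from the zip.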

@OxInsky

OxInsky commented May 13, 2022

Please help me, thanks very much!

@OxInsky

OxInsky commented May 13, 2022

I succeeded! I increased my memory allocation to 200 GB, but that still could not support training on 8 GPUs with batch size 56; it ran out of memory. So I used 4 GPUs with batch size 48, and that works! Maybe I should allocate even more memory; I will try that later.

[2022-05-13 23:36:32 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][860/6672] eta 0:47:01 lr 0.000421 time 0.5162 (0.4855) loss 5.1306 (4.9953) grad_norm 2.3225 (2.8221) mem 7785MB
[2022-05-13 23:36:37 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][870/6672] eta 0:46:57 lr 0.000421 time 0.4454 (0.4856) loss 5.0256 (4.9957) grad_norm 3.3165 (2.8220) mem 7785MB
[2022-05-13 23:36:42 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][880/6672] eta 0:46:51 lr 0.000421 time 0.4642 (0.4855) loss 5.5207 (4.9914) grad_norm 2.4980 (2.8192) mem 7785MB
[2022-05-13 23:36:46 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][890/6672] eta 0:46:46 lr 0.000421 time 0.4609 (0.4853) loss 3.8626 (4.9899) grad_norm 2.3808 (2.8147) mem 7785MB
[2022-05-13 23:36:51 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][900/6672] eta 0:46:40 lr 0.000421 time 0.4697 (0.4851) loss 5.3830 (4.9896) grad_norm 2.3411 (2.8122) mem 7785MB
[2022-05-13 23:36:56 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][910/6672] eta 0:46:35 lr 0.000421 time 0.4822 (0.4852) loss 3.9433 (4.9898) grad_norm 2.6711 (2.8111) mem 7785MB
[2022-05-13 23:37:01 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][920/6672] eta 0:46:31 lr 0.000421 time 0.4837 (0.4852) loss 5.8288 (4.9927) grad_norm 2.7145 (2.8098) mem 7785MB
[2022-05-13 23:37:06 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][930/6672] eta 0:46:26 lr 0.000421 time 0.4757 (0.4852) loss 5.0991 (4.9949) grad_norm 2.9705 (2.8072) mem 7785MB
[2022-05-13 23:37:10 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][940/6672] eta 0:46:20 lr 0.000421 time 0.4732 (0.4850) loss 5.0595 (4.9934) grad_norm 3.3814 (2.8068) mem 7785MB
[2022-05-13 23:37:15 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][950/6672] eta 0:46:14 lr 0.000421 time 0.4738 (0.4849) loss 3.7327 (4.9873) grad_norm 2.5955 (2.8050) mem 7785MB
[2022-05-13 23:37:20 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][960/6672] eta 0:46:08 lr 0.000421 time 0.4729 (0.4847) loss 5.1887 (4.9857) grad_norm 2.2804 (2.8031) mem 7785MB
[2022-05-13 23:37:25 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][970/6672] eta 0:46:03 lr 0.000421 time 0.4771 (0.4846) loss 4.8851 (4.9864) grad_norm 3.8150 (2.8028) mem 7785MB
[2022-05-13 23:37:29 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][980/6672] eta 0:45:57 lr 0.000421 time 0.4817 (0.4845) loss 5.2875 (4.9864) grad_norm 2.5579 (2.8007) mem 7785MB
[2022-05-13 23:37:34 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][990/6672] eta 0:45:52 lr 0.000421 time 0.4870 (0.4844) loss 5.1356 (4.9867) grad_norm 2.7010 (2.8004) mem 7785MB
[2022-05-13 23:37:39 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1000/6672] eta 0:45:46 lr 0.000421 time 0.4722 (0.4843) loss 4.7810 (4.9856) grad_norm 3.4885 (2.8016) mem 7785MB
[2022-05-13 23:37:43 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1010/6672] eta 0:45:41 lr 0.000421 time 0.4764 (0.4841) loss 5.5549 (4.9877) grad_norm 2.4293 (2.7994) mem 7785MB
[2022-05-13 23:37:48 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1020/6672] eta 0:45:35 lr 0.000421 time 0.4646 (0.4840) loss 5.3800 (4.9887) grad_norm 2.8842 (2.7980) mem 7785MB
[2022-05-13 23:37:53 swin_small_patch4_window7_224](main.py 229): INFO Train: [16/300][1030/6672] eta 0:45:30 lr 0.000421 time 0.4699 (0.4839) loss 5.5460 (4.9855) grad_norm 2.5153 (2.7954) mem 7785MB

Memory after caching (4 GPUs with batch size 48; I think memory use has nothing to do with batch size, since what fills host RAM is the cached raw image bytes rather than the batches):

root:/workspace/Swin-Transformer# free -g
              total        used        free      shared  buff/cache   available
Mem:            251         181          24           2          45          65
Swap:             0           0           0
root:/workspace/Swin-Transformer#

@ZJLi2013

ZJLi2013 commented Jun 6, 2022

I wonder how to generate the labels for the zipped ImageNet, e.g. train_map.txt?
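
Not an authoritative answer, but a minimal sketch of one way to generate such a map file, assuming the format is one "relative/path/in/zip<TAB>class_index" line per image with class indices assigned from the sorted synset folder names (please verify against the repo's prepared archives before training with it):

import os

# Hypothetical helper: writes train_map.txt with lines of the form
# "<synset>/<image>.JPEG\t<class_index>". The index assignment (sorted
# synset folder names, as torchvision's ImageFolder does) is an assumption.
def write_map(train_dir, out_path="train_map.txt"):
    classes = sorted(d for d in os.listdir(train_dir)
                     if os.path.isdir(os.path.join(train_dir, d)))
    class_to_idx = {c: i for i, c in enumerate(classes)}
    with open(out_path, "w") as f:
        for c in classes:
            for name in sorted(os.listdir(os.path.join(train_dir, c))):
                f.write(f"{c}/{name}\t{class_to_idx[c]}\n")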
