This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

training process halt in few epochs #109

Open
tzzcl opened this issue Oct 2, 2016 · 2 comments

Comments

@tzzcl

tzzcl commented Oct 2, 2016

When I use the given example to train ResNet, the training sometimes halts at an iteration and never advances to the next sub-batch.

I use this command to train the network:
CUDA_VISIBLE_DEVICES=12,13,14,15 OMP_NUM_THREADS=4 th main.lua -data /opt/xxx/images/ -resume checkpoints -batchSize 128 -nThreads 8 -nGPU 4 -shareGradInput true

Has anybody encountered the same problem?

@colesbury
Contributor

Do you have a stack trace from when it crashes?
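When the process freezes rather than crashes, there is no stack trace to collect; one way to get one anyway is to attach gdb to the hung process and dump every thread's backtrace (data-loader threads included). This is a minimal sketch, assuming the training process is still alive; the `pgrep` pattern `main.lua` matches the launch command above, so adjust it if your invocation differs.

```shell
# Find the PID of the Torch training process; the 'main.lua' pattern
# matches the `th main.lua ...` command shown above.
pid=$(pgrep -f 'main.lua' | head -n 1)

# Attach gdb non-interactively and dump every thread's stack.
# -batch makes gdb run the -ex command, detach, and exit, so the
# training process is left running (still hung, but undisturbed).
if [ -n "$pid" ]; then
  gdb -p "$pid" -batch -ex 'thread apply all bt' > stacktrace.txt 2>&1
fi
```

If the dump shows the loader threads blocked on a pipe or mutex while the main thread waits for the next batch, the hang is likely in the data-loading path rather than in the CUDA/multi-GPU code.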

@DmitryUlyanov

I have a similar issue; in my case no error appears at all, it just freezes.
