This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

training process halt in few epochs #109

Open
tzzcl opened this issue Oct 2, 2016 · 2 comments

Comments

@tzzcl

tzzcl commented Oct 2, 2016

When I use the given example to train ResNet, the training sometimes halts at an iteration and never advances to the next sub-batch.

I use this command to train the network:
CUDA_VISIBLE_DEVICES=12,13,14,15 OMP_NUM_THREADS=4 th main.lua -data /opt/xxx/images/ -resume checkpoints -batchSize 128 -nThreads 8 -nGPU 4 -shareGradInput true

Has anybody encountered the same problem?

@colesbury
Contributor

Do you have a stack trace from when it crashes?
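When the process freezes rather than crashes, there is no stack trace to collect; one way to get one anyway is to attach gdb to the hung process and dump every thread's backtrace (data-loader threads included). This is a minimal sketch, assuming the training process is still alive; the `pgrep` pattern `main.lua` matches the launch command above, so adjust it if your invocation differs.

```shell
# Find the PID of the Torch training process; the 'main.lua' pattern
# matches the `th main.lua ...` command shown above.
pid=$(pgrep -f 'main.lua' | head -n 1)

# Attach gdb non-interactively and dump every thread's stack.
# -batch makes gdb run the -ex command, detach, and exit, so the
# training process is left running (still hung, but undisturbed).
if [ -n "$pid" ]; then
  gdb -p "$pid" -batch -ex 'thread apply all bt' > stacktrace.txt 2>&1
fi
```

If the dump shows the loader threads blocked on a pipe or mutex while the main thread waits for the next batch, the hang is likely in the data-loading path rather than in the CUDA/multi-GPU code.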

@DmitryUlyanov

I have a similar issue; in my case no error appears at all, it just freezes.
