This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

input and gradOutput have different number of elements when using multi-GPU #143

Open
kuixu opened this issue Dec 13, 2016 · 4 comments

@kuixu

kuixu commented Dec 13, 2016

Dear All,

The code runs smoothly without error on my own dataset with a single GPU and with 4 GPUs, but with 8 GPUs it fails with the error below. I tried many times and it failed with the same error each time. Why is gradOutput reduced? Is something wrong with the data?

 | Epoch: [1][87/92]    Time 6.657  Data 0.000  Err 1.2427  top1 100.000  top5  96.904
 | Epoch: [1][88/92]    Time 6.647  Data 0.000  Err 1.2477  top1 100.000  top5  97.309
 | Epoch: [1][89/92]    Time 6.410  Data 0.000  Err 1.1976  top1 100.000  top5  97.244
 | Epoch: [1][90/92]    Time 5.759  Data 0.000  Err 1.1808  top1 100.000  top5  97.732
 | Epoch: [1][91/92]    Time 6.149  Data 0.000  Err 1.1948  top1 100.000  top5  97.493
/home/scs4850/torch/install/bin/luajit: .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 19 module of nn.Sequential:
/home/scs4850/torch/install/share/lua/5.1/nn/THNN.lua:110: input and gradOutput have different number of elements: input[135000 x 26] has 3510000 elements, while gradOutput[121500 x 26] has 3159000 elements at /tmp/luarocks_cunn-scm-1-6007/cunn/lib/THCUNN/generic/Threshold.cu:44
stack traceback:
        [C]: in function 'v'
        /home/scs4850/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Threshold_updateGradInput'
        /home/scs4850/torch/install/share/lua/5.1/nn/Threshold.lua:32: in function 'updateGradInput'
        /home/scs4850/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/scs4850/torch/install/share/lua/5.1/nn/Module.lua:29>
        [C]: in function 'xpcall'
        /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
        [C]: in function 'error'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
        ...0/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:717: in function 'exec'
        ...0/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:229: in function 'backward'
        ./train.lua:77: in function 'train'
        main.lua:50: in main chunk
        [C]: in function 'dofile'
        ...4850/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Thanks

@colesbury
Contributor

That's strange. Have you changed some other part of the code or model?

@kuixu
Author

kuixu commented Dec 21, 2016

@colesbury Yes, I modified the code for my own project, but only the dimensions of the related variables. The strange thing is that it runs smoothly without error with a single GPU and with 4 GPUs, as I mentioned above.

@colesbury
Contributor

It looks like it's happening at the end of an epoch. Maybe something isn't handling odd-sized batches?
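For anyone hitting this later, here is a minimal sketch of why an odd-sized final batch can behave differently on 8 GPUs than on 4. It models DataParallelTable-style splitting of a batch across GPUs; the helper and the sample numbers are hypothetical, not taken from this report. Python is used purely for illustration.

```python
def chunk_sizes(batch, n_gpus):
    """Split `batch` samples as evenly as possible across `n_gpus` chunks."""
    base, rem = divmod(batch, n_gpus)
    # The first `rem` GPUs each get one extra sample.
    return [base + (1 if i < rem else 0) for i in range(n_gpus)]

# An assumed final batch of 108 samples splits evenly across 4 GPUs:
print(chunk_sizes(108, 4))  # -> [27, 27, 27, 27]
# ...but unevenly across 8, so per-GPU tensor shapes no longer all match:
print(chunk_sizes(108, 8))  # -> [14, 14, 14, 14, 13, 13, 13, 13]
```

If a module on one GPU caches buffers sized for a full-sized chunk, then receives a smaller chunk from the partial last batch, the forward input and backward gradOutput can end up with different element counts, which matches the error message above.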

@cardwing

cardwing commented Aug 1, 2018

@colesbury, that is the exact cause. You can fix the problem by making the number of training samples divisible by the batch size. Thanks!
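As a sketch of that fix (a hypothetical helper, not code from this repo): trim the dataset so the final batch is never partial. Python for illustration; the sample numbers are assumed.

```python
def trim_to_batch_multiple(num_samples, batch_size):
    """Drop trailing samples so every batch is exactly batch_size."""
    return num_samples - (num_samples % batch_size)

# e.g. with an assumed dataset of 2950 samples and a batch size of 32:
print(trim_to_batch_multiple(2950, 32))  # -> 2944 (the last 6 samples are dropped)
```

The same effect can be had by dropping the last partial batch in the data loader instead of trimming the dataset itself.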
