This repository has been archived by the owner on Oct 30, 2019. It is now read-only.

input and gradOutput have different number of elements when using multi-GPU #143

Open
kuixu opened this issue Dec 13, 2016 · 4 comments

@kuixu

kuixu commented Dec 13, 2016

Dear All,

The code runs smoothly without error on my own dataset with a single GPU and with 4 GPUs, but with 8 GPUs it fails with the error below. I tried many times and it failed with the same error each time. Why is gradOutput reduced? Is something wrong with the data?

 | Epoch: [1][87/92]    Time 6.657  Data 0.000  Err 1.2427  top1 100.000  top5  96.904
 | Epoch: [1][88/92]    Time 6.647  Data 0.000  Err 1.2477  top1 100.000  top5  97.309
 | Epoch: [1][89/92]    Time 6.410  Data 0.000  Err 1.1976  top1 100.000  top5  97.244
 | Epoch: [1][90/92]    Time 5.759  Data 0.000  Err 1.1808  top1 100.000  top5  97.732
 | Epoch: [1][91/92]    Time 6.149  Data 0.000  Err 1.1948  top1 100.000  top5  97.493
/home/scs4850/torch/install/bin/luajit: .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 19 module of nn.Sequential:
/home/scs4850/torch/install/share/lua/5.1/nn/THNN.lua:110: input and gradOutput have different number of elements: input[135000 x 26] has 3510000 elements, while gradOutput[121500 x 26] has 3159000 elements at /tmp/luarocks_cunn-scm-1-6007/cunn/lib/THCUNN/generic/Threshold.cu:44
stack traceback:
        [C]: in function 'v'
        /home/scs4850/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'Threshold_updateGradInput'
        /home/scs4850/torch/install/share/lua/5.1/nn/Threshold.lua:32: in function 'updateGradInput'
        /home/scs4850/torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/scs4850/torch/install/share/lua/5.1/nn/Module.lua:29>
        [C]: in function 'xpcall'
        /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /home/scs4850/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:84: in function </home/scs4850/torch/install/share/lua/5.1/nn/Sequential.lua:78>
        [C]: in function 'xpcall'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/scs4850/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
        [C]: in function 'error'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        .../scs4850/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
        ...0/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:717: in function 'exec'
        ...0/torch/install/share/lua/5.1/cunn/DataParallelTable.lua:229: in function 'backward'
        ./train.lua:77: in function 'train'
        main.lua:50: in main chunk
        [C]: in function 'dofile'
        ...4850/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
        [C]: at 0x00406670

Thanks

@colesbury
Contributor

That's strange. Have you changed some other part of the code or model?

@kuixu
Author

kuixu commented Dec 21, 2016

@colesbury Yes, I modified the code for my own project, but only the dimensions of the related variables. The strange thing is that it runs smoothly without error with a single GPU and with 4 GPUs, as I mentioned above.

@colesbury
Contributor

It looks like it's happening at the end of an epoch. Maybe something isn't handling odd-sized batches?
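For anyone hitting this later, here is a minimal sketch of why an odd-sized final batch can behave differently on 8 GPUs than on 4. It models DataParallelTable-style splitting of a batch across GPUs; the helper and the sample numbers are hypothetical, not taken from this report. Python is used purely for illustration.

```python
def chunk_sizes(batch, n_gpus):
    """Split `batch` samples as evenly as possible across `n_gpus` chunks."""
    base, rem = divmod(batch, n_gpus)
    # The first `rem` GPUs each get one extra sample.
    return [base + (1 if i < rem else 0) for i in range(n_gpus)]

# An assumed final batch of 108 samples splits evenly across 4 GPUs:
print(chunk_sizes(108, 4))  # -> [27, 27, 27, 27]
# ...but unevenly across 8, so per-GPU tensor shapes no longer all match:
print(chunk_sizes(108, 8))  # -> [14, 14, 14, 14, 13, 13, 13, 13]
```

If a module on one GPU caches buffers sized for a full-sized chunk, then receives a smaller chunk from the partial last batch, the forward input and backward gradOutput can end up with different element counts, which matches the error message above.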

@cardwing

cardwing commented Aug 1, 2018

@colesbury, that is the exact cause. You can fix the problem by making the number of training samples divisible by the batch size. Thanks!
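As a sketch of that fix (a hypothetical helper, not code from this repo): trim the dataset so the final batch is never partial. Python for illustration; the sample numbers are assumed.

```python
def trim_to_batch_multiple(num_samples, batch_size):
    """Drop trailing samples so every batch is exactly batch_size."""
    return num_samples - (num_samples % batch_size)

# e.g. with an assumed dataset of 2950 samples and a batch size of 32:
print(trim_to_batch_multiple(2950, 32))  # -> 2944 (the last 6 samples are dropped)
```

The same effect can be had by dropping the last partial batch in the data loader instead of trimming the dataset itself.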
