porting cxxnet data science bowl 1 example to mxnet #1143

Closed
Lopezurrutia opened this issue Jan 3, 2016 · 13 comments

@Lopezurrutia
Contributor

I am trying to port the example code in cxxnet for Kaggle Data Science Bowl (1) to mxnet: https://github.com/dmlc/cxxnet/tree/master/example/kaggle_bowl

I have most things working, but I do not understand something in the bowl.conf for cxxnet (excuse my ignorance, I am new to deep learning; I am a marine biologist with an interest in image classification).

Although the images are scaled to 48x48, the bowl.conf file has "input_shape = 3,40,40". Why is that?
In mxnet, I have rescaled all images to 48x48, but if I use
data_shape = (3, 48, 48)
when I define the ImageRecordIter as:

def get_iterator(args, kv):
    data_shape = (3, 48, 48)
    train = mx.io.ImageRecordIter(
        path_imgrec = args.data_dir + "tr.rec",
        mean_img    = args.data_dir + "mean.bin",
        data_shape  = data_shape,
        batch_size  = args.batch_size,
        rand_crop   = True,
        rand_mirror = True,
        max_rotate_angle=180,
        max_aspect_ratio = 0.5,
        max_shear_ratio = 0.3,
        min_crop_size=32,
        max_crop_size=48,
        num_parts   = kv.num_workers,
        part_index  = kv.rank)

    val = mx.io.ImageRecordIter(
        path_imgrec = args.data_dir + "va.rec",
        mean_img    = args.data_dir + "mean.bin",
        data_shape  = data_shape,
        batch_size  = args.batch_size,
        num_parts   = kv.num_workers,
        part_index  = kv.rank)
    return (train, val)

I get train/val accuracies of 0.44/0.59, while if I use data_shape = (3, 40, 40) I get 0.58/0.59.
I guess it has something to do with the crop sizes; could someone explain how to set the correct data_shape?
Thanks,
Angel

This is the network structure I am using (translated from bowl.conf):


def get_symbol(num_classes = 121):
    input_data = mx.symbol.Variable(name="data")
    conv1 = mx.symbol.Convolution(
        data=input_data, kernel=(4, 4), stride=(1, 1), num_filter=48, pad=(2, 2))
    relu1 = mx.symbol.Activation(data=conv1, act_type="relu")
    pool1 = mx.symbol.Pooling(
        data=relu1, pool_type="max", kernel=(3, 3), stride=(2,2))
    ###############                                                                                                                                                             
    conv2 = mx.symbol.Convolution(
        data=pool1, kernel=(3, 3), pad=(1, 1), stride=(1, 1), num_filter=96)
    relu2 = mx.symbol.Activation(data=conv2, act_type="relu")
    conv3 = mx.symbol.Convolution(
        data=relu2, kernel=(3, 3), pad=(1, 1), stride=(1, 1), num_filter=96)
    relu3 = mx.symbol.Activation(data=conv3, act_type="relu")
    pool2 = mx.symbol.Pooling(data=relu3, kernel=(3, 3), stride=(2, 2), pool_type="max")
    ###############                                                                                                                                                             
    conv4 = mx.symbol.Convolution(
        data=pool2, kernel=(2, 2), stride=(1, 1), num_filter=128)
    relu4 = mx.symbol.Activation(data=conv4, act_type="relu")
    conv5 = mx.symbol.Convolution(
        data=relu4, kernel=(3, 3), stride=(1, 1), num_filter=128)
    pool3 = mx.symbol.Pooling(data=conv5, kernel=(3, 3), stride=(2, 2), pool_type="max")
    ##                                                                                                                                                                          
    flatten = mx.symbol.Flatten(data=pool3)
    fc1 = mx.symbol.FullyConnected(data=flatten, num_hidden=256)
    dropout1 = mx.symbol.Dropout(data=fc1, p=0.5)
    fc2 = mx.sym.FullyConnected(data=dropout1, num_hidden=num_classes)
    softmax = mx.symbol.SoftmaxOutput(data=fc2, name='softmax')
    return softmax
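
For reference, a minimal training sketch that ties get_iterator and get_symbol together with the FeedForward model API might look like the following (the hyperparameters, the mx.gpu(0) context, and the placeholder args are assumptions, not values taken from bowl.conf):

# Minimal sketch, not from bowl.conf: hyperparameters, context and args are placeholders.
import argparse
import logging

import mxnet as mx

logging.basicConfig(level=logging.INFO)

args = argparse.Namespace(data_dir="./data/", batch_size=64)  # placeholder arguments
kv = mx.kvstore.create("local")                               # single-machine kvstore

train, val = get_iterator(args, kv)      # the iterators defined above
net = get_symbol(num_classes=121)        # the NDSB data has 121 classes

model = mx.model.FeedForward(
    symbol=net,
    ctx=mx.gpu(0),                       # or mx.cpu()
    num_epoch=35,
    learning_rate=0.01,
    momentum=0.9,
    wd=0.0001,
    initializer=mx.init.Xavier(factor_type="in", magnitude=2.34))

model.fit(
    X=train,
    eval_data=val,
    kvstore=kv,
    batch_end_callback=mx.callback.Speedometer(args.batch_size, 50))
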
@antinucleon
Contributor

Hi @Gelu74,

I will probably make an example of the original sample today and post it here. After you successfully run it, I recommend trying the example here: https://github.com/auroraxie/Kaggle-NDSB

@Lopezurrutia
Contributor Author

Thanks @antinucleon.
Looking forward to it. Yes, the solution by auroraxie was my next step; I am trying to understand the simpler model first.

@antinucleon
Contributor

@Gelu74
Here is a new example. You may also find the MXNet documentation very useful: https://mxnet.readthedocs.org/en/latest/

@antinucleon
Contributor

@Lopezurrutia
Contributor Author

Thanks a lot @antinucleon for taking the time to write the example.
It works great, except that the second sed command throws an error on my system; I think it should be:
sed -n '20001, 30337p' train.lst > va.lst (i.e. without the "p" after 20001)
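
For what it's worth, a plain-Python equivalent of that split (assuming the first command takes lines 1-20000 of train.lst for tr.lst and the remainder for va.lst) would be something like:

# Sketch: split train.lst into tr.lst (lines 1-20000) and va.lst (the rest),
# mirroring the two sed commands in the example.
with open("train.lst") as f:
    lines = f.readlines()
with open("tr.lst", "w") as f:
    f.writelines(lines[:20000])
with open("va.lst", "w") as f:
    f.writelines(lines[20000:])
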
The training runs fine, but I am reaching very different accuracies from run to run. I guess it is due to the random initialization of the network ending up in different local minima.

These are my results for 5 consecutive runs:

INFO:root:Epoch[34] Train-accuracy=0.539437
INFO:root:Epoch[34] Time cost=105.434
INFO:root:Epoch[34] Validation-accuracy=0.576292

INFO:root:Epoch[34] Train-accuracy=0.065096
INFO:root:Epoch[34] Time cost=103.263
INFO:root:Epoch[34] Validation-accuracy=0.061150

INFO:root:Epoch[34] Train-accuracy=0.570038
INFO:root:Epoch[34] Time cost=36.997
INFO:root:Epoch[34] Validation-accuracy=0.604167

INFO:root:Epoch[34] Train-accuracy=0.065096
INFO:root:Epoch[34] Time cost=36.892
INFO:root:Epoch[34] Validation-accuracy=0.061150

INFO:root:Epoch[34] Train-accuracy=0.601637
INFO:root:Epoch[34] Time cost=36.978
INFO:root:Epoch[34] Validation-accuracy=0.630112

@antinucleon
Contributor

That's strange; there should not be such a large gap. Are you using cuDNN? Which card are you using?

@Lopezurrutia
Contributor Author

Yes, cuDNN.
Some info about my system:
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

lspci | grep -i nvidia
02:00.0 3D controller: NVIDIA Corporation GK110BGL [Tesla K40m] (rev a1)

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17

@antinucleon
Contributor

cuDNN may produce slightly different output, but not such a large gap. Could you try adding this line at the beginning of the file? mx.random.seed(8964)

@Lopezurrutia
Contributor Author

With that random seed, I did not get the low-accuracy results in four runs, although I am still getting quite a bit of variability:

INFO:root:Epoch[34] Train-accuracy=0.550869
INFO:root:Epoch[34] Time cost=37.199
INFO:root:Epoch[34] Validation-accuracy=0.598765

INFO:root:Epoch[34] Train-accuracy=0.510184
INFO:root:Epoch[34] Time cost=37.007
INFO:root:Epoch[34] Validation-accuracy=0.547164

INFO:root:Epoch[34] Train-accuracy=0.460114
INFO:root:Epoch[34] Time cost=37.002
INFO:root:Epoch[34] Validation-accuracy=0.502990

INFO:root:Epoch[34] Train-accuracy=0.538588
INFO:root:Epoch[34] Time cost=36.958
INFO:root:Epoch[34] Validation-accuracy=0.580633

I thought that by setting the random seed there wouldn't be any variability... where does the randomness come from?

@antinucleon
Contributor

Try building without cuDNN; cuDNN doesn't guarantee reproducible results.
Also, in the iterator, we should set a seed, since we are using random augmentation.
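
As a rough sketch, something like this (the seed argument on ImageRecordIter is meant to seed the random augmenter; whether it is exposed may depend on your mxnet build, and the other parameters are just copied from the iterator above):

# Sketch: fix both the weight-initialization RNG and the augmentation RNG.
import mxnet as mx

mx.random.seed(8964)                   # seeds weight initialization

train = mx.io.ImageRecordIter(
    path_imgrec="tr.rec",
    mean_img="mean.bin",
    data_shape=(3, 40, 40),
    batch_size=64,
    rand_crop=True,
    rand_mirror=True,
    max_rotate_angle=180,
    seed=8964)                         # seeds the random augmenter, if supported
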

@Lopezurrutia
Contributor Author

Sorry, there must have been an error with my cuDNN installation, and I guess mxnet was not using cuDNN.
I am trying to reinstall now, but I am encountering some errors building mxnet with cuDNN support. I will open a separate issue.
Thanks!

@antinucleon
Contributor

No, it is not a problem on your side: it is known that cuDNN's fastest mode does not produce deterministic results.

@Lopezurrutia
Contributor Author

Hmm, I am now not sure whether I was running with cuDNN at all... I have reinstalled my system and am not able to compile with cuDNN support now; see #1207.
