Nnet1 dropout ivec #1090
Conversation
KarelVesely84 commented Oct 3, 2016 (edited)
- added support for annealed dropout,
- created an example of how to prepare Kaldi i-vectors on fMLLR features (AMI, IHM),
- nice gains on AMI IHM (dev, eval): w/o i-vector 24.2 / 24.5, per-speaker i-vector 23.2 / 22.8 (Dan's lattice-free MMI: 22.4 / 22.4),
- changed the scripts to support the 'annealed dropout',
@@ -22,7 +22,7 @@ feature_transform=
 max_iters=20
 min_iters=0 # keep training, disable weight rejection, start learn-rate halving as usual,
 keep_lr_iters=0 # fix learning rate for N initial epochs, disable weight rejection,
-dropout_iters= # Disable dropout after 'N' initial epochs,
+dropout_schedule= # Dropout schedule for N initial epochs, for example: 0.9,0.9,0.9,0.9,0.9,1.0
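For illustration, here is a small sketch of how such a comma-separated schedule could be expanded into a per-epoch value. The helper names are hypothetical; this is not Kaldi's actual parsing code, and whether the values mean "keep" or "drop" probabilities is exactly what the review discussion below settles.

```python
# Hypothetical sketch: expand a comma-separated dropout schedule
# (e.g. "0.9,0.9,0.9,0.9,0.9,1.0") into per-epoch values.
# Not Kaldi's actual parsing code.

def parse_dropout_schedule(schedule: str):
    """Parse 'v1,v2,...' into a list of floats; empty string -> no schedule."""
    return [float(v) for v in schedule.split(",")] if schedule else []

def dropout_value_for_epoch(schedule, epoch):
    """Return the schedule entry for 'epoch' (0-based); once the
    schedule is exhausted, the final value stays in effect."""
    if not schedule:
        return None  # dropout disabled
    return schedule[min(epoch, len(schedule) - 1)]

values = parse_dropout_schedule("0.9,0.9,0.9,0.9,0.9,1.0")
print(dropout_value_for_epoch(values, 2))   # 0.9
print(dropout_value_for_epoch(values, 10))  # 1.0 (last value persists)
```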
Are these probabilities the probability of dropout, or the probability of not dropping out?
I think when people normally describe dropout, it's the probability of setting the feature to zero, e.g. see
https://pdfs.semanticscholar.org/c2d7/8722ebac92766f1154497d8424108d906ae3.pdf
Perhaps if you renamed this to dropout_retention_schedule it would lead to less confusion?
Dan, how do you have it in nnet2, nnet3? As the probability that a neuron is dropped? It would be good to have it the same way; I will change it then...
(Actually, I was already thinking about it and discussed it with Harish.)
- there's backward compatibility in Dropout::Read()
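The backward compatibility amounts to a simple conversion: older models stored a retention probability ('DropoutRetention'), newer ones a drop probability ('DropoutRate'), and the two are related by rate = 1 - retention. A minimal sketch of that mapping (the function name is hypothetical; this is not the actual C++ in Dropout::Read()):

```python
# Hypothetical sketch of the retention -> rate conversion implied by
# the backward-compatible Dropout::Read(); not the actual Kaldi C++.

def dropout_rate_from_token(token: str, value: float) -> float:
    """Map either a legacy '<DropoutRetention>' value or a new
    '<DropoutRate>' value to a drop probability."""
    if token == "<DropoutRetention>":
        return 1.0 - value   # legacy: probability of KEEPING a neuron
    if token == "<DropoutRate>":
        return value         # new: probability of DROPPING a neuron
    raise ValueError(f"unexpected token {token}")

# A legacy retention of 0.8 and a new rate of 0.2 describe the same model:
legacy = dropout_rate_from_token("<DropoutRetention>", 0.8)
new = dropout_rate_from_token("<DropoutRate>", 0.2)
```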
Hi Karel, Harish
Hi Harish (@mallidi), this is the more standard formulation, as @danpovey pointed out. Given that the values in 'dropout_schedule' are now the probabilities that the neurons are dropped, we can keep the original variable name 'dropout_schedule'. The C++ code knows how to read the older models with 'DropoutRetention' instead of 'DropoutRate', so there is backward compatibility... Is it okay for your needs?
Sure @vesis84. Thanks a lot for the annealed dropout. Harish.
You are welcome ;) What seemed to work well on 'ami-ihm' was a dropout rate of 0.2 for the 5 initial epochs, then switching it to 0.0 (no dropout). This starts the learning-rate decay, as without dropout the cross-entropy immediately increases on the 'cv' data (and massively decreases on the 'tr' data)... This schedule was better than 0.5 0.4 0.3 0.2 0.1 0.0 and some other combinations. There is one detail in the implementation: after applying the dropout mask, the output is up-scaled by 1/(1-p_drop), while the cross-validations are always run without dropout (hard-coded in the training binaries). The up-scaling does a good job; there doesn't seem to be a severe mismatch caused by disabling the dropout in the cross-validation step... [I tried exponentiating the 1/(1-p_drop) factor to make it a little larger/smaller, but this caused a mismatch visible in the 'cv' loss.] K.
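The up-scaling described above is the standard "inverted dropout" trick: surviving activations are multiplied by 1/(1 - p_drop) during training so their expected value matches the dropout-free forward pass used at cross-validation time. A minimal NumPy sketch under that assumption (not the Kaldi implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
p_drop = 0.2                      # drop probability, as in the 0.2-for-5-epochs schedule
x = rng.standard_normal((4, 8))   # a batch of hidden-layer activations

# Training: zero out each unit with probability p_drop, then up-scale
# the survivors by 1/(1 - p_drop) so the expected output matches x.
mask = (rng.random(x.shape) >= p_drop).astype(x.dtype)
y_train = x * mask / (1.0 - p_drop)

# Cross-validation: dropout disabled, plain forward pass (as hard-coded
# in the nnet1 training binaries).
y_cv = x
```

Because of the up-scaling, E[y_train] equals y_cv element-wise, which is why disabling dropout at CV time introduces no systematic scale mismatch.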
@danpovey I am done with the changes; from my side it's ready.