
Usage & concept questions #18

Closed
ChongWu-Biostat opened this issue Oct 27, 2019 · 3 comments
Labels
question Further information is requested

Comments

@ChongWu-Biostat

It works perfectly for me. Thank you for sharing and developing this repo; I think this idea really works (at least for my problem).

Thanks,
Chong

@ChongWu-Biostat
Author

Just a quick question:
Can you explain and provide some guidance about the parameters?

```python
# assuming keras-adamw's exports; adjust the import to match your install
from keras_adamw import AdamW, get_weight_decays, fill_dict_in_order

wd_dict = get_weight_decays(model)  # {'lstm_1/recurrent:0': 0, 'output/kernel:0': 0}
weight_decays = fill_dict_in_order(wd_dict, [4e-4, 1e-4])
# -> {'lstm_1/recurrent:0': 4e-4, 'output/kernel:0': 1e-4}
lr_multipliers = {'lstm_1': 0.5}

optimizer = AdamW(lr=1e-4, weight_decays=weight_decays, lr_multipliers=lr_multipliers,
                  use_cosine_annealing=True, total_iterations=24)
```

If I understand correctly, weight_decays is similar to the L2 penalty. What does lr_multipliers actually stand for? Do its keys have to match the layer names (e.g. "lstm_1")?
What does total_iterations mean?

use_cosine_annealing means we use a large learning rate after some time, right?

Thank you for your help. I think your repo is way better than any other AdamW version in Keras.

@OverLordGoldDragon
Owner

OverLordGoldDragon commented Oct 27, 2019

@ChongWu-Biostat You're welcome, glad you find it useful.

I suppose I'll make a more detailed example in case the README didn't suffice, but for now I'll respond to your questions:


Weight decays vs. L2 penalty

The key difference between weight_decays and the L2 penalty is that the latter is included in the gradient and loss computations, while the former isn't (see the figure in the AdamW paper, "Decoupled Weight Decay Regularization"). It turns out the latter is undesirable, as the L2 penalty gets included in the momentum and RMS (rmsprop) computations, which:

  • Couples (forces a dependency between) the learning rate and lambda (the weight decay rate)
  • Makes weight decay less effective for weight matrices with large gradients
  • Makes weight decay inconsistent across iterations, depending on the gradients

By fixing the weight decay rate and separating it from the loss, all of the above are remedied.
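The contrast can be sketched in a few lines; this is an illustrative single Adam step (not the repo's code, and omitting bias correction), showing how an L2 penalty leaks into the momentum and RMS accumulators while decoupled decay never touches them:

```python
import numpy as np

def adam_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, weight_decay=0.0):
    if l2:                        # L2: penalty enters the gradient,
        grad = grad + l2 * w      # so it also enters m and v below
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    w = w - lr * m / (np.sqrt(v) + eps)
    if weight_decay:              # decoupled: applied after the adaptive
        w = w - weight_decay * w  # step, never touching m or v
    return w, m, v

w_l2, m_l2, _ = adam_step(1.0, 0.5, 0.0, 0.0, l2=1e-2)
w_wd, m_wd, _ = adam_step(1.0, 0.5, 0.0, 0.0, weight_decay=1e-2)
print(m_l2)  # 0.051 -> penalty contaminated the momentum
print(m_wd)  # 0.05  -> momentum sees only the raw gradient
```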


How to use lr_multipliers?

Suppose you have a model: Input -> Conv1D -> Conv1D -> LSTM -> Dense, and you've pretrained the Conv1D layers for feature extraction, and want to use an LSTM as an additional layer. If you use the same learning rate for all layers, your Conv1D may overfit - and a good workaround is to set per-layer lr, which could look something like:

  • 1e-4 -> 1e-4 -> 1e-3 -> 1e-3

(Input has no lr). So, pretrained layers' lr is 10x less. To achieve this, lr_multipliers detects layers by names specified in lr_multipliers dictionary keys, and applies the multipliers specified in their values to each of the layers. Example:

  • learning_rate=1e-3; lr_multipliers = {'conv1d_1':0.1, 'conv1d_2':0.1}

Names don't have to match exactly; substrings work also: {'conv1d':0.1} will apply 0.1 multiplier to every layer whose name contains the substring 'conv1d'.
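The substring rule above can be sketched as follows; `multiplier_for` is a hypothetical helper for illustration, not a function in the repo:

```python
# A layer gets a multiplier if any dict key is a substring of its name;
# unmatched layers keep the base learning rate (multiplier 1.0).
def multiplier_for(layer_name, lr_multipliers):
    for key, mult in lr_multipliers.items():
        if key in layer_name:
            return mult
    return 1.0

lr_multipliers = {'conv1d': 0.1}
print(multiplier_for('conv1d_2', lr_multipliers))  # 0.1 -> substring match
print(multiplier_for('lstm_1', lr_multipliers))    # 1.0 -> no match
```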


What is cosine annealing?

lr gets multiplied by eta_t, following the cosine-annealing schedule from SGDR: eta_t = 0.5 * (1 + cos(pi * t / T_i)), where t is the current iteration and T_i (= total_iterations) defines the schedule's interval (the max-to-min number of iterations).

For example, with total_iterations=24, at iteration t=12 we have eta_t=0.5, so if your lr=1e-3, the effective lr becomes 5e-4.
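A minimal sketch of that multiplier, assuming the SGDR formula with eta_min=0 and eta_max=1 (the repo's internals may differ in details such as cycle restarts):

```python
import math

def eta_t(t, total_iterations):
    """Cosine-annealing lr multiplier: 1.0 at t=0, 0.0 at t=total_iterations."""
    return 0.5 * (1 + math.cos(math.pi * t / total_iterations))

total_iterations = 24
print(eta_t(0, total_iterations))   # 1.0 -> full lr at the start
print(eta_t(12, total_iterations))  # 0.5 -> lr halved mid-cycle
print(eta_t(24, total_iterations))  # 0.0 -> lr fully annealed
```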

@OverLordGoldDragon OverLordGoldDragon added the question Further information is requested label Oct 27, 2019
@ChongWu-Biostat
Author

Got it. Thank you for your explanation. I understand it now.
