Continue training with new data #588
Updating the vocabulary during training is not supported (there is a technique for this implemented in some other NMT frameworks, but not in T2T; I think it has been discussed in the issues here, on Gitter, or in the Google group). However, T2T's internal subwords are robust enough to encode unseen words or even characters (although not optimally). Resumed training (e.g., for domain adaptation) is supported; just be careful with the learning rate (it follows a decay schedule driven by global_step, which is stored in the checkpoint) and with the ADAM moments (which are also stored in the checkpoint).
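For reference, a small hedged sketch of how one can inspect what a checkpoint carries into resumed training (the path is a placeholder; written against TF 1.x, which T2T targets):

```python
# Sketch: inspect the state a T2T checkpoint carries into resumed training.
# "train_dir/model.ckpt-250000" is a placeholder path, not a fixed name.
import tensorflow as tf

ckpt = "train_dir/model.ckpt-250000"
reader = tf.train.load_checkpoint(ckpt)

# global_step drives the learning-rate decay schedule on resumption.
print("global_step =", reader.get_tensor("global_step"))

# Variables ending in "/Adam" and "/Adam_1" hold the first and second
# Adam moments; they are also restored when training continues.
for name, shape in tf.train.list_variables(ckpt):
    if name.endswith("/Adam") or name.endswith("/Adam_1"):
        print(name, shape)
```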
Thanks a lot Martin, that's really useful! I have some more follow-up questions:
Thank you!
Yes, there are papers reporting that resetting the ADAM moments to zero from time to time helps (even when not doing domain adaptation). I'm not convinced this is the best way, but if you want, you can edit the checkpoints ad hoc; see
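As a hedged sketch of such ad-hoc checkpoint editing (placeholder paths, and not necessarily the script referenced above), the Adam slot variables can be zeroed roughly like this:

```python
# Sketch: copy a checkpoint while zeroing the Adam moment slots.
# ckpt_in/ckpt_out are placeholder paths; written against TF 1.x.
import tensorflow as tf

ckpt_in = "train_dir/model.ckpt-250000"
ckpt_out = "adapted_dir/model.ckpt-250000"

reader = tf.train.load_checkpoint(ckpt_in)
with tf.Graph().as_default(), tf.Session() as sess:
    new_vars = []
    for name, _ in tf.train.list_variables(ckpt_in):
        value = reader.get_tensor(name)
        if name.endswith("/Adam") or name.endswith("/Adam_1"):
            value = value * 0.0  # reset first/second moments to zero
        new_vars.append(tf.Variable(value, name=name))
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(new_vars).save(sess, ckpt_out)
```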
Yes, I think so.
No. Adam has "bias correction", which is a kind of warmup, and I am not sure whether the current implementation depends on global_step. In addition, there is the learning rate warmup if you use the default learning rate schedule.
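To see why the warmup matters on resumption, here is a hedged sketch of the noam-style schedule from the Transformer paper (the constants are illustrative, not necessarily T2T's exact defaults). Because the step comes from the checkpointed global_step, resumed training restarts deep in the rsqrt-decay phase and skips the warmup entirely:

```python
# Sketch of a noam-style schedule: linear warmup, then rsqrt decay.
# hidden_size and warmup_steps values are illustrative assumptions.
def noam_learning_rate(step, hidden_size=512, warmup_steps=16000):
    step = max(step, 1)
    return hidden_size ** -0.5 * min(step ** -0.5,
                                     step * warmup_steps ** -1.5)

print(noam_learning_rate(100))      # still warming up: very small
print(noam_learning_rate(250000))   # resumed training: decayed rsqrt value
```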
I had a similar issue and raised it in the google-group discussion. Lukasz mentioned something similar to @martinpopel ("...T2T internal subwords are robust enough to encode unseen words or even characters...") and how the individual characters that are part of the generated vocabulary can help in mapping unseen words. The OpenNMT approach for updating the vocabularies is discussed here: http://opennmt.net/OpenNMT/training/retraining/#updating-the-vocabularies
Best,
The problem is: how can I tokenize and encode the new training data using the subword list generated from my old data? Thank you very much.
For new data, you need to run t2t-datagen again, making sure the same vocabulary file is used (e.g., keep it in the data dir) so the new data is encoded with the existing subwords.
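As a hedged sketch (the vocab filename below is a placeholder; use the file t2t-datagen wrote into your data dir), the existing subword vocabulary can also be loaded and applied to new text directly:

```python
# Sketch: encode new-domain text with the vocabulary built from the
# original data, using T2T's SubwordTextEncoder.
from tensor2tensor.data_generators import text_encoder

encoder = text_encoder.SubwordTextEncoder(
    "data_dir/vocab.translate.32768.subwords")

ids = encoder.encode("An unseen in-domain sentence.")
print(ids)                  # subword ids under the OLD vocabulary
print(encoder.decode(ids))  # unseen words still round-trip via subwords/chars
```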
Hello all,
I'm using my own data to train a Transformer model for machine translation. I use the standard pipeline with t2t-datagen and t2t-trainer, and it works fine for training the model. In some use cases, such as domain adaptation, I need to continue training on a new dataset (e.g., domain-specific data) and, if possible, update the vocabulary with the new subwords.
Is this scenario supported in tensor2tensor?
Thank you!
Talaat