Continue training with new data #588
Updating the vocabulary during training is not supported (there is a technique for this implemented in some other NMT frameworks, but not in T2T; I think it has been discussed in the issues here, on Gitter, or in the Google group). However, T2T's internal subwords are robust enough to encode unseen words or even characters (although not optimally). Resumed training (e.g., for domain adaptation) is supported; just be careful with the learning rate (it follows a decay schedule driven by global_step, which is stored in the checkpoint) and with the ADAM moments (which are also stored in the checkpoint).
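For reference, a small hedged sketch of how one can inspect what a checkpoint carries into resumed training (the path is a placeholder; written against TF 1.x, which T2T targets):

```python
# Sketch: inspect the state a T2T checkpoint carries into resumed training.
# "train_dir/model.ckpt-250000" is a placeholder path, not a fixed name.
import tensorflow as tf

ckpt = "train_dir/model.ckpt-250000"
reader = tf.train.load_checkpoint(ckpt)

# global_step drives the learning-rate decay schedule on resumption.
print("global_step =", reader.get_tensor("global_step"))

# Variables ending in "/Adam" and "/Adam_1" hold the first and second
# Adam moments; they are also restored when training continues.
for name, shape in tf.train.list_variables(ckpt):
    if name.endswith("/Adam") or name.endswith("/Adam_1"):
        print(name, shape)
```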
Thanks a lot Martin, that's really useful! I have some more follow-up questions:
Thank you!
Yes, there are papers reporting that resetting the ADAM moments to zero from time to time helps (even when not doing domain adaptation). I'm not convinced this is the best way, but if you want, you can edit the checkpoints ad hoc; see
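As a hedged sketch of such ad-hoc checkpoint editing (placeholder paths, and not necessarily the script referenced above), the Adam slot variables can be zeroed roughly like this:

```python
# Sketch: copy a checkpoint while zeroing the Adam moment slots.
# ckpt_in/ckpt_out are placeholder paths; written against TF 1.x.
import tensorflow as tf

ckpt_in = "train_dir/model.ckpt-250000"
ckpt_out = "adapted_dir/model.ckpt-250000"

reader = tf.train.load_checkpoint(ckpt_in)
with tf.Graph().as_default(), tf.Session() as sess:
    new_vars = []
    for name, _ in tf.train.list_variables(ckpt_in):
        value = reader.get_tensor(name)
        if name.endswith("/Adam") or name.endswith("/Adam_1"):
            value = value * 0.0  # reset first/second moments to zero
        new_vars.append(tf.Variable(value, name=name))
    sess.run(tf.global_variables_initializer())
    tf.train.Saver(new_vars).save(sess, ckpt_out)
```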
Yes, I think so.
No. Adam has "bias correction", which is a kind of warmup, and I am not sure whether the current implementation depends on global_step. In addition, there is the learning rate warmup if you use the default learning rate schedule.
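To see why the warmup matters on resumption, here is a hedged sketch of the noam-style schedule from the Transformer paper (the constants are illustrative, not necessarily T2T's exact defaults). Because the step comes from the checkpointed global_step, resumed training restarts deep in the rsqrt-decay phase and skips the warmup entirely:

```python
# Sketch of a noam-style schedule: linear warmup, then rsqrt decay.
# hidden_size and warmup_steps values are illustrative assumptions.
def noam_learning_rate(step, hidden_size=512, warmup_steps=16000):
    step = max(step, 1)
    return hidden_size ** -0.5 * min(step ** -0.5,
                                     step * warmup_steps ** -1.5)

print(noam_learning_rate(100))      # still warming up: very small
print(noam_learning_rate(250000))   # resumed training: decayed rsqrt value
```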
I had a similar issue and raised it in the google-group discussion. Lukasz mentioned something similar to @martinpopel ("...T2T internal subwords are robust enough to encode unseen words or even characters...") and how the individual characters that are part of the generated vocabulary can help in mapping unseen words. The OpenNMT approach for updating the vocabularies is discussed here: http://opennmt.net/OpenNMT/training/retraining/#updating-the-vocabularies
Best,
The problem is: how can I tokenize and encode the new training data using the subword list generated from my old data? Thank you very much.
For new data, you need to run t2t-datagen again, making sure the same vocabulary file is used (e.g., keep it in the data dir) so the new data is encoded with the existing subwords.
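As a hedged sketch (the vocab filename below is a placeholder; use the file t2t-datagen wrote into your data dir), the existing subword vocabulary can also be loaded and applied to new text directly:

```python
# Sketch: encode new-domain text with the vocabulary built from the
# original data, using T2T's SubwordTextEncoder.
from tensor2tensor.data_generators import text_encoder

encoder = text_encoder.SubwordTextEncoder(
    "data_dir/vocab.translate.32768.subwords")

ids = encoder.encode("An unseen in-domain sentence.")
print(ids)                  # subword ids under the OLD vocabulary
print(encoder.decode(ids))  # unseen words still round-trip via subwords/chars
```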
Hello all,
I'm using my own data to train a Transformer model for machine translation. I use the standard pipeline with t2t-datagen and t2t-trainer, and it works fine for training the model. In some use cases, such as domain adaptation, I need to continue training on a new dataset (e.g., domain-specific data) and, if possible, update the vocabulary with the new subwords.
Is this scenario supported in tensor2tensor?
Thank you!
Talaat