
How to measure the number of training epochs #415

Open
martinpopel opened this issue Nov 13, 2017 · 22 comments

Comments

@martinpopel
Contributor

In order to compare with other NMT frameworks, I would like to know how many training epochs (i.e. passes over the whole training data) are done at the moment.
I can see the number of training (global) steps and I guess epochs = steps * batch_size / training_subwords.
So the question boils down to: how can I make T2T report (e.g. in the log) the number of subwords in the training data?

@rsepassi
Contributor

Yeah, this seems like a reasonable thing to want, but unfortunately it is not simple to do currently. The batch size varies because examples are bucketed by sequence length, which complicates the picture.

Counting the number of subwords would need a pass through the data on disk, probably best done by a separate script.
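
For anyone who wants to do that pass, here is a minimal sketch of such a script, assuming a subword vocabulary file produced by t2t-datagen and a plain-text corpus with one sentence per line (both file names below are placeholders):

from tensor2tensor.data_generators import text_encoder

# Load the subword vocabulary produced by t2t-datagen
# (the file name is a placeholder; adjust it to your data_dir).
encoder = text_encoder.SubwordTextEncoder("data_dir/vocab.translate.32768")

total_subwords = 0
with open("train.txt") as f:  # placeholder: one sentence per line
    for line in f:
        # encode() returns the list of subword ids for one sentence.
        total_subwords += len(encoder.encode(line.strip()))

print("total subwords:", total_subwords)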

@yuimo

yuimo commented Nov 29, 2017

@rsepassi hi, is "batch_size" the number of subwords of both the source and target sentences in a batch,
or only the number of subwords of the source sentences in a batch?
thanks a lot.

@martinpopel
Contributor Author

@yuimo: it is the maximum of source and target subwords, for each sentence. See

def _example_length(example):
  length = 0
  # Length of the example is the maximum length of the feature lengths
  for v in example.values():
    # For images the sequence length is the size of the spatial dimensions.
    feature_length = (tf.shape(v)[0] if len(v.get_shape()) < 3 else
                      tf.shape(v)[0] * tf.shape(v)[1])
    length = tf.maximum(length, feature_length)
  return length

@martinpopel martinpopel mentioned this issue Nov 29, 2017
@yuimo

yuimo commented Nov 30, 2017

@martinpopel i got it, thanks a lot

@ndvbd
Contributor

ndvbd commented Feb 13, 2018

@martinpopel , you meant to write:

epochs = steps * batch_size * worker_gpu / training_subwords

Right?

@martinpopel
Contributor Author

epochs = steps * batch_size * worker_gpu / training_subwords

Yes, exactly. In other words, epochs = steps * effective_batch_size / training_subwords.
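
For illustration, with made-up numbers (none of these come from a real run):

steps = 250000              # global training steps
batch_size = 4096           # subwords per batch, per GPU
worker_gpu = 8              # number of GPUs
training_subwords = 500e6   # total subwords in the training data

effective_batch_size = batch_size * worker_gpu             # 32768 subwords per step
epochs = steps * effective_batch_size / training_subwords  # ~16.4
print(epochs)  # an upper bound, since padding is ignored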

I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I have not had enough time to tidy it up, document it, and send it as a PR.

@ndvbd
Contributor

ndvbd commented Feb 13, 2018

@martinpopel It would be nice if T2T showed in TensorBoard how many epochs have been done during training.

Do you know if there are any rules of thumb with respect to how many epochs should be done for NMT tasks?

In addition, do you know if T2T goes over the training data in a deterministic way or in a randomized way? (Meaning: should 2 training invocations yield the exact same model?)

@martinpopel
Contributor Author

It would be nice if T2T showed in TensorBoard how many epochs have been done during training.

Yes, that would be nice, but there are two problems:

  • How to compute the number of epochs exactly? The formula above does not handle zero-padding, so it is just an upper bound on the number of epochs (I think). TensorBoard reports input_stats/targets_nonpadding_fraction and input_stats/inputs_nonpadding_fraction, so there is a way to compute the number of epochs (see the sketch after this list). Ideally, t2t-datagen should report to stderr the number of subwords (as my script does) and t2t-trainer should report the number of epochs (or how many steps make up one epoch, once the first epoch has ended).
  • How to present this number in TensorBoard? Currently, TensorBoard offers just "Step", "Relative" and "Wall" as options for the x-axis, and I doubt there is a way to provide other options (maybe plugins?). Also, I am not sure which is more helpful: epochs or the number of training examples? For a given training set, the two options don't change the curves, just the x-axis labels, but when comparing experiments with different training data sizes, I guess the number of training examples is more relevant.
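
A rough sketch of the padding correction mentioned in the first point (the fraction is a placeholder value read off TensorBoard; the other numbers are made up):

steps = 250000
effective_batch_size = 4096 * 8
training_subwords = 500e6
nonpadding_fraction = 0.85   # placeholder, read from input_stats/targets_nonpadding_fraction

upper_bound = steps * effective_batch_size / training_subwords   # ignores padding
corrected = upper_bound * nonpadding_fraction                    # rough correction
print(upper_bound, corrected)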

Do you know if there are any rules of thumb with respect to how many epochs should be done for NMT tasks?

The standard (and naive) answer is "until converged on the dev set", but this is difficult both to measure (how do you set the early-stopping parameters?) and to achieve. My training data has about half a gigaword, and even 18 epochs (11 days of training on 8 GPUs) were not enough to reach the highest possible BLEU.

In addition, do you know if T2T goes over the training data in a deterministic way or in a randomized way? (Meaning: should 2 training invocations yield the exact same model?)

It should be randomized yet deterministic (thanks to the fixed random seed), but I am still waiting for a definitive answer from the T2T authors; see #556 (comment) and the posts below.

@ndvbd
Contributor

ndvbd commented Feb 21, 2018

@martinpopel, why not simply go with the % of sentences (examples) completed, instead of subwords?

If we completed 100% of the examples in the training data -> we reached 1.0 epochs, and so on?
I don't think we need to go down to subword resolution.

@martinpopel
Contributor Author

@NadavB T2T computes batch_size in subwords (for translation problems with variable length). One batch may contain a small number of long sentences or a large number of short sentences.
T2T does not report the number of sentences processed, it reports just the number of steps (batches).
Thus, we need to know the total number of subwords in the training data, in order to estimate the number of epochs.
Of course, if you know the total number of sentences in the training data, you could estimate that x % of sentences are processed when x % of subwords are processed.

@ndvbd
Contributor

ndvbd commented Feb 27, 2018

I probably don't understand something. Why do we care about subwords when we talk about epochs?
In the training data, we have input and output sentences.
During training, these sentences are converted to subwords and then sent to the different GPUs. The code that takes these sentences knows how many sentences it took (and converted to subwords) in each step, so we can simply have a counter counting the number of sentences passed. Such bookkeeping must exist somewhere anyhow (some data reader), in order not to process a sentence twice. That's it. I don't understand why it is so difficult to keep track of how many sentences we read from the training files. I understand that "one batch may contain a small number of long sentences or a large number of short sentences", but we don't care how many sentences are in a batch. We only want to know how many sentences we took from the training data set before we converted them and sent them to the GPU: hold a counter, and that's it.

@martinpopel
Contributor Author

We can simply have a counter counting the number of sentences passed.

Yes, you could implement such a counter and send a PR. That would be great (and more precise than my subword-based estimates, which are biased because they do not take zero-padding into account).
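
For what it's worth, here is a hypothetical, untested sketch of how such a counter could be wired in as a tf.train.SessionRunHook. It assumes a counter variable and an update op that adds the batch's first dimension (the number of sentences) have been created somewhere in the model graph; all names here (sentences_seen, update_op, SentenceCounterHook) are made up for illustration:

import tensorflow as tf

# Somewhere in the input/model code (hypothetical):
#   sentences_seen = tf.get_variable(
#       "sentences_seen", [], tf.int64,
#       initializer=tf.zeros_initializer(), trainable=False)
#   update_op = tf.assign_add(
#       sentences_seen,
#       tf.cast(tf.shape(features["targets"])[0], tf.int64))

class SentenceCounterHook(tf.train.SessionRunHook):
    """Logs the cumulative number of training sentences every N steps."""

    def __init__(self, update_op, every_n_steps=1000):
        self._update_op = update_op
        self._every_n_steps = every_n_steps
        self._step = 0

    def before_run(self, run_context):
        # Run the counter update alongside the regular training fetches.
        return tf.train.SessionRunArgs(self._update_op)

    def after_run(self, run_context, run_values):
        self._step += 1
        if self._step % self._every_n_steps == 0:
            tf.logging.info("sentences seen so far: %d", run_values.results)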

@prigioni

Where should I modify the code to set the number of training steps?

@martinpopel
Contributor Author

I don't know the exact location for adding the epoch counter (but I have not spent much time searching for it), otherwise I would do it myself. Maybe it is possible to solve it with a hook in utils/trainer_lib.py. Note that tf.contrib.learn.Experiment is deprecated and should be replaced soon, but it seems that tf.estimator does not support continuous_train_and_eval. As this schedule is not intended for distributed train & eval anyway, I would suggest getting rid of tf.contrib.learn.Experiment and reimplementing it in pure TensorFlow, where it is much easier to count the number of epochs.

@DonPex

DonPex commented Sep 6, 2018

@martinpopel By "number of training subwords", do you mean the sum of all source-text subwords plus all target-text subwords used for training?

@martinpopel
Contributor Author

@DonPex: No. It is the maximum of source and target subwords, for each sentence. See the discussion above.

@DonPex

DonPex commented Sep 7, 2018

@martinpopel Thank you. I used your script to compute the maximum number of subwords, but you said that it should be only an estimate because of padding tokens.
So should I check input_stats/inputs_nonpadding_fraction and input_stats/targets_nonpadding_fraction and multiply them by the number of subwords to obtain the real number of subwords without padding?

I am using Google Colab, so I would like to know if it's possible to train a Transformer for at least one epoch in 12 hours (the maximum time allowed on Colab) with a custom dataset using a specific batch size.

@martinpopel
Contributor Author

martinpopel commented Sep 7, 2018

Yes, considering nonpadding_fraction should result in a more precise estimate.
I am not sure why "at least one epoch" is important in your use case. Usually you need more epochs anyway for good results (unless the task is simple and the data is large, in which case you may overfit well before reaching one epoch).
If you can store checkpoints and continue training in another Colab session, then you can try it anyway (T2T starts from a random part of the training data and shuffles the training files by default, I think).

@DonPex

DonPex commented Sep 7, 2018

My goal is just to feed the model as many of the subwords in the dataset as possible. So if I cannot complete one epoch in less than 12 h, I will have to use another Colab session and start the training from another random part of the data; that way I may skip some fractions of the dataset due to the randomness.

@coder1248

coder1248 commented Jan 7, 2020

epochs = steps * batch_size * worker_gpu / training_subwords

Yes, exactly. In other words, epochs = steps * effective_batch_size / training_subwords.

I wrote a simple script t2t_text2subwords.py for computing the number of subwords in train/test data, but I have not had enough time to tidy it up, document it, and send it as a PR.

@martinpopel may I ask something regarding the above formula? When training on a single TPU (v2), is the effective_batch_size equal to the batch_size, or to batch_size*8?
In other words, a single TPU has 8 cores. If batch_size is 2048, does that mean each core handles 2048 (so effective_batch_size is 2048*8), or is this 2048 split between the cores?
Thank you!

@martinpopel
Contributor Author

@coder1248: I would guess batch_size*8, but I am not sure, as I have never used TPUs for real training. I know T2T treats TPUs differently from CPUs and GPUs in several respects (e.g. preferring/requiring a fixed number of sentences per batch and "packed" problems), which perhaps also influences the estimate of the number of epochs.
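
Under that (unconfirmed) guess, the earlier estimate would just pick up a factor of 8; with illustrative numbers only:

steps = 50000
batch_size = 2048            # per TPU core, under the unconfirmed assumption above
num_tpu_cores = 8
training_subwords = 100e6    # made-up total

epochs = steps * batch_size * num_tpu_cores / training_subwords
print(epochs)                # ~8.2 epochs, if the per-core assumption holds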

@coder1248

Thanks again for your help martinpopel!
@rsepassi @lukaszkaiser could you kindly verify martinpopel's answer? Thanks!
