This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fixes Checkpointing #5220

Merged: dirkgr merged 40 commits into main from the Checkpointing branch on May 29, 2021

Conversation

dirkgr (Member) commented May 24, 2021

  • Checkpointing and restarting from a checkpoint now works when the training job is interrupted half-way through an epoch.
  • The checkpointer is no longer responsible for writing out the current best model. The trainer has to do this now.
  • GradientDescentTrainer now lives in its own file. I had to do this to break a circular dependency between Checkpointer and GradientDescentTrainer.
  • Callbacks can now save and restore state.
  • When training with moving average, restoring checkpoints now works correctly.
  • When re-starting an interrupted training job, the trainer will now read out the data loader even for epochs and batches that can be skipped. This is necessary to ensure that any random number generators used by the reader or data loader are in the same state as they were the first time the training job ran (see the sketch below).
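To make the last bullet concrete, here is a minimal sketch of the resume behavior, not the actual GradientDescentTrainer code; `epochs_completed` and `batches_in_epoch_completed` are placeholder names for the counters a checkpoint would restore.

```python
# Minimal sketch of the resume behavior described above (not the real trainer code).
# `epochs_completed` and `batches_in_epoch_completed` stand in for whatever
# counters the checkpoint actually restores.
def resume_training(data_loader, train_one_batch, num_epochs,
                    epochs_completed=0, batches_in_epoch_completed=0):
    for epoch in range(num_epochs):
        for batch_index, batch in enumerate(data_loader):
            already_trained = epoch < epochs_completed or (
                epoch == epochs_completed and batch_index < batches_in_epoch_completed
            )
            if already_trained:
                # Still pull the batch from the data loader so that any RNGs used
                # by the reader or loader advance exactly as in the original run,
                # but skip the gradient update itself.
                continue
            train_one_batch(batch)
```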

dirkgr (Member, Author) commented May 24, 2021

@epwalsh, you can look at this now, while I'm fixing tests. What do we need to change to make fairscale work?

epwalsh (Member) commented May 24, 2021

In a meeting now but I'll take a look afterwards

epwalsh (Member) left a comment

These are great improvements, but they don't really change the story with FairScale.

One thing that's missing is synchronization across distributed workers when gathering model and training state, since collecting the state associated with sharded parameters requires a distributed gather operation (each worker needs to send its shard of the data to the main process).

Another issue is that the optimizer state actually has to be collected through the FullyShardedDataParallel model wrapper (gather_full_optim_state).
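For illustration only, a rough sketch of the kind of save path this implies. The `gather_full_optim_state` call is the method named above; its exact signature, and everything else here, is an assumption rather than FairScale's confirmed API.

```python
import torch
import torch.distributed as dist


def save_sharded_checkpoint(fsdp_model, optimizer, path, is_primary):
    # All workers must reach this point together before gathering, since each
    # one has to contribute its shard of the parameters/optimizer state.
    dist.barrier()

    # Optimizer state is collected through the FullyShardedDataParallel wrapper
    # (method name taken from the comment above; signature is an assumption).
    full_optim_state = fsdp_model.gather_full_optim_state(optimizer)

    if is_primary:
        torch.save({"model": fsdp_model.state_dict(), "optimizer": full_optim_state}, path)

    # Hold everyone here until the primary worker has finished writing.
    dist.barrier()
```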

Comment on lines 49 to 53
save_completed_epochs: bool = True,
save_every_num_seconds: Optional[int] = None,
save_every_num_batches: Optional[int] = None,
keep_most_recent_by_count: Optional[int] = 2,
keep_most_recent_by_age: Optional[int] = None,
epwalsh (Member): Thank you, I hated the old names 💯

GradientDescentTrainer,
)
from allennlp.training.trainer import Trainer
from allennlp.training.gradient_descent_trainer import GradientDescentTrainer
epwalsh (Member): Nice. I've been wanting to move this to its own file for a while.

Comment on lines +80 to +84
def state_dict(self) -> Dict[str, Any]:
return {}

def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
pass
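These default no-op hooks mean a callback only needs to override them when it actually has state worth checkpointing. A hypothetical example (the callback name, its counter, and the simplified on_batch signature are made up for illustration):

```python
from typing import Any, Dict

from allennlp.training.callbacks import TrainerCallback


@TrainerCallback.register("batch_counter")
class BatchCounterCallback(TrainerCallback):
    """Illustrative only: counts batches and survives checkpoint/restore."""

    def __init__(self, serialization_dir: str) -> None:
        super().__init__(serialization_dir)
        self.batches_seen = 0

    def on_batch(self, trainer, *args, **kwargs) -> None:
        self.batches_seen += 1

    def state_dict(self) -> Dict[str, Any]:
        # Whatever is returned here ends up in the training checkpoint.
        return {"batches_seen": self.batches_seen}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        # Called when training resumes from a checkpoint.
        self.batches_seen = state_dict["batches_seen"]
```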
dirkgr (Member, Author) commented May 24, 2021

Deep in the throes of fixing all the tests, I'm wondering if I should have fixed this the other way around. Saving and restoring in the middle of an epoch was added to the checkpointer, but it's completely unsupported by any other part of the system. This is essentially a new piece of functionality.

dirkgr marked this pull request as ready for review on May 25, 2021, 22:52
dirkgr (Member, Author) commented May 25, 2021

Tests pass locally. I'm still fighting with mypy and the models repo. We might have to retrain some stuff (or at least patch the model configs), because the num_serialized_models_to_keep parameter went away.

But overall, this is ready to review.

@@ -152,6 +152,7 @@ jobs:
run: |
git clone https://github.com/allenai/allennlp-models.git
cd allennlp-models
git checkout Checkpointing
dirkgr (Member, Author):

Suggested change:
- git checkout Checkpointing

This will have to be removed before merging.

dirkgr (Member, Author) commented May 25, 2021

GradientDescentTrainer is by and large the same. While reviewing, only look at the bits that have to do with checkpointing, and the _start_after_* variables.

You can also review this one commit at a time. I kept the commits pretty clean and self-contained. That'll let you skip the big copy of GradientDescentTrainer.

dirkgr (Member, Author) commented May 25, 2021

We should do a minor version bump after this. It changes some public APIs.

dirkgr (Member, Author) commented May 27, 2021

@epwalsh, this is ready for a real review now.

dirkgr self-assigned this on May 27, 2021
epwalsh (Member) left a comment

This looks great. I just left a few comments.

Comment on lines +134 to +138
extra_copy_of_weights_just_for_mypy = Path(weights)
if extra_copy_of_weights_just_for_mypy.is_absolute():
weights_file = extra_copy_of_weights_just_for_mypy
else:
weights_file = Path(serialization_dir) / extra_copy_of_weights_just_for_mypy
epwalsh (Member): This is a little confusing. How about just use typing.cast?

dirkgr (Member, Author):

serialization_dir can still be a str at that point. It's not just to let mypy know what it is.
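To make the exchange concrete: the snippet above normalizes a weights value that may still be a plain str (relative or absolute) into a real path. A standalone sketch of that logic, with assumed names rather than the merged code:

```python
from pathlib import Path
from typing import Union


def resolve_weights_file(weights: Union[str, Path],
                         serialization_dir: Union[str, Path]) -> Path:
    # Both arguments may still be plain strings here, which is why the diff
    # converts explicitly rather than only casting for mypy's benefit.
    weights_path = Path(weights)
    if weights_path.is_absolute():
        return weights_path
    return Path(serialization_dir) / weights_path
```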

Comment on lines +28 to +41
save_completed_epochs : `bool`, (default=`True`)
Saves model and trainer state at the end of each completed epoch.
save_every_num_seconds : `int`, optional (default=`None`)
If set, makes sure we never go longer than this number of seconds between saving a model.
save_every_num_batches : `int`, optional (default=`None`)
If set, makes sure we never go longer than this number of batches between saving a model.
keep_most_recent_by_count : `int`, optional (default=`2`)
Sets the number of model checkpoints to keep on disk. If both `keep_most_recent_by_count` and
`keep_most_recent_by_age` are set, we'll keep checkpoints that satisfy either criterion.
If both are `None`, we keep all checkpoints.
keep_most_recent_by_age : `int`, optional (default=`None`)
Sets the number of seconds we'll keep a checkpoint before deleting it. If both
`keep_most_recent_by_count` and `keep_most_recent_by_age` are set, we'll keep checkpoints
that satisfy either criterion. If both are `None`, we keep all checkpoints.
epwalsh (Member): Nice, this is much more clear.

dirkgr (Member, Author): Unfortunately it breaks backwards compatibility. Worth it, I think, but not great.
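For reference, constructing the checkpointer with the renamed parameters might look roughly like this. The parameter names come from the diff above; the import path matches allennlp/training/checkpointer.py, but the serialization_dir argument and the specific values are assumptions:

```python
from allennlp.training.checkpointer import Checkpointer

checkpointer = Checkpointer(
    serialization_dir="/tmp/my_run",   # assumed argument, not shown in the diff
    save_completed_epochs=True,        # checkpoint at the end of every epoch
    save_every_num_seconds=3600,       # and at least once an hour mid-epoch
    keep_most_recent_by_count=2,       # keep only the two newest checkpoints
)
```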

allennlp/training/checkpointer.py (outdated, resolved)

CHANGELOG.md (outdated)
@@ -40,6 +41,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

- When `PretrainedTransformerIndexer` folds long sequences, it no longer loses the information from token type ids.
- Fixed documentation for `GradientDescentTrainer.cuda_device`.
- Re-starting a training run from a checkpoint in the middle of an epoch now works correctly.
- When using the "moving average" weights smoothing feature of the trainer, training checkpoints would also get smoothed, with strange results for resuming a training job. This has been fixed.
- When re-starting an interrupted training job, the trainer will now read out the data loader even for epochs and batches that can be skipped. This ensures that any random number generators used by the reader or data loader are in the same state as they were the first time the training job ran.
epwalsh (Member): This sounds good, in theory, but there are probably other things that affect the random number generators used by the reader and data loader. I don't think we can guarantee the same order.

dirkgr (Member, Author):
Hmm. I wrote it this way because in Quark it worked out that way. I had good enough control over the RNGs that it was deterministic.

In AllenNLP, we can't guarantee that none of the things we're skipping when restoring from a checkpoint (the forward() method for example) modify the RNG state. I guess I'll say that this is an attempt to ensure deterministic randomness, but does not guarantee it. At the same time, we should encourage components to use their own RNG instead of using the global one, so they don't affect each other.
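The "use their own RNG" suggestion could look something like this; a hypothetical sketch, not existing AllenNLP code:

```python
import random


class ShufflingReader:
    """Hypothetical reader that owns its RNG instead of using the global one."""

    def __init__(self, seed: int = 13370) -> None:
        # A private Random instance: whatever other components do to the global
        # `random` module (e.g. inside forward()), this reader's shuffle order
        # stays the same, so replaying skipped epochs remains deterministic.
        self._rng = random.Random(seed)

    def shuffle(self, instances: list) -> list:
        shuffled = list(instances)
        self._rng.shuffle(shuffled)
        return shuffled
```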

dirkgr (Member, Author): It's actually quite bad if this doesn't work. If we don't guarantee the order of instances, and we stop training 10 times in the middle of an epoch and restart it, we might end up training on the same instance 10 times.

dirkgr (Member, Author): 11 times, even.

dirkgr enabled auto-merge (squash) on May 29, 2021, 02:03
dirkgr merged commit c5bff8b into main on May 29, 2021
dirkgr deleted the Checkpointing branch on May 29, 2021, 02:18
Abhishek-P pushed a commit to Abhishek-P/allennlp that referenced this pull request Aug 11, 2021
* Removes unused variable

* Formatting

* Make sure we always restore the model's weights properly

* Give TrainerCallbacks the ability to save and load state dicts

* Give MovingAverage the ability to save and load state dicts

* Do not set gradients to None

* Typo

* Remove unused variable

* Typo

* Entirely new checkpointing code

* Formatting

* Make mypy happy

lol

* Makes the no-op trainer work with the new checkpointer

* Mark epochs as completed when they're skipped

* Changelog

* Fixes how we get the best weights after a training run

* Mypy is annoying

* Callback fixes

* Fix the no op trainer

* Simplify

* Assorted checkpointer fixes

* Mypy is now happy

* Fixed all the tests except for one

* Removed unused variable

* Fix trainer restore logic

* Fix test for trainer restore logic

* Check the Checkpointing branch of the models repo

* Help mypy along

* Fixed finalizing logic

* More mypy stuff

* Update allennlp/training/checkpointer.py

Co-authored-by: Pete <petew@allenai.org>

* Make weaker claims

Co-authored-by: Pete <petew@allenai.org>