[Flax] Add general conversion script #10809
Conversation
)

# Need to change some parameters name to match Flax names so that we don't have to fork any layer
for pt_key, pt_tensor in pt_state_dict.items():
The conversion function can be kept short & concise by forcing FlaxBert to have the exact same model names and architecture as PyTorch's BERT
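To illustrate the idea, here is a minimal sketch of what such a general renaming loop could look like (the rules and helper name are assumptions for illustration, not the exact code in modeling_flax_pytorch_utils.py):

```python
import jax.numpy as jnp


def convert_pt_state_dict_to_flax_sketch(pt_state_dict):
    """Sketch: rename PyTorch keys into Flax-style keys instead of forking layers."""
    jax_state = {}
    for pt_key, pt_tensor in pt_state_dict.items():
        pt_tuple_key = tuple(pt_key.split("."))

        # LayerNorm parameters are sometimes stored as "gamma"/"beta" in PyTorch checkpoints
        if pt_tuple_key[-1] == "gamma":
            pt_tuple_key = pt_tuple_key[:-1] + ("weight",)
        elif pt_tuple_key[-1] == "beta":
            pt_tuple_key = pt_tuple_key[:-1] + ("bias",)

        # Dense kernels are (out, in) in PyTorch but (in, out) in Flax
        if pt_tuple_key[-1] == "weight" and pt_tensor.ndim == 2:
            pt_tuple_key = pt_tuple_key[:-1] + ("kernel",)
            pt_tensor = pt_tensor.T

        # Flattened tuple keys for brevity; the real function would unflatten them
        # into the nested Flax param dict (further rules omitted here).
        jax_state[pt_tuple_key] = jnp.asarray(pt_tensor.numpy())
    return jax_state
```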
@@ -121,11 +122,6 @@ def params(self, params: Union[Dict, FrozenDict]):
        )
        self._params = freeze(params)

    @staticmethod
We can delete all model-specific conversion methods now :-)
elif pt_tuple_key[-1] == "beta":
    pt_tuple_key = pt_tuple_key[:-1] + ("bias",)

# THIS AND MORE WOULD BE NEEDED IF ATTENTION FN IS USED
This and much more code would have to be added if we decided to stick with flax.linen.SelfAttention.
config: BertConfig
dtype: jnp.dtype = jnp.float32  # the dtype of the computation

def setup(self):
    self.self_attention = nn.attention.SelfAttention(
flax.linen.SelfAttention is removed and the query, key and value weights are created the same way we do it for PyTorch.
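As a rough sketch of that pattern (module and attribute names are illustrative assumptions, not the PR's final code), the projections become plain nn.Dense layers whose names mirror PyTorch's BertSelfAttention:

```python
import flax.linen as nn
import jax.numpy as jnp
from transformers import BertConfig


class FlaxBertSelfAttentionSketch(nn.Module):
    """Sketch: explicit query/key/value projections so parameter names match PyTorch 1-to-1."""

    config: BertConfig
    dtype: jnp.dtype = jnp.float32  # the dtype of the computation

    def setup(self):
        # Same attribute names as PyTorch's BertSelfAttention ("query", "key", "value"),
        # so the general conversion needs no model-specific renaming here.
        self.query = nn.Dense(self.config.hidden_size, dtype=self.dtype)
        self.key = nn.Dense(self.config.hidden_size, dtype=self.dtype)
        self.value = nn.Dense(self.config.hidden_size, dtype=self.dtype)
        # __call__ (head splitting + dot_product_attention) omitted for brevity
```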
if not deterministic and self.dropout_rate > 0.0:
    dropout_rng = self.make_rng("dropout")

attn_output = dot_product_attention(
Flax's flax.linen.dot_product_attention function can still save us quite some code.
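For reference, a small self-contained example of calling the functional helper (shapes and keyword arguments as I understand flax.linen.dot_product_attention; a sketch, not the PR's code):

```python
import jax
import jax.numpy as jnp
import flax.linen as nn

# flax.linen.dot_product_attention expects (batch, seq_len, num_heads, head_dim)
# for query/key/value.
batch, seq_len, num_heads, head_dim = 2, 8, 4, 16
q_rng, k_rng, v_rng = jax.random.split(jax.random.PRNGKey(0), 3)

query = jax.random.normal(q_rng, (batch, seq_len, num_heads, head_dim))
key = jax.random.normal(k_rng, (batch, seq_len, num_heads, head_dim))
value = jax.random.normal(v_rng, (batch, seq_len, num_heads, head_dim))

# The helper handles scaling, softmax and the weighted sum over values.
attn_output = nn.dot_product_attention(query, key, value, deterministic=True)

# Merge the heads back into the hidden dimension: (batch, seq_len, num_heads * head_dim)
attn_output = attn_output.reshape(batch, seq_len, -1)
print(attn_output.shape)  # (2, 8, 64)
```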
return jax_state

# THIS AND MORE WOULD BE NEEDED IF WE KEEP nn.self_attention
Keeping flax.linen.SelfAttention would force us to have all these model-specific conversion look-up tables, which I don't think is worth it.
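For context, a purely hypothetical example of the kind of per-model look-up table this would force on every architecture (names invented for illustration):

```python
# Hypothetical mapping from flax.linen.SelfAttention's nested parameter layout
# back to PyTorch BERT's flat names -- every model would need its own table like this,
# plus reshaping rules for the split-head kernels.
BERT_SPECIFIC_KEY_MAP = {
    ("attention", "self_attention", "query", "kernel"): ("attention", "self", "query", "weight"),
    ("attention", "self_attention", "key", "kernel"): ("attention", "self", "key", "weight"),
    ("attention", "self_attention", "value", "kernel"): ("attention", "self", "value", "weight"),
    ("attention", "self_attention", "out", "kernel"): ("attention", "output", "dense", "weight"),
    # ... one entry (or rule) per mismatching parameter
}
```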
Great work! My only concern is to make sure we don't lose any performance by not using nn.linen.SelfAttention. If we are just using the same code as its implementation, there is no reason for concern, but it's good to double-check.
Otherwise, I agree it's better to re-implement it than to have custom weight loading logic.
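One way to double-check could be a quick micro-benchmark of the old and new attention modules under jax.jit (a hedged sketch with an assumed helper, not something from the PR):

```python
import time

import jax


def benchmark(apply_fn, params, inputs, n_iters=50):
    """Rough timing helper: jit-compile, warm up once, then average wall-clock time."""
    jitted = jax.jit(apply_fn)
    jax.block_until_ready(jitted(params, inputs))  # first call triggers compilation
    start = time.perf_counter()
    for _ in range(n_iters):
        out = jitted(params, inputs)
    jax.block_until_ready(out)
    return (time.perf_counter() - start) / n_iters


# e.g. compare benchmark(old_module.apply, params, hidden_states)
# against benchmark(new_module.apply, params, hidden_states) on identical inputs
```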
Great, this is very clean! If this has no performance impact, this is a very welcome change.
class FlaxBertSelfOutput(nn.Module):
    config: BertConfig
    dtype: jnp.dtype = jnp.float32  # the dtype of the computation
Is this handled automatically when using FP16/bfloat16?
Great! Yeah, I'll talk with @avital about this next week (hopefully) :-)
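For what it's worth, my understanding of the Flax convention (an assumption, not something confirmed in this thread) is that the dtype attribute only takes effect where it is explicitly passed to submodules; a tiny sketch:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class TinyBlock(nn.Module):
    """Sketch: computation runs in self.dtype, parameters stay float32 by default."""

    features: int = 8
    dtype: jnp.dtype = jnp.float32  # the dtype of the computation

    @nn.compact
    def __call__(self, x):
        # dtype has to be threaded into each submodule explicitly
        return nn.Dense(self.features, dtype=self.dtype)(x)


block = TinyBlock(dtype=jnp.bfloat16)
params = block.init(jax.random.PRNGKey(0), jnp.ones((1, 4)))
out = block.apply(params, jnp.ones((1, 4)))
print(out.dtype)  # bfloat16 activations, while params remain float32
```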
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
…aten/transformers into add_flax_conversion
* save intermediate
* finish first version
* delete some more
* improve import
* fix roberta
* Update src/transformers/modeling_flax_pytorch_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* Update src/transformers/modeling_flax_pytorch_utils.py
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* small corrections
* apply all comments
* fix deterministic
* make fix-copies
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
What does this PR do?
This PR changes the weight architecture of FlaxBertModel so that it corresponds 1-to-1 to PyTorch's version of BertModel. This means that some weights had to be renamed (e.g. "layer_norm" -> "LayerNorm", since PyTorch uses "LayerNorm") and that some new flax.linen.Module classes, such as FlaxBertSelfOutput, had to be created. As can be seen, the PT => Flax conversion function is now kept very general and can be applied to all models, so that we can fully delete any model-specific conversion logic.

The PR has one drawback however: Flax's official SelfAttention module cannot be used anymore, since it doesn't give us enough flexibility to convert PyTorch weights to Flax weights without a model-specific conversion function. FlaxBERT's new attention modules fully correspond to PyTorch BERT's attention modules and are IMO still kept quite short by relying on Flax's dot_product_attention function. Another drawback is that for auto-regressive Transformer models we will have to manually add all the code for cached / auto-regressive attention to the attention module (which we do for PyTorch anyway) instead of being able to reuse the existing code in nn.linen.SelfAttention -> see here: https://github.com/google/flax/blob/e31063da71bd7a4df137b000df6a48b0cea35a2b/flax/linen/attention.py#L202.

All in all, rewriting parts of flax.linen.SelfAttention is the right choice here though, because it allows us to have a much cleaner conversion function with very little downside IMO (slightly higher maintenance because we need to copy-paste a bit more code).

@LysandreJik @sgugger - could you check if you agree more or less with my solution here (below I left some comments to showcase the trade-offs a bit better)? I'll then clean the code & upload the new weight structure :-)
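If I understand the end state correctly, loading a PyTorch checkpoint into the Flax model then reduces to the usual from_pt path (a usage sketch assuming the standard from_pretrained API, not code from this PR):

```python
from transformers import FlaxBertModel

# Convert the PyTorch state dict on the fly using only the general PT -> Flax
# renaming rules -- no model-specific conversion method involved.
model = FlaxBertModel.from_pretrained("bert-base-uncased", from_pt=True)
```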
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.