
Compile compatibility for decoder-only models #32617

Merged
6 commits merged on Sep 9, 2024

Conversation

@zucchini-nlp (Member) commented Aug 12, 2024

What does this PR do?

Recently we merged a few PRs deprecating the old-style cache in all decoder-only models. This PR is a continuation of that work: here we verify that all newly deprecated models support the static cache and are compatible with torch.compile. The main change is in RoPE, to get rid of dynamic control flow.

A few exceptions cannot be supported yet: MoE models and some others with dynamic control flow, such as Phi3 or Chameleon.

Ran test_generate_compile_fullgraph and test_static_cache_matches_dynamic on all models + ran slow tests on models touched by this PR.

In the next PR I can start deprecating the old cache in encoder-decoder models, starting from the Bart and GPT models.
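For context, a minimal sketch of the kind of check being enabled here; the checkpoint and exact options are illustrative, not the test code itself:

```python
# Minimal sketch, not the actual test: generate with a static KV cache and a
# fully compiled forward pass, so any remaining dynamic control flow errors out.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # illustrative decoder-only checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# fullgraph=True refuses to fall back on graph breaks, which is what the
# fullgraph compile test relies on to surface dynamic control flow.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("Static caches make compile happy because", return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    cache_implementation="static",  # allocates a fixed-shape StaticCache
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```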

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@gante (Member) left a comment

Added a few comments, mostly about aligning with Llama.

> Ran test_generate_compile_fullgraph and test_static_cache_matches_dynamic on all models + ran slow tests on models touched by this PR.

💛

src/transformers/generation/utils.py (outdated)
src/transformers/models/bloom/modeling_bloom.py (outdated)
```diff
@@ -899,9 +895,24 @@ def prepare_inputs_for_generation(
         # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
         if inputs_embeds is not None and cache_position[0] == 0:
-            model_inputs = {"inputs_embeds": inputs_embeds}
+            model_inputs = {"inputs_embeds": inputs_embeds, "input_ids": None}
```
Member:

missing `# Copied from ...`?

Member Author:

Not really; Bloom has ALiBi and needs a 2D attention mask for that, so we can't expand it to 4D. Instead we append zeros to the attention mask to give it a static shape.
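A rough illustration of that idea (hypothetical helper, not the actual Bloom code): the mask stays 2D for ALiBi and is zero-padded up to the static cache length, so its shape never changes between steps.

```python
# Hypothetical helper illustrating the comment above; not the Bloom implementation itself.
import torch
import torch.nn.functional as F

def pad_mask_to_static_length(attention_mask: torch.Tensor, target_length: int) -> torch.Tensor:
    """Keep the mask 2D (needed to build ALiBi biases) but pad with zeros to a fixed length."""
    missing = target_length - attention_mask.shape[-1]
    if missing > 0:
        # zeros mark not-yet-filled cache slots as masked out
        attention_mask = F.pad(attention_mask, (0, missing), value=0)
    return attention_mask

mask = torch.ones(2, 5, dtype=torch.long)           # 5 tokens seen so far
static_mask = pad_mask_to_static_length(mask, 16)   # padded to the static cache size
print(static_mask.shape)                            # torch.Size([2, 16])
```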

src/transformers/models/falcon/modeling_falcon.py (outdated)
src/transformers/models/gpt_neox/modeling_gpt_neox.py (outdated)
src/transformers/models/mixtral/modeling_mixtral.py (outdated)
src/transformers/models/phi/modeling_phi.py (outdated)
src/transformers/models/qwen2_moe/modeling_qwen2_moe.py (outdated)
src/transformers/models/stablelm/modeling_stablelm.py (outdated)
src/transformers/models/starcoder2/modeling_starcoder2.py (outdated)
@zucchini-nlp (Member Author):

Updated with @gante's comments and used the new RoPE modeling in all models. Ready for review!

@zucchini-nlp (Member Author):

Failing tests are not related.

@ArthurZucker (Collaborator) left a comment

💎 Thanks so much for this tedious work, well done 🥳
What is left is to make sure the compile tests pass!

src/transformers/models/bloom/modeling_bloom.py (outdated)
Collaborator:

Does it support compile? (not seeing the `supports_static_cache` flag)


```diff
@@ -273,9 +380,29 @@ def rotate_half(x):
     return torch.cat((-x2, x1), dim=-1)


-def apply_rotary_pos_emb(q, k, cos, sin, offset: int = 0):
```
Collaborator:

This is potentially breaking, no? (no more `offset`)

Member Author:

Hmm right, lemme check this.

Member Author:

Update: just verified that we don't need to slice anymore, because we apply RoPE directly on the current positions. Previously we applied RoPE for all positions up to the current one and had to slice out the cached positions.
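For reference, the offset-free helper the models converge on looks roughly like the Llama version; this is a sketch from memory, not copied from the PR, but it shows why no slicing is needed: cos and sin are computed only for the current positions.

```python
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # cos/sin already correspond to the current positions, so there is no `offset`
    # argument and no slicing of cached positions -- hence no dynamic control flow.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
```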


```python
if past_key_value is not None:
    # Activate slicing cache only if the config has a value `sliding_windows` attribute
    cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
    kv_seq_len = key_states.shape[-2] + cache_position[0]
```
Collaborator:

I don't remember why we don't use `cache_position[-1]`.

Member Author:

Because `cache_position[-1]` is the index of the last new token, which only equals the past KV length during decoding; adding it to `key_states.shape[-2]` gives an incorrect length in pre-fill or uncached generation. Maybe we should switch to simply past_length = cache_position[-1] everywhere?
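A small worked example of the two choices (illustrative numbers only; `kv_seq_len` below is a toy helper, not transformers code):

```python
import torch

def kv_seq_len(new_len: int, cache_position: torch.Tensor) -> int:
    # mirrors key_states.shape[-2] + cache_position[0]
    return new_len + int(cache_position[0])

# Pre-fill: 7-token prompt, empty cache.
prefill_positions = torch.arange(7)            # [0, 1, ..., 6]
assert kv_seq_len(7, prefill_positions) == 7   # correct total KV length
assert int(prefill_positions[-1]) == 6         # index of last new token, NOT the past length

# Decoding: 7 tokens already cached, 1 new token.
decode_positions = torch.arange(7, 8)          # [7]
assert kv_seq_len(1, decode_positions) == 8    # correct total KV length
assert int(decode_positions[-1]) == 7          # equals the past length only in this case
```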

@gante (Member) left a comment

Thank you for these very laborious changes 🙏

src/transformers/models/falcon/modeling_falcon.py (outdated)
src/transformers/models/falcon/modeling_falcon.py (outdated)
src/transformers/models/falcon/modeling_falcon.py (outdated)
tests/models/falcon/test_modeling_falcon.py
@zucchini-nlp (Member Author):

@simonJJJ I added the new RoPE embedding for Qwen2-VL in this PR. Since I changed Qwen2, the changes were automatically propagated with copy statements. I remember you had a PR to fix RoPE for FA2; can you check if the current version works as you expect?
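For readers unfamiliar with the propagation mechanism mentioned above: this is the repository's `# Copied from` convention, where `make fix-copies` re-syncs marked functions with their source. The target path and signature below are illustrative, not quoted from the PR.

```python
# Illustrative only: a `# Copied from` marker keeps this function in sync with its
# source; running `make fix-copies` re-applies the source body whenever it changes.
# Copied from transformers.models.qwen2.modeling_qwen2.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    ...
```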

@zucchini-nlp (Member Author):

@ArthurZucker @gante changed the deprecation to v4.46 and added Qwen2-VL. Ran the tests again to check everything is okay. Let me know if you have any comments.

```diff
@@ -870,7 +870,7 @@ def _update_causal_mask(
         # to infer the attention mask.

         # cache_position must be valid here no matter which cache we use
-        past_seen_tokens = cache_position[0] if past_key_values is not None else 0
+        past_seen_tokens = past_key_values.get_seq_length() if past_key_values is not None else 0
```
Member Author:

Same as in Llama: using cache_position here is dynamic control flow, which compile does not currently support. The fullgraph-compile test fails without this change.
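For intuition, a tiny hypothetical repro of the failure mode being avoided (not the transformers code): a value read out of `cache_position` is a tensor, so any Python branch on it becomes data-dependent control flow, which fullgraph compilation rejects. That is why the diff above reads the length from the cache object instead.

```python
# Hypothetical repro, not the transformers code.
import torch

@torch.compile(fullgraph=True)
def past_length(cache_position: torch.Tensor) -> torch.Tensor:
    past_seen_tokens = cache_position[0]   # 0-d tensor; its value is unknown at trace time
    if past_seen_tokens > 0:               # Python `if` on a tensor -> data-dependent control flow
        return past_seen_tokens
    return torch.zeros((), dtype=cache_position.dtype)

# With fullgraph=True this errors out instead of silently graph-breaking:
# past_length(torch.arange(4))
```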

@gante (Member) commented Sep 6, 2024

@zucchini-nlp happy with the changes, feel free to merge! (given that you mentioned that you re-ran the tests 💛 )

@zucchini-nlp (Member Author):

Yes, I was just thinking to rebase on main and re-run the tests one more time.

@zucchini-nlp (Member Author):

Tests are passing, including the slow ones. So, merging.

@zucchini-nlp merged commit 65bb284 into huggingface:main on Sep 9, 2024
23 checks passed
@anijain2305 (Contributor):

Can we update the tracker in #28981?

itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
* squash into one commit

* add qwen2-vl for rope standardization

* fix mistral compile

* fix qwen2-vl

* fix-copies