
Add Gemma Support #393

Merged: 5 commits into casper-hansen:main on Mar 11, 2024
Conversation

TechxGenus
Contributor

Add the latest Google Gemma model.

@casper-hansen
Owner

casper-hansen commented Mar 10, 2024

Hi @TechxGenus, great to see Gemma support. I tested your code and the quantization seems to work, although I have some issues measuring perplexity on the Gemma model series in general.

I am getting some odd sizes for the model once saved (6GB shard + 600MB shard):

-rw-rw-rw-  1 root root 6558499704 Mar 10 16:18 model-00001-of-00002.safetensors
-rw-rw-rw-  1 root root  614576896 Mar 10 16:18 model-00002-of-00002.safetensors

However, when I tested the fused modules, I got the following error:

Traceback (most recent call last):
  File "/workspace/AutoAWQ/examples/generate.py", line 29, in <module>
    generation_output = model.generate(
  File "/workspace/AutoAWQ/awq/models/base.py", line 111, in generate
    return self.model.generate(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1544, in generate
    return self.greedy_search(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2404, in greedy_search
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/gemma/modeling_gemma.py", line 1073, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/AutoAWQ/awq/modules/fused/model.py", line 119, in forward
    h, _, past_key_value = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/AutoAWQ/awq/modules/fused/block.py", line 113, in forward
    attn_output, _, past_key_value = self.attn.forward(
  File "/workspace/AutoAWQ/awq/modules/fused/attn.py", line 198, in forward
    xqkv = xqkv.view((bsz, seqlen) + self.attention_shapes["xqkv_view"])
RuntimeError: shape '[1, 47, 48, 192]' is invalid for input of size 577536
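
For reference, the numbers in this error line up with Gemma-7B's published attention config (hidden_size=3072, num_heads=16, head_dim=256; these values come from the public config, not from this PR). A minimal sketch of the arithmetic, which suggests the fused view derived head_dim as hidden_size // num_heads instead of reading the configured value:

```python
# Sketch only: reproduce the size mismatch from the error message using
# Gemma-7B's public attention config. Not code from this PR.
bsz, seqlen = 1, 47                        # from the failing call
hidden_size, num_heads, head_dim = 3072, 16, 256

actual = bsz * seqlen * 3 * num_heads * head_dim        # fused q, k, v
print(actual)                                           # 577536, the input size

derived_head_dim = hidden_size // num_heads             # 192, not 256
print(bsz * seqlen * 3 * num_heads * derived_head_dim)  # 433152 = 1 * 47 * 48 * 192
```

Note that for gemma-2b the naive derivation happens to coincide with the configured value (2048 // 8 = 256), which would be consistent with the 2B model fusing correctly while the 7B model fails.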

@TechxGenus
Contributor Author

Yes, the quantized model file size is odd. This may be related to Google's design, as Gemma has a very large embedding layer.
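
For scale, a back-of-the-envelope estimate of that embedding layer, assuming Gemma's public configs (vocab_size 256000; hidden_size 2048 for 2B and 3072 for 7B) and fp16 embeddings, which AWQ leaves unquantized:

```python
# Rough fp16 size of Gemma's embedding table (AWQ quantizes the linear
# layers, not the embeddings). Config values are from the public Gemma configs.
vocab_size = 256_000
for name, hidden_size in [("gemma-2b", 2048), ("gemma-7b", 3072)]:
    size_gb = vocab_size * hidden_size * 2 / 1e9  # 2 bytes per fp16 param
    print(f"{name}: ~{size_gb:.2f} GB")  # ~1.05 GB and ~1.57 GB
```

An embedding of that size would account for much of the extra checkpoint weight noted above.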
I cannot seem to reproduce this error. I used the quantized gemma-2b-it model to run examples/generate.py and got the following results:

Replacing layers...: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:01<00:00, 14.85it/s]
Fusing layers...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 106.18it/s]


The answer is in the center of the Earth.

The statement is a trick question, as the person is standing on the surface of the Earth and walking one mile south, one mile west and one mile north will not change their position.

It looks correct.

@TechxGenus
Contributor Author

I reproduced this error when running gemma-7b-it-AWQ, though gemma-2b-AWQ works well.

Additionally, I discovered that the latest transformers release seems to have changed the implementation of model.generate, so the existing fused layers needed modification to keep working. I tested TheBloke/Llama-2-7B-AWQ, and after the modification its output was consistent with the non-fused path (more testing needed; a sketch of the check is below).
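
A minimal sketch of that consistency check, assuming AutoAWQ's fuse_layers flag on from_quantized, a CUDA device, and greedy decoding so the two paths are directly comparable:

```python
# Sketch: with greedy decoding, fused and non-fused generation should match
# token-for-token. Assumes AutoAWQ's `fuse_layers` flag and a CUDA device.
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/Llama-2-7B-AWQ"
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids.cuda()

outputs = {}
for fused in (False, True):
    model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=fused)
    out = model.generate(input_ids, max_new_tokens=32, do_sample=False)
    outputs[fused] = tokenizer.decode(out[0], skip_special_tokens=True)
    del model
    torch.cuda.empty_cache()  # free VRAM before loading the other variant

assert outputs[False] == outputs[True], "fused path diverged from non-fused"
```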

@TechxGenus
Contributor Author

I fixed the error; generation should work normally now.

@casper-hansen
Owner

Excellent work @TechxGenus. Thanks for your contribution.

@casper-hansen casper-hansen merged commit 94e73f0 into casper-hansen:main Mar 11, 2024