
[Bug] AWQ Model Fails Loading Adapter #1915

Open
1 of 2 tasks
vladrad opened this issue Jul 3, 2024 · 4 comments

@vladrad

vladrad commented Jul 3, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When running the repo example, I chose the YurtsAI/Meta-Llama-3-8B-Instruct-AWQ model with the traderpedroso/llama3-8b-lora adapter.

I know the adapter was trained on the 4-bit base model; I'm not sure whether this works with AWQ.

    self.engine = Engine(model_path=model_path,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 153, in __init__
    _paging_adapters(adapters,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 68, in _paging_adapters
    model_agent.paging_adapters(weight_maps)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 715, in paging_adapters
    weight_map.cache_adapter(lora_linears, cpu_caches)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/adapter/adapter.py", line 226, in cache_adapter
    assert len(lora_linears) == len(caches), (
AssertionError: len(lora_linears) == len(caches)

If I comment out the len(lora_linears) == len(caches) assertion, the adapter merges... but I'm not sure whether it is supposed to work that way or not.

Reproduction

My script:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

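# 'lora_name_1' is the adapter name that must be passed as adapter_name
# when calling the pipeline below.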
backend_config = PytorchEngineConfig(session_len=2048,
                                     adapters=dict(lora_name_1='traderpedroso/llama3-8b-lora'))
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('YurtsAI/Meta-Llama-3-8B-Instruct-AWQ',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': '您猜怎么着'
}]]
response = pipe(prompts, gen_config=gen_config, adapter_name='lora_name_1')
print(response)

Environment

Running latest version of LMDeploy.

Error traceback

No response

@lvhan028
Collaborator

lvhan028 commented Jul 4, 2024

4-bit inference in the PyTorch engine is still under development. We are implementing support for 4-bit quantized models (the AWQ quantization method) in the PyTorch engine (#1913). Stay tuned.

@vladrad
Author

vladrad commented Jul 4, 2024

Wow, you all are fast.

@vladrad
Author

vladrad commented Jul 4, 2024

Let me know if I can help out. I'd be happy to test; I'm also capable of coding, but this area is not my expertise :D. So would this mean any LoRA adapter should be able to mount on top of an AWQ quant model? Or do I need to fine-tune on an AWQ model? It seems like the LoRA adapter would just be mounted on top.

You all are amazing.

@grimoire
Collaborator

grimoire commented Jul 8, 2024

#1913

PyTorchEngine uses AwqLoraLinear. Adapters can be applied to an AWQ model without fine-tuning: the base linear is forwarded with w4a16 support while the adapters run in fp16.
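For anyone curious what that composition looks like, here is a minimal, hypothetical sketch in plain PyTorch. It is not lmdeploy's actual AwqLoraLinear code; the class name QuantLinearWithLora, the base_quant_linear argument, and the rank/alpha parameters are illustrative assumptions. It shows the idea described above: the quantized base projection and the fp16 LoRA branch are computed separately on the same input and summed, so the adapter does not need to be re-trained against the AWQ weights.

# Illustrative sketch only -- NOT lmdeploy's actual AwqLoraLinear implementation.
import torch
import torch.nn as nn

class QuantLinearWithLora(nn.Module):
    """Hypothetical wrapper: a w4a16 base linear plus an fp16 LoRA branch."""

    def __init__(self, base_quant_linear, in_features, out_features, rank, alpha=16.0):
        super().__init__()
        # The base module is assumed to run the AWQ w4a16 kernel internally
        # (4-bit weights, fp16 activations).
        self.base = base_quant_linear
        # The LoRA weights stay in fp16, exactly as they were trained.
        self.lora_a = nn.Linear(in_features, rank, bias=False, dtype=torch.float16)
        self.lora_b = nn.Linear(rank, out_features, bias=False, dtype=torch.float16)
        self.scaling = alpha / rank

    def forward(self, x):
        # Base path: handled by the quantized kernel.
        y = self.base(x)
        # Adapter path: plain fp16 matmuls on the same input, added on top.
        return y + self.lora_b(self.lora_a(x)) * self.scaling

Because the adapter contribution is just an fp16 addition on top of the base output, nothing about the adapter depends on how the base weights are stored, which is why no AWQ-specific fine-tuning is needed.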
