
[Bug] AWQ Model Fails Loading Adapter #1915

Open
1 of 2 tasks
vladrad opened this issue Jul 3, 2024 · 4 comments

@vladrad

vladrad commented Jul 3, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

When running the repo example, I chose the YurtsAI/Meta-Llama-3-8B-Instruct-AWQ model with the traderpedroso/llama3-8b-lora adapter.

I know the adapter was trained on the 4-bit base model; I'm not sure whether this works with AWQ.

    self.engine = Engine(model_path=model_path,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 153, in __init__
    _paging_adapters(adapters,
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 68, in _paging_adapters
    model_agent.paging_adapters(weight_maps)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 715, in paging_adapters
    weight_map.cache_adapter(lora_linears, cpu_caches)
  File "/home/merlin/code/kreacher/venv/lib/python3.10/site-packages/lmdeploy/pytorch/adapter/adapter.py", line 226, in cache_adapter
    assert len(lora_linears) == len(caches), (
AssertionError: len(lora_linears) == len(caches)

If I comment out the len(lora_linears) == len(caches) assertion, the adapter merges... but I'm not sure whether it is supposed to work that way or not.

Reproduction

My script:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

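# 'lora_name_1' is the adapter name that must be passed as adapter_name
# when calling the pipeline below.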
backend_config = PytorchEngineConfig(session_len=2048,
                                     adapters=dict(lora_name_1='traderpedroso/llama3-8b-lora'))
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('YurtsAI/Meta-Llama-3-8B-Instruct-AWQ',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': '您猜怎么着'
}]]
response = pipe(prompts, gen_config=gen_config, adapter_name='lora_name_1')
print(response)

Environment

Running latest version of LMDeploy.

Error traceback

No response

@lvhan028
Collaborator

lvhan028 commented Jul 4, 2024

4-bit inference in the PyTorch engine is still under development. We are implementing support for 4-bit quantized models (the AWQ quantization method) in the PyTorch engine (#1913). Stay tuned.

@vladrad
Author

vladrad commented Jul 4, 2024

Wow, you all are fast.

@vladrad
Author

vladrad commented Jul 4, 2024

Let me know if I can help out. I'd be happy to test; I'm also capable of coding, but this area is not my expertise :D. So would this mean any LoRA adapter should be able to mount on top of an AWQ quant model? Or do I need to fine-tune on an AWQ model? It seems like the LoRA adapter would just be mounted on top.

You all are amazing.

@grimoire
Collaborator

grimoire commented Jul 8, 2024

#1913

PyTorchEngine uses AwqLoraLinear. Adapters can be applied to an AWQ model without fine-tuning: the base linear is forwarded with w4a16 support while the adapters run in fp16.
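For anyone curious what that composition looks like, here is a minimal, hypothetical sketch in plain PyTorch. It is not lmdeploy's actual AwqLoraLinear code; the class name QuantLinearWithLora, the base_quant_linear argument, and the rank/alpha parameters are illustrative assumptions. It shows the idea described above: the quantized base projection and the fp16 LoRA branch are computed separately on the same input and summed, so the adapter does not need to be re-trained against the AWQ weights.

# Illustrative sketch only -- NOT lmdeploy's actual AwqLoraLinear implementation.
import torch
import torch.nn as nn

class QuantLinearWithLora(nn.Module):
    """Hypothetical wrapper: a w4a16 base linear plus an fp16 LoRA branch."""

    def __init__(self, base_quant_linear, in_features, out_features, rank, alpha=16.0):
        super().__init__()
        # The base module is assumed to run the AWQ w4a16 kernel internally
        # (4-bit weights, fp16 activations).
        self.base = base_quant_linear
        # The LoRA weights stay in fp16, exactly as they were trained.
        self.lora_a = nn.Linear(in_features, rank, bias=False, dtype=torch.float16)
        self.lora_b = nn.Linear(rank, out_features, bias=False, dtype=torch.float16)
        self.scaling = alpha / rank

    def forward(self, x):
        # Base path: handled by the quantized kernel.
        y = self.base(x)
        # Adapter path: plain fp16 matmuls on the same input, added on top.
        return y + self.lora_b(self.lora_a(x)) * self.scaling

Because the adapter contribution is just an fp16 addition on top of the base output, nothing about the adapter depends on how the base weights are stored, which is why no AWQ-specific fine-tuning is needed.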
