
Is there any way to infer an AWQ Marlin model? #26

Open

DeJoker opened this issue Jul 23, 2024 · 1 comment

DeJoker commented Jul 23, 2024

First, thanks for this awesome work with the Marlin kernel. Currently I can't find a way to run inference on an awq_marlin model; I need help.

Quantization

I quantized Qwen2-72B with:
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
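
(For context, the surrounding script is presumably the standard AutoAWQ quantization flow from the repo's examples; the model and output paths below are placeholders:)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-72B"        # placeholder
quant_path = "qwen2-72b-awq-marlin"  # placeholder output directory

quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

# Load the FP16 model, quantize it with the config above, and save the result.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)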

Diffing against exports from the other versions, I found that model.layers.0.self_attn.q_proj.qzeros does not exist in this one.
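
(One way to verify which tensors an export contains is to list the checkpoint keys; a minimal sketch, where the shard filename is hypothetical and should be replaced with a real file from the output directory:)

from safetensors import safe_open

with safe_open("model-00001-of-00037.safetensors", framework="pt", device="cpu") as f:
    print([k for k in f.keys() if "layers.0.self_attn.q_proj" in k])
# Per the observation above, the Marlin export lists qweight and scales
# for this layer but no qzeros, unlike exports from the other versions.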

Inference

With vLLM

vllm-project/vllm#6612
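
(Building vLLM from the current main, as described next, presumably means the standard editable install; the exact commit isn't given:)

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .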
I built vLLM from the current main source and got an error; with debugpy I traced it to the layer model.layers.0.self_attn.q_proj.qzeros:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 87, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/model_runner.py", line 681, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/loader.py", line 278, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/models/qwen2.py", line 392, in load_weights
[rank0]:     weight_loader(param, loaded_weight)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/layers/linear.py", line 758, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(input_dim, start_idx,
[rank0]: RuntimeError: start (0) + length (29568) exceeds dimension size (1848).
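
(The crash itself is a plain bounds error from Tensor.narrow while the loader shards the weight. Note that 29568 = 16 × 1848, which suggests the loader expected an unpacked GEMM-style qzeros tensor but received the Marlin packing instead; that reading is an inference, not confirmed in the thread. A minimal repro of the exception:)

import torch

loaded_weight = torch.empty(1848, 8)  # dim 0 is 16x smaller than the loader expects
loaded_weight.narrow(0, 0, 29568)     # RuntimeError: start (0) + length (29568)
                                      # exceeds dimension size (1848)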

With the official AutoAWQ demo

https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#transformers

On the first run I had to modify the code, because there is no qzeros layer:

# awq/utils/fused_utils.py:155
        # qzeros does not exist in the Marlin export (zero_point=False),
        # so only delete the attribute when it is present.
        del (layer.qweight, layer.scales)
        if hasattr(layer, "qzeros"):
            del layer.qzeros

Next, I got this error:

AssertionError: Marlin kernels are not installed. Please install AWQ compatible Marlin kernels from AutoAWQ_kernels.

I cannot import marlin_cuda; there is no such file in this repo. But I did find one in the GPTQ project:
https://github.com/AutoGPTQ/AutoGPTQ/blob/main/autogptq_extension/marlin/marlin_cuda.cpp
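
(The assertion text points at the separate AutoAWQ_kernels repo, https://github.com/casper-hansen/AutoAWQ_kernels; assuming its wheels are published under the repo's name, installing them would presumably be:)

pip install autoawq-kernels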

In any case, I'd like to know a way to run the model.

@casper-hansen (Owner)

If you install the Marlin kernels, you should be able to run inference in AutoAWQ. Otherwise, I would advise quantizing with GEMM, since vLLM 0.5.3 now automatically maps that format to the optimized Marlin kernels.

https://github.com/IST-DASLab/marlin
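
(A sketch of the suggested path, assuming the usual GEMM settings from the AutoAWQ examples and a placeholder output path: requantize with the GEMM config, then serve the result with vLLM >= 0.5.3, which should remap the weights onto the Marlin kernels automatically:)

# Same bits and group size as before, but asymmetric quantization and the GEMM layout.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# After quantizing and saving as in the script above:
#   python -m vllm.entrypoints.openai.api_server --model qwen2-72b-awq-gemm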
