
Is there any way to infer an AWQ Marlin model? #26

Open

DeJoker opened this issue Jul 23, 2024 · 1 comment

DeJoker commented Jul 23, 2024

First, thanks for this awesome work with the Marlin kernel. Currently I can't find a way to run inference on an awq_marlin model; I need help.

Quantization

I quantized Qwen2-72B with:
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
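
(For context, the surrounding script is presumably the standard AutoAWQ quantization flow from the repo's examples; the model and output paths below are placeholders:)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-72B"        # placeholder
quant_path = "qwen2-72b-awq-marlin"  # placeholder output directory

quant_config = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}

# Load the FP16 model, quantize it with the config above, and save the result.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)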

Diffing against exports from the other versions, I found that model.layers.0.self_attn.q_proj.qzeros does not exist in this one.
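
(One way to verify which tensors an export contains is to list the checkpoint keys; a minimal sketch, where the shard filename is hypothetical and should be replaced with a real file from the output directory:)

from safetensors import safe_open

with safe_open("model-00001-of-00037.safetensors", framework="pt", device="cpu") as f:
    print([k for k in f.keys() if "layers.0.self_attn.q_proj" in k])
# Per the observation above, the Marlin export lists qweight and scales
# for this layer but no qzeros, unlike exports from the other versions.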

Inference

With vLLM

vllm-project/vllm#6612
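
(Building vLLM from the current main, as described next, presumably means the standard editable install; the exact commit isn't given:)

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .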
I built vLLM from the current main source and got an error; with debugpy I traced it to the layer model.layers.0.self_attn.q_proj.qzeros:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 87, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/model_runner.py", line 681, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/loader.py", line 278, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/models/qwen2.py", line 392, in load_weights
[rank0]:     weight_loader(param, loaded_weight)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/layers/linear.py", line 758, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(input_dim, start_idx,
[rank0]: RuntimeError: start (0) + length (29568) exceeds dimension size (1848).
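
(The crash itself is a plain bounds error from Tensor.narrow while the loader shards the weight. Note that 29568 = 16 × 1848, which suggests the loader expected an unpacked GEMM-style qzeros tensor but received the Marlin packing instead; that reading is an inference, not confirmed in the thread. A minimal repro of the exception:)

import torch

loaded_weight = torch.empty(1848, 8)  # dim 0 is 16x smaller than the loader expects
loaded_weight.narrow(0, 0, 29568)     # RuntimeError: start (0) + length (29568)
                                      # exceeds dimension size (1848)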

With the official AutoAWQ demo

https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#transformers

On the first run I had to modify the code, because there is no qzeros layer:

# awq/utils/fused_utils.py:155
        # qzeros does not exist in the Marlin export (zero_point=False),
        # so only delete the attribute when it is present.
        del (layer.qweight, layer.scales)
        if hasattr(layer, "qzeros"):
            del layer.qzeros

Next, I got this error:

AssertionError: Marlin kernels are not installed. Please install AWQ compatible Marlin kernels from AutoAWQ_kernels.

I cannot import marlin_cuda; there is no such file in this repo. But I did find one in the GPTQ project:
https://github.com/AutoGPTQ/AutoGPTQ/blob/main/autogptq_extension/marlin/marlin_cuda.cpp
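
(The assertion text points at the separate AutoAWQ_kernels repo, https://github.com/casper-hansen/AutoAWQ_kernels; assuming its wheels are published under the repo's name, installing them would presumably be:)

pip install autoawq-kernels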

In any case, I'd like to know a way to run the model.

@casper-hansen (Owner)

If you install the Marlin kernels, you should be able to run inference in AutoAWQ. Otherwise, I would advise quantizing with GEMM, since vLLM 0.5.3 now automatically maps that format to the optimized Marlin kernels.

https://github.com/IST-DASLab/marlin
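
(A sketch of the suggested path, assuming the usual GEMM settings from the AutoAWQ examples and a placeholder output path: requantize with the GEMM config, then serve the result with vLLM >= 0.5.3, which should remap the weights onto the Marlin kernels automatically:)

# Same bits and group size as before, but asymmetric quantization and the GEMM layout.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# After quantizing and saving as in the script above:
#   python -m vllm.entrypoints.openai.api_server --model qwen2-72b-awq-gemm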
