[Bug] Is acceleration unsupported for qwen0.5b? And AWQ quantization for qwen0.5b? #1870
Comments
Qwen2 0.5b is supported with the PyTorch engine. I tested it in a local env and observed a performance gap compared with vLLM. @grimoire, would you please take a look?
Turbomind does not support Qwen2 <= 1.8b, and AWQ for the PyTorch engine is WIP.
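For reference, a minimal sketch of switching the pipeline to the PyTorch engine, which per the comment above is the backend that covers Qwen2 0.5b; the model path is a placeholder, not from this issue:

import time
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

# PytorchEngineConfig selects the PyTorch engine instead of Turbomind,
# which does not support Qwen2 models <= 1.8b.
backend_config = PytorchEngineConfig()
llm = pipeline("path/to/qwen2-0.5b",  # placeholder path
               backend_config=backend_config)
print(llm(["Hello"], gen_config=GenerationConfig(max_new_tokens=32)))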
Makes sense, ref #1499 (comment)
@grimoire By the way, is CUDA graph support a lower priority than torch.compile? The latter is currently the main recommended optimization path for native PyTorch.
PyTorch does not support using custom Triton kernels in torch.compile before 2.3.0. I will do some investigation on this.
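For context, a minimal sketch of the torch >= 2.3 capability mentioned above: a user-defined Triton kernel traced through torch.compile. The kernel and wrapper names are illustrative, and the example requires CUDA tensors:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def triton_add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# On torch >= 2.3 the custom Triton kernel is traced into the compiled
# graph; on earlier versions torch.compile cannot handle it.
compiled_add = torch.compile(triton_add)
a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
assert torch.allclose(compiled_add(a, b), a + b)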
Checklist
Describe the bug
Is acceleration unsupported for qwen0.5b? And AWQ quantization for qwen0.5b?
Inference latency of qwen0.5b on a single T4:
vllm: 1.3s
lmdeploy: 3.2s
Reproduction
import time
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

model_path = ...  # local path or hub id of the qwen0.5b model
prompts = ...     # list of prompt strings

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.9)
gen_config = GenerationConfig(top_p=0.95,
                              temperature=0.001,
                              max_new_tokens=512)
llm = pipeline(model_path, backend_config=backend_config)

arrival_time = time.time()
llm_results = llm(prompts, gen_config=gen_config)
finished_time = time.time()
Environment
Error traceback
No response