[Bug] Is acceleration unsupported for qwen0.5b? And AWQ quantization for qwen0.5b? #1870
Comments
Qwen2 0.5b is supported with the PyTorch engine. I tested it in a local env and observed a performance gap compared with vLLM. @grimoire, would you please take a look?
Turbomind does not support Qwen2 <= 1.8b, and AWQ for the PyTorch engine is WIP.
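For reference, a minimal sketch of switching the pipeline to the PyTorch engine, which per the comment above is the backend that covers Qwen2 0.5b; the model path is a placeholder, not from this issue:

import time
from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

# PytorchEngineConfig selects the PyTorch engine instead of Turbomind,
# which does not support Qwen2 models <= 1.8b.
backend_config = PytorchEngineConfig()
llm = pipeline("path/to/qwen2-0.5b",  # placeholder path
               backend_config=backend_config)
print(llm(["Hello"], gen_config=GenerationConfig(max_new_tokens=32)))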
Makes sense, ref #1499 (comment)
@grimoire By the way, is CUDA graph support a lower priority than torch.compile? The latter is currently the main recommended optimization path for native PyTorch.
PyTorch does not support using custom Triton kernels in torch.compile before 2.3.0. I will do some investigation on this.
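For context, a minimal sketch of the torch >= 2.3 capability mentioned above: a user-defined Triton kernel traced through torch.compile. The kernel and wrapper names are illustrative, and the example requires CUDA tensors:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def triton_add(x, y):
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

# On torch >= 2.3 the custom Triton kernel is traced into the compiled
# graph; on earlier versions torch.compile cannot handle it.
compiled_add = torch.compile(triton_add)
a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
assert torch.allclose(compiled_add(a, b), a + b)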
Checklist
Describe the bug
Is acceleration unsupported for qwen0.5b? And AWQ quantization for qwen0.5b?
Inference latency of qwen0.5b on a single T4:
vllm: 1.3s
lmdeploy: 3.2s
Reproduction
import time
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

model_path = ...  # local path or hub id of the qwen0.5b model
prompts = ...     # list of prompt strings

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.9)
gen_config = GenerationConfig(top_p=0.95,
                              temperature=0.001,
                              max_new_tokens=512)
llm = pipeline(model_path, backend_config=backend_config)

arrival_time = time.time()
llm_results = llm(prompts, gen_config=gen_config)
finished_time = time.time()
Environment
Error traceback
No response