
[Bug] Is acceleration not supported for Qwen 0.5B? And what about AWQ quantization for Qwen 0.5B? #1870

Open

qism opened this issue Jun 27, 2024 · 5 comments

Comments

@qism

qism commented Jun 27, 2024

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.

Describe the bug

Is acceleration not supported for Qwen 0.5B? And what about AWQ quantization for Qwen 0.5B?
Inference latency of Qwen 0.5B on a single T4:
vLLM: 1.3 s
LMDeploy: 3.2 s

Reproduction

import time

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(cache_max_entry_count=0.9)
gen_config = GenerationConfig(top_p=0.95,
                              temperature=0.001,
                              max_new_tokens=512)

llm = pipeline(model_path,  # model_path: local path to the Qwen 0.5B model
               backend_config=backend_config)

arrival_time = time.time()
llm_results = llm(prompts, gen_config=gen_config)  # prompts: list of input strings
finished_time = time.time()
print(f'latency: {finished_time - arrival_time:.2f}s')

Environment

lmdeploy 0.4.2

Error traceback

No response

@zhyncs
Collaborator

zhyncs commented Jul 1, 2024

Qwen2 0.5B is supported with the PyTorch Engine. I tested it in a local environment and hit the performance issue compared with vLLM. @grimoire may you please take a look.
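
For completeness, a minimal sketch of the same reproduction on the PyTorch engine, assuming lmdeploy's PytorchEngineConfig accepts the same cache_max_entry_count knob (model_path and prompts as in the original report):

import time

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

# Select the PyTorch engine instead of Turbomind.
backend_config = PytorchEngineConfig(cache_max_entry_count=0.9)
gen_config = GenerationConfig(top_p=0.95, temperature=0.001, max_new_tokens=512)

llm = pipeline(model_path, backend_config=backend_config)

start = time.time()
results = llm(prompts, gen_config=gen_config)
print(f'latency: {time.time() - start:.2f}s')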

@grimoire
Collaborator

grimoire commented Jul 1, 2024

Turbomind does not support Qwen2 <= 1.8B, and AWQ for the PyTorch engine is a work in progress.
The problem is that Qwen2 0.5B doesn't have enough GPU computation to hide the kernel-launch overhead. CUDA Graph might be a way forward, but all models and kernels would need to be redesigned to support it. Not to mention that we also plan to support non-NVIDIA devices.
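
For context, a minimal plain-PyTorch sketch of the CUDA-graph idea (not lmdeploy's implementation): capture a fixed sequence of kernels once, then replay the whole sequence with a single launch. Inputs and shapes must stay static, which is why existing models and kernels would need a redesign.

import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_in = torch.zeros(8, 1024, device='cuda')

# Warm up on a side stream before capture, as CUDA graphs require.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once ...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# ... then replay it: new inputs are copied into the static buffer and
# all captured kernels are relaunched as one unit.
static_in.copy_(torch.randn(8, 1024, device='cuda'))
g.replay()  # static_out now holds the result for the new input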

@zhyncs
Collaborator

zhyncs commented Jul 1, 2024

> Turbomind does not support Qwen2 <= 1.8B, and AWQ for the PyTorch engine is a work in progress. [...]

Makes sense, ref #1499 (comment).
And I am really looking forward to running the PyTorch Engine on AMD GPUs.

@zhyncs
Collaborator

zhyncs commented Jul 1, 2024

@grimoire By the way, is CUDA Graph support a lower priority than torch.compile? The latter is currently the main optimization method recommended for native PyTorch.

@grimoire
Collaborator

grimoire commented Jul 1, 2024

PyTorch did not support using custom Triton kernels inside torch.compile before 2.3.0. I will do some investigation on this.
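
For reference, torch.compile's "reduce-overhead" mode uses CUDA graphs under the hood, which is why the two options overlap; a minimal sketch on a toy module (not lmdeploy code):

import torch

model = torch.nn.Linear(1024, 1024).cuda()

# mode="reduce-overhead" enables CUDA graphs inside the compiled region,
# targeting the same kernel-launch overhead discussed above.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device='cuda')
y = compiled(x)  # first call compiles; subsequent calls replay cheaply

As noted above, starting with PyTorch 2.3.0, user-defined Triton kernels can also be used inside torch.compile'd functions.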
