
AWQ small batches optimization #1707

Open
zhyncs opened this issue Jun 4, 2024 · 4 comments

@zhyncs
Collaborator

zhyncs commented Jun 4, 2024

Motivation

Recently, Tsinghua University published a survey on LLM inference acceleration that compares TensorRT-LLM and LMDeploy under AWQ. According to the results, LMDeploy achieves higher speed-ups at large batch sizes, while TensorRT-LLM achieves higher speed-ups at small batch sizes. LMDeploy has so far focused on optimizing throughput, and in real online serving internet companies do care primarily about throughput under a latency budget. Still, improving the latency speed-up ratio for small batches could also be worthwhile. If interested, please take a look. Cheers. @lzhangzz @irexyc @lvhan028 @grimoire

https://arxiv.org/pdf/2404.14294



@lzhangzz
Collaborator

lzhangzz commented Jun 4, 2024

The second iteration of our mixed precision GEMM will be ready soon, stay tuned.

@zhyncs
Collaborator Author

zhyncs commented Jun 4, 2024

The second iteration of our mixed precision GEMM will be ready soon, stay tuned.

Cheers!

@coolhok

coolhok commented Jun 11, 2024

How about supporting group_size == 64, so that AWQ int4 models can be enabled with TP = 2?
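For context, a minimal sketch of why the group size interacts with tensor parallelism (a hypothetical helper, not LMDeploy code): AWQ stores one scale/zero-point pair per group of consecutive input channels, so after a tensor-parallel split each rank's shard of the input dimension must be tiled exactly by quantization groups. Some layer sizes split over TP = 2 are a multiple of 64 per rank but not of 128.

```python
# Hypothetical sketch (not LMDeploy code): AWQ group size vs. TP divisibility.
# AWQ keeps one (scale, zero-point) pair per `group_size` consecutive input
# channels, so each tensor-parallel shard of the input dimension must be an
# exact multiple of group_size.

def shard_supports_group_size(in_features: int, tp: int, group_size: int) -> bool:
    """Return True if `in_features`, split evenly over `tp` ranks, leaves
    each rank with an exact multiple of `group_size` input channels."""
    if in_features % tp != 0:
        return False
    return (in_features // tp) % group_size == 0

# Example: an intermediate size of 13696 split over TP = 2 gives 6848
# channels per rank: a multiple of 64, but not of 128.
print(shard_supports_group_size(13696, 2, 128))  # False
print(shard_supports_group_size(13696, 2, 64))   # True
```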

@zhyncs
Collaborator Author

zhyncs commented Jul 3, 2024

support group_size == 64

This requirement has also come up several times in other issues, and @lzhangzz's expectation was to support it around July. That could still change; after all, plans can't keep up with changes, and there are many higher-priority matters at hand right now.
