AWQ: Implement new kernels (64% faster decoding) #3025
Comments
PR welcomed! (or are there existing ones with ExLlamaV2?)
This is not about ExLlamaV2 - my PR was just showcasing 64% faster decoding at batch size 32. I am first looking to distribute models on HF before making any PR myself. This is essentially AWQ kernels version 2.0.
It would be great to be able to load these new AWQ models in vLLM. @casper-hansen, are there any pointers on how to load these new quantized models after converting the checkpoint to HF format? Perhaps others can contribute as well.
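For reference, a minimal sketch of how an AWQ checkpoint from the HF Hub is typically loaded with vLLM's offline `LLM` API today; the model name is a placeholder, and checkpoints produced with the new kernels may need additional config fields once they are supported:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; swap in whichever quantized model you converted.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",   # selects vLLM's AWQ linear kernels
    dtype="float16",
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does AWQ quantize?"], sampling)
print(outputs[0].outputs[0].text)
```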
@casper-hansen is #3289 the PR implementing what you mention in this issue?
Yeah, but I couldn't get it to work in vLLM. It just generates empty output, so I abandoned it for now.
According to my testing, it's possible to get even faster decoding than with the ExLlamaV2 kernels. The prefilling speed is roughly the same as the current GEMM kernels (including the dequantize + torch.matmul trick; a simplified sketch of that trick follows the reference below).
Reference:
casper-hansen/AutoAWQ#365
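For anyone unfamiliar with the "dequantize + torch.matmul trick", here is a simplified illustration of the idea (my own sketch, not AutoAWQ's actual code; the packing layout below assumes plain int4 codes with per-group scales and zeros rather than AWQ's real bit-packed format):

```python
import torch

def dequantize_groupwise(qweight, scales, zeros, group_size=128):
    # qweight: [in_features, out_features] integer codes in [0, 15]
    # scales / zeros: [in_features // group_size, out_features] fp16 per-group params
    in_features = qweight.shape[0]
    s = scales.repeat_interleave(group_size, dim=0)[:in_features]
    z = zeros.repeat_interleave(group_size, dim=0)[:in_features]
    # Standard affine dequantization: w = (q - z) * s
    return (qweight.to(scales.dtype) - z) * s

def awq_linear_prefill(x, qweight, scales, zeros, group_size=128):
    # x: [num_prompt_tokens, in_features] fp16 activations.
    # During prefill many tokens arrive at once, so dequantizing the weight a
    # single time and running one dense fp16 GEMM amortizes the unpacking cost,
    # which is why prefill speed roughly matches the quantized GEMM kernels.
    w = dequantize_groupwise(qweight, scales, zeros, group_size)
    return torch.matmul(x, w)
```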