
AWQ: Implement new kernels (64% faster decoding) #3025

Open
casper-hansen opened this issue Feb 24, 2024 · 5 comments
Comments

@casper-hansen
Contributor

According to my testing, it's possible to get even faster decoding than with the ExLlamaV2 kernels. Prefill speed is roughly the same as with the current GEMM kernels (including the dequantize + torch.matmul trick).
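For context, the "dequantize + torch.matmul trick" means expanding the group-quantized 4-bit weights back to floating point and running a plain dense matmul instead of a fused quantized kernel; at large batch sizes (prefill), the dense GEMM is typically at least as fast. Below is a minimal NumPy sketch of the idea with made-up shapes and a toy asymmetric quantizer — the real AWQ kernels are fused CUDA code operating on packed INT4, so this only illustrates the math:

```python
import numpy as np

GROUP_SIZE = 128  # AWQ commonly quantizes in groups of 128 along the input dim

def dequantize(qweight, scales, zeros, group_size=GROUP_SIZE):
    """Expand group-wise 4-bit weights (values 0..15) back to float32."""
    groups = np.repeat(np.arange(qweight.shape[0] // group_size), group_size)
    return (qweight.astype(np.float32) - zeros[groups]) * scales[groups]

rng = np.random.default_rng(0)
in_features, out_features = 256, 64
w = rng.standard_normal((in_features, out_features)).astype(np.float32)

# Toy asymmetric 4-bit group quantization (per-group min/max)
w_g = w.reshape(in_features // GROUP_SIZE, GROUP_SIZE, out_features)
scales = (w_g.max(axis=1) - w_g.min(axis=1)) / 15.0
zeros = -w_g.min(axis=1) / scales
groups = np.repeat(np.arange(in_features // GROUP_SIZE), GROUP_SIZE)
qweight = np.clip(np.round(w / scales[groups] + zeros[groups]), 0, 15)

# The "trick": dequantize once, then use a plain dense matmul
x = rng.standard_normal((4, in_features)).astype(np.float32)
y = x @ dequantize(qweight, scales, zeros)
```

The per-element reconstruction error is bounded by half the group's scale, which is why the dense-matmul path gives essentially the same outputs as a fused quantized GEMM.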

Reference:
casper-hansen/AutoAWQ#365

@simon-mo
Collaborator

PR welcomed! (or are there existing ones with ExLlamaV2?)

@casper-hansen
Contributor Author

PR welcomed! (or are there existing ones with ExLlamaV2?)

This is not about ExLlamaV2; my PR was just showcasing 64% faster decoding at batch size 32.

I am first looking to distribute models on HF before making any PR myself. This is essentially AWQ kernels version 2.0.

@isaac-vidas

isaac-vidas commented Feb 28, 2024

Would be great to be able to load these new AWQ models in vLLM.
I tried a quantized version of LLaVA 1.5 with the demo at https://github.com/mit-han-lab/llm-awq and the improvement is substantial.

@casper-hansen are there any pointers on how to load these new quantized models after converting the checkpoint to HF models? Perhaps others can contribute as well.

@hmellor
Collaborator

hmellor commented Sep 20, 2024

@casper-hansen is #3289 the PR implementing what you mention in this issue?

@casper-hansen
Contributor Author

Yeah, but I couldn't get it to work in vLLM. It just generates empty output, so I abandoned it for now.


4 participants