AWQ: Implement new kernels (64% faster decoding) #3025
Comments
PR welcomed! (or are there existing ones with ExLlamaV2?)
This is not about ExLlamaV2 - my PR was just showcasing 64% faster decoding at batch size 32. I am first looking to distribute models on HF before making any PR myself. This is essentially AWQ kernels version 2.0.
It would be great to be able to load these new AWQ models in vLLM. @casper-hansen, are there any pointers on how to load these new quantized models after converting the checkpoint to HF format? Perhaps others can contribute as well.
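For reference, a minimal sketch of how an AWQ checkpoint from the HF Hub is typically loaded with vLLM's offline `LLM` API today; the model name is a placeholder, and checkpoints produced with the new kernels may need additional config fields once they are supported:

```python
from vllm import LLM, SamplingParams

# Placeholder AWQ checkpoint; swap in whichever quantized model you converted.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",   # selects vLLM's AWQ linear kernels
    dtype="float16",
)

sampling = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does AWQ quantize?"], sampling)
print(outputs[0].outputs[0].text)
```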
@casper-hansen is #3289 the PR implementing what you mention in this issue?
Yeah, but I couldn't get it to work in vLLM. It just generates empty output, so I abandoned it for now.
According to my testing, it's possible to get even faster decoding than with the ExLlamaV2 kernels. The prefilling speed is roughly the same as the current GEMM kernels (including the dequantize + torch.matmul trick; a simplified sketch of that trick follows the reference below).
Reference:
casper-hansen/AutoAWQ#365
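For anyone unfamiliar with the "dequantize + torch.matmul trick", here is a simplified illustration of the idea (my own sketch, not AutoAWQ's actual code; the packing layout below assumes plain int4 codes with per-group scales and zeros rather than AWQ's real bit-packed format):

```python
import torch

def dequantize_groupwise(qweight, scales, zeros, group_size=128):
    # qweight: [in_features, out_features] integer codes in [0, 15]
    # scales / zeros: [in_features // group_size, out_features] fp16 per-group params
    in_features = qweight.shape[0]
    s = scales.repeat_interleave(group_size, dim=0)[:in_features]
    z = zeros.repeat_interleave(group_size, dim=0)[:in_features]
    # Standard affine dequantization: w = (q - z) * s
    return (qweight.to(scales.dtype) - z) * s

def awq_linear_prefill(x, qweight, scales, zeros, group_size=128):
    # x: [num_prompt_tokens, in_features] fp16 activations.
    # During prefill many tokens arrive at once, so dequantizing the weight a
    # single time and running one dense fp16 GEMM amortizes the unpacking cost,
    # which is why prefill speed roughly matches the quantized GEMM kernels.
    w = dequantize_groupwise(qweight, scales, zeros, group_size)
    return torch.matmul(x, w)
```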