
Idea: Dequantize + Fused MoE #323

Closed · casper-hansen opened this issue Jan 26, 2024 · 3 comments

@casper-hansen (Owner)

Based on the amazing work by @zwd003 and @pcmoritz, a potential strategy for a speedup in AutoAWQ could be to run dequantization and then the fused MoE kernel. I have my doubts about whether this will give a speedup during decoding, but it's worth a shot since the Triton kernel is already developed and requires minimal effort to integrate.

Ideally, we would dequantize inside the Triton kernel itself to remove any additional overhead and make it as optimized as possible. However, that requires integrating the dequantization code with the fused MoE code. A minimal sketch of the simpler two-step variant follows the links below.

Fused MoE kernel: vllm-project/vllm#2542
AWQ Triton Dequantization: https://github.com/vllm-project/vllm/blob/qmm/vllm/model_executor/layers/quantization/ops/awq.py
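
To make the two-step idea concrete, here is a minimal PyTorch sketch. Everything in it is hypothetical (`dequantize_int4`, `moe_decode_step`, and the simplified nibble layout are illustrations, not vLLM's or AutoAWQ's actual APIs); the real kernels live at the links above.

```python
import torch

def dequantize_int4(qweight, scales, zeros, group_size=128):
    """Reference INT4 dequantization (hypothetical helper).

    The real AWQ layout interleaves nibbles within each int32; that
    reordering is glossed over here to keep the sketch short.

    qweight: (K, N // 8) int32 -- 8 packed 4-bit values per int32
    scales:  (K // group_size, N) fp16
    zeros:   (K // group_size, N) fp16 zero points (already unpacked)
    """
    shifts = torch.arange(0, 32, 4, dtype=torch.int32, device=qweight.device)
    w = (qweight.unsqueeze(-1) >> shifts) & 0xF           # (K, N//8, 8)
    w = w.reshape(qweight.shape[0], -1).to(scales.dtype)  # (K, N)
    g = torch.arange(w.shape[0], device=w.device) // group_size
    return (w - zeros[g]) * scales[g]

def moe_decode_step(x, qweights, scales, zeros, topk_weights, topk_ids):
    """Two-step MoE: dequantize each expert, then route tokens through
    plain GEMMs. A fused Triton kernel would instead unpack the weights
    inside its inner loop, in a single launch over all experts."""
    out = torch.zeros(x.shape[0], scales[0].shape[1],
                      dtype=x.dtype, device=x.device)
    for e in range(len(qweights)):
        w = dequantize_int4(qweights[e], scales[e], zeros[e])  # (K, N)
        rows, slots = (topk_ids == e).nonzero(as_tuple=True)
        if rows.numel():
            out[rows] += topk_weights[rows, slots, None] * (x[rows] @ w)
    return out
```

Note the downside this makes visible: every decode step materializes all expert weights in fp16, which is exactly the memory traffic an in-kernel dequant would avoid.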

@chu-tianxiang

Have you got any results on the performance? I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be a lot slower than the old vLLM implementation; end-to-end latency was over 30% worse at all batch sizes I tested.

@casper-hansen (Owner, Author)

> Have you got any results on the performance? I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be a lot slower than the old vLLM implementation; end-to-end latency was over 30% worse at all batch sizes I tested.

The AutoGPTQ kernel is already pretty slow as-is. I have benchmarked its performance against the transformers version, which shows a 3-5x speedup depending on problem size.
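
For reference, comparisons like that are typically measured with a small wall-clock harness along these lines (a generic sketch, not the actual benchmark script used here):

```python
import time
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    """Average wall-clock milliseconds per call for a CUDA function."""
    for _ in range(warmup):          # let autotuning / caching settle
        fn(*args)
    torch.cuda.synchronize()         # drain queued kernels before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()         # wait for the last kernel to finish
    return (time.perf_counter() - t0) / iters * 1e3
```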

I'm not sure I will have time to implement the kernel right now, but I'm very interested in including it in AutoAWQ in the future. Another candidate is the CUTLASS kernels that Woosuk is working on:
https://github.com/vllm-project/vllm/tree/cutlass-moe

@casper-hansen (Owner, Author)

@chu-tianxiang I'm closing this issue, as I have now implemented fused MoE based on your code in vLLM. A new issue has popped up that we can discuss separately: #341
