
Idea: Dequantize + Fused MoE #323

Closed · casper-hansen opened this issue Jan 26, 2024 · 3 comments

@casper-hansen (Owner)

Based on the amazing work by @zwd003 and @pcmoritz, a potential strategy for a speedup in AutoAWQ could be to run dequantization and then the fused MoE kernel. I have my doubts about whether this will give a speedup during decoding, but it's worth a shot since the Triton kernel is already developed and requires minimal effort to integrate.

Ideally, we would dequantize inside the Triton kernel itself to remove any additional overhead and make it as optimized as possible. However, that requires integrating the dequantization code with the fused MoE code. A minimal sketch of the simpler two-step variant follows the links below.

Fused MoE kernel: vllm-project/vllm#2542
AWQ Triton Dequantization: https://github.com/vllm-project/vllm/blob/qmm/vllm/model_executor/layers/quantization/ops/awq.py
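
To make the two-step idea concrete, here is a minimal PyTorch sketch. Everything in it is hypothetical (`dequantize_int4`, `moe_decode_step`, and the simplified nibble layout are illustrations, not vLLM's or AutoAWQ's actual APIs); the real kernels live at the links above.

```python
import torch

def dequantize_int4(qweight, scales, zeros, group_size=128):
    """Reference INT4 dequantization (hypothetical helper).

    The real AWQ layout interleaves nibbles within each int32; that
    reordering is glossed over here to keep the sketch short.

    qweight: (K, N // 8) int32 -- 8 packed 4-bit values per int32
    scales:  (K // group_size, N) fp16
    zeros:   (K // group_size, N) fp16 zero points (already unpacked)
    """
    shifts = torch.arange(0, 32, 4, dtype=torch.int32, device=qweight.device)
    w = (qweight.unsqueeze(-1) >> shifts) & 0xF           # (K, N//8, 8)
    w = w.reshape(qweight.shape[0], -1).to(scales.dtype)  # (K, N)
    g = torch.arange(w.shape[0], device=w.device) // group_size
    return (w - zeros[g]) * scales[g]

def moe_decode_step(x, qweights, scales, zeros, topk_weights, topk_ids):
    """Two-step MoE: dequantize each expert, then route tokens through
    plain GEMMs. A fused Triton kernel would instead unpack the weights
    inside its inner loop, in a single launch over all experts."""
    out = torch.zeros(x.shape[0], scales[0].shape[1],
                      dtype=x.dtype, device=x.device)
    for e in range(len(qweights)):
        w = dequantize_int4(qweights[e], scales[e], zeros[e])  # (K, N)
        rows, slots = (topk_ids == e).nonzero(as_tuple=True)
        if rows.numel():
            out[rows] += topk_weights[rows, slots, None] * (x[rows] @ w)
    return out
```

Note the downside this makes visible: every decode step materializes all expert weights in fp16, which is exactly the memory traffic an in-kernel dequant would avoid.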

@chu-tianxiang

Have you got any results on the performance? I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be a lot slower than the old vLLM implementation; end-to-end latency was over 30% worse at all batch sizes I tested.

@casper-hansen (Owner, Author)

> Have you got any results on the performance? I tried integrating the fused MoE Triton kernel with the AutoGPTQ Triton kernel yesterday, but it turned out to be a lot slower than the old vLLM implementation; end-to-end latency was over 30% worse at all batch sizes I tested.

The AutoGPTQ kernel is already pretty slow as-is. I have benchmarked its performance against the transformers version, which shows a 3-5x speedup depending on problem size.
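
For reference, comparisons like that are typically measured with a small wall-clock harness along these lines (a generic sketch, not the actual benchmark script used here):

```python
import time
import torch

def bench_ms(fn, *args, warmup=10, iters=100):
    """Average wall-clock milliseconds per call for a CUDA function."""
    for _ in range(warmup):          # let autotuning / caching settle
        fn(*args)
    torch.cuda.synchronize()         # drain queued kernels before timing
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    torch.cuda.synchronize()         # wait for the last kernel to finish
    return (time.perf_counter() - t0) / iters * 1e3
```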

I'm not sure I will have time to implement the kernel right now, but I'm very interested in including it in AutoAWQ in the future. Another candidate is the CUTLASS kernels that Woosuk is working on:
https://github.com/vllm-project/vllm/tree/cutlass-moe

@casper-hansen (Owner, Author)

@chu-tianxiang I'm closing this issue, as I have now implemented fused MoE based on your code in vLLM. A new issue has popped up that we can discuss separately: #341
