This requirement has also been mentioned several times in the discussions of other issues, and @lzhangzz's expectation was to support it around July. That could still change; after all, plans cannot keep up with changes, and there are many higher-priority matters at hand right now.
Motivation
Recently, Tsinghua University published a survey on LLM inference acceleration that compares TensorRT-LLM and LMDeploy under AWQ quantization. According to its results, LMDeploy achieves higher speed-ups for large batches, while TensorRT-LLM achieves higher speed-ups for small batches. So far, LMDeploy has focused on optimizing throughput, but in actual online serving, internet companies also care about throughput under a latency constraint. Therefore, optimizing the latency speed-up for small batches might also be meaningful. If interested, you may take a look. Cheers. @lzhangzz @irexyc @lvhan028 @grimoire
https://arxiv.org/pdf/2404.14294
Related resources
No response
Additional context
No response