diff --git a/docs/blog/posts/amd-mi300x-inference-benchmark.md b/docs/blog/posts/amd-mi300x-inference-benchmark.md
index ef7021ae0..4f3f588cf 100644
--- a/docs/blog/posts/amd-mi300x-inference-benchmark.md
+++ b/docs/blog/posts/amd-mi300x-inference-benchmark.md
@@ -154,14 +154,14 @@
 or maintain Time to First Token (TTFT). At 1 RPS, vLLM performs slightly better than TGI. However, between 2 and 4
 RPS, TGI outperforms vLLM in both throughput and TTFT.
 
-Notably, TGI begins to drop requests once it reaches 5 RPS.
+> Notably, TGI begins to drop requests once it reaches 5 RPS.
 
 We repeated the test using a higher number of requests, ranging from 300 to 900.
 
-At 900 requests with a rate of 3 requests per second (RPS), TGI dropped a majority of the requests. However, its
-performance improved notably when the number of requests was below 900.
+> At 900 requests at a rate of 3 RPS, TGI dropped the majority of requests. However, its performance improved
+> notably when the number of requests was below 900.
@@ -176,22 +176,26 @@
 This difference may be related to how vLLM [pre-allocates GPU cache :material-ar
 
 ## Conclusion
 
-- TGI is highly efficient at handling medium to high workloads. In our tests on 8x AMD MI300X GPU, medium workloads
-  are defined as RPS between 2 and 4. In these cases, it delivers faster time to first token (TTFT) and higher
-  throughput.
-- Conversely, vLLM works well with lower RPS but struggles to scale, making it less ideal for more demanding workloads.
-- TGI's edge comes from
-  its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"}, which dynamically modifies batch sizes to optimize GPU usage.
+1. For small sequence lengths, starting with a batch size of 64, TGI significantly outperforms vLLM in both throughput and TTFT.
+2. For larger sequence lengths, TGI outperforms vLLM even more in both throughput and TTFT, with the difference increasing as the batch size grows.
+3. At higher request rates, TGI continues to outperform vLLM, likely due to its more efficient request batching.
+
+!!! info "Limitations"
+    * In certain circumstances (e.g., at higher request rates), TGI dropped requests for unknown reasons, making it
+      impossible to accurately track throughput and TTFT.
+    * With vLLM, we used the default backend configuration. With better tuning, we might have achieved improved
+      performance.
 
-To gain a more complete understanding of the performance potential, a wider variety of backend configurations should be tested.
+In general, the 8x AMD MI300X setup is a good fit for larger models, allowing us to make the most of its VRAM,
+especially with larger batches.
+
+If you’d like to support us in doing more benchmarks, please let us know.
 
 ## What's next?
 
 While we wait for AMD to announce new GPUs and for data centers to offer them, we’re considering tests with NVIDIA GPUs
-like the H100 and H200, and possibly Google TPU.
+like the H100 and H200, as well as possibly Google TPUs.
 
-If you’d like to support us in doing more benchmarks, please let us know.
+> As a next step, we also plan to measure how the FP8 version of the model performs on this hardware.
 
 ### Source code
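To make the request-rate and TTFT discussion above more concrete, here is a minimal sketch of a client-side measurement: it sends streaming requests at a fixed RPS to an OpenAI-compatible `/v1/chat/completions` endpoint (both vLLM and TGI expose one) and records the time to the first streamed chunk. This is not the benchmark's actual harness; the endpoint URL, model name, prompt, and request counts are placeholders.

```python
# Hypothetical sketch (not the benchmark's actual harness): send streaming requests
# at a fixed rate against an OpenAI-compatible /v1/chat/completions endpoint and
# record time to first token (TTFT) per request. URL, model, and prompt are placeholders.
import asyncio
import time

import httpx

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder endpoint
MODEL = "meta-llama/Meta-Llama-3.1-405B-Instruct"        # placeholder model name
PROMPT = "Summarize the benefits of large-VRAM GPUs in two sentences."


async def one_request(client: httpx.AsyncClient) -> float | None:
    """Return seconds until the first streamed chunk, or None if the request failed."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 128,
        "stream": True,
    }
    start = time.perf_counter()
    try:
        async with client.stream("POST", BASE_URL, json=payload, timeout=120.0) as resp:
            resp.raise_for_status()
            async for line in resp.aiter_lines():
                # First SSE "data:" line approximates the first generated token.
                if line.startswith("data:"):
                    return time.perf_counter() - start
    except httpx.HTTPError:
        return None  # treat errors/timeouts as dropped requests
    return None


async def run(rps: float, total_requests: int) -> None:
    async with httpx.AsyncClient() as client:
        tasks = []
        for _ in range(total_requests):
            tasks.append(asyncio.create_task(one_request(client)))
            await asyncio.sleep(1.0 / rps)  # pace submissions to hold the target RPS
        ttfts = [t for t in await asyncio.gather(*tasks) if t is not None]

    dropped = total_requests - len(ttfts)
    if ttfts:
        print(f"completed {len(ttfts)}/{total_requests} (dropped {dropped}), "
              f"mean TTFT {sum(ttfts) / len(ttfts):.3f}s")
    else:
        print("all requests failed")


if __name__ == "__main__":
    asyncio.run(run(rps=3, total_requests=300))
```

Running it with `rps=3` and `total_requests=300` roughly mirrors one of the scenarios mentioned above; the benchmark itself may have used a different client and parameters.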