From eb52348cb2bd4a14b76efa2a3b83e57186fdac96 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Wed, 9 Oct 2024 23:01:37 +0200
Subject: [PATCH] - [Blog] AMD MI300X inference benchmark #1806 (WIP)

---
 docs/blog/posts/amd-mi300x-inference-benchmark.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/blog/posts/amd-mi300x-inference-benchmark.md b/docs/blog/posts/amd-mi300x-inference-benchmark.md
index afea04fcb..ef7021ae0 100644
--- a/docs/blog/posts/amd-mi300x-inference-benchmark.md
+++ b/docs/blog/posts/amd-mi300x-inference-benchmark.md
@@ -148,7 +148,7 @@ and continued this pattern up to 150 requests at 5 RPS.
 
 Ideally, we would expect all trials to complete within the same time frame. However, due to resource limitations
 and increasing resource utilization, higher RPS does not lead to a proportional increase in throughput (tokens per second)
-or maintain total time from first token (TTFT).
+or maintain Time to First Token (TTFT).
 
@@ -181,7 +181,8 @@ This difference may be related to how vLLM [pre-allocates GPU cache :material-ar
   throughput.
 - Conversely, vLLM works well with lower RPS but struggles to scale, making it less ideal for more demanding workloads.
 - TGI's edge comes from
-  its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"} , which dynamically modifies batch sizes to optimize GPU usage.
+  its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"}, which dynamically modifies batch sizes to optimize GPU usage.
+  To gain a more complete understanding of the performance potential, a wider variety of backend configurations should be tested.