[Blog] AMD MI300X inference benchmark #1806 (WIP)
peterschmidt85 committed Oct 9, 2024
1 parent 123352f commit eb52348
Showing 1 changed file with 3 additions and 2 deletions.
docs/blog/posts/amd-mi300x-inference-benchmark.md (5 changes: 3 additions & 2 deletions)
@@ -148,7 +148,7 @@ and continued this pattern up to 150 requests at 5 RPS.

Ideally, we would expect all trials to complete within the same time frame. However, due to resource limitations and
increasing resource utilization, higher RPS does not lead to a proportional increase in throughput (tokens per second)
-or maintain total time from first token (TTFT).
+or maintain Time to First Token (TTFT).

<img src="https://raw.githubusercontent.com/dstackai/benchmarks/refs/heads/main/amd/inference/charts_rps/mean_ttft_low_tgi_vllm.png" width="725" style="padding: 0 40px 0 50px"/>
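
For reference, here is a minimal sketch of how the two metrics above (mean TTFT and token throughput) can be computed from per-request timing records. The field names (`request_start`, `first_token_time`, `last_token_time`, `num_output_tokens`) are assumptions chosen for illustration, not the benchmark's actual schema.

```python
# Illustrative only: derive mean TTFT and output-token throughput from
# per-request timing records. Field names are assumed, not the benchmark's schema.
from dataclasses import dataclass


@dataclass
class RequestTiming:
    request_start: float      # seconds, when the request was sent
    first_token_time: float   # seconds, when the first output token arrived
    last_token_time: float    # seconds, when the final output token arrived
    num_output_tokens: int


def mean_ttft(timings: list[RequestTiming]) -> float:
    """Mean Time to First Token across all requests, in seconds."""
    return sum(t.first_token_time - t.request_start for t in timings) / len(timings)


def throughput_tokens_per_second(timings: list[RequestTiming]) -> float:
    """Total output tokens divided by the wall-clock span of the trial."""
    start = min(t.request_start for t in timings)
    end = max(t.last_token_time for t in timings)
    return sum(t.num_output_tokens for t in timings) / (end - start)
```

Comparing these two numbers across trials at increasing RPS is what surfaces the degradation described above.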

@@ -181,7 +181,8 @@ This difference may be related to how vLLM [pre-allocates GPU cache :material-ar
throughput.
- Conversely, vLLM works well with lower RPS but struggles to scale, making it less ideal for more demanding workloads.
- TGI's edge comes from
-its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"} , which dynamically modifies batch sizes to optimize GPU usage.
+its [continuous batching algorithm :material-arrow-top-right-thin:{ .external }](https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi){:target="_blank"}, which dynamically modifies batch sizes to optimize GPU usage.


To gain a more complete understanding of the performance potential, a wider variety of backend configurations should be tested.
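
To make the continuous batching idea mentioned above concrete, here is a toy scheduling loop in which new requests join the running batch as soon as slots free up, rather than waiting for the whole batch to drain. This is a deliberate simplification with assumed names (`decode_step`, `max_batch_size`), not TGI's actual scheduler.

```python
# Toy sketch of continuous batching: requests are admitted into the running
# batch between decode iterations, so the batch size changes dynamically.
from collections import deque


def continuous_batching(requests, max_batch_size, decode_step):
    """requests: iterable of request objects.
    decode_step(batch): advances every request in the batch by one token and
    returns the subset of requests that have finished generating."""
    waiting = deque(requests)
    running = []
    completed = []

    while waiting or running:
        # Refill the batch whenever there is spare capacity.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding iteration for the whole batch.
        finished = decode_step(running)
        for req in finished:
            running.remove(req)
            completed.append(req)

    return completed
```

Because the batch is refilled between decode iterations, the GPU stays busy even as individual requests finish at different times, which is consistent with the edge observed for TGI at higher RPS.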

