---
title: Optimizing inference cold starts on RunPod with volumes
date: 2024-08-13
description: "Learn how to use volumes with dstack to optimize model inference cold start times on RunPod."
slug: volumes-on-runpod
---

# Optimizing inference cold starts on RunPod with volumes

Deploying custom models in the cloud often runs into the challenge of cold start times: the time to provision a new instance plus the time to download the model. This is especially relevant for autoscaled services, where new model replicas need to be provisioned quickly.

Let's explore how `dstack` optimizes this process using volumes, with an example of deploying a model on RunPod.

<!-- more -->

Suppose you want to deploy Llama 3.1 on RunPod as a [service](../../docs/services.md):

<div editor-title="examples/llms/llama31/tgi/service.dstack.yml">

```yaml
type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

spot_policy: auto

resources:
  gpu: 24GB

model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```

</div>

When you run `dstack apply`, it creates a public endpoint with one service replica. `dstack` will then automatically scale the service by adjusting the number of replicas based on traffic.
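
For example, deploying the service is just a matter of passing the configuration above to `dstack apply` (the file path here is the one from the editor title above):

<div class="termy">

```shell
$ dstack apply -f examples/llms/llama31/tgi/service.dstack.yml
```

</div>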

When starting each replica, `text-generation-launcher` downloads the model to the `/data` folder. For Llama 3.1 8B this usually takes under a minute, but larger models take longer. Repeating the download for every new replica significantly hurts auto-scaling efficiency.

Great news: RunPod supports network volumes, which we can use for caching models across multiple replicas.

With `dstack`, you can create a RunPod volume using the following configuration:

<div editor-title="examples/mist/volumes/runpod.dstack.yml">

```yaml
type: volume
name: llama31-volume
backend: runpod
region: EU-SE-1
# Required size
size: 100GB
```

</div>

Go ahead and create it via `dstack apply`:

<div class="termy">

```shell
$ dstack apply -f examples/mist/volumes/runpod.dstack.yml
```

</div>

Once the volume is created, attach it to your service by updating the configuration file and mapping the volume name to the `/data` path.

<div editor-title="examples/llms/llama31/tgi/service.dstack.yml">

```yaml
type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

volumes:
  - name: llama31-volume
    path: /data

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

spot_policy: auto

resources:
  gpu: 24GB

model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```

</div>

In this case, `dstack` attaches the specified volume to each new replica. This ensures the model is downloaded only once, reducing cold start time in proportion to the model size.
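
Once the service is up, you can query it through its OpenAI-compatible endpoint. Here's a minimal sketch, assuming your gateway is reachable at the placeholder domain `gateway.example.com` and you substitute your own `dstack` token:

<div class="termy">

```shell
$ curl https://gateway.example.com/v1/chat/completions \
    -H "Authorization: Bearer <dstack token>" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64
        }'
```

</div>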

A notable feature of RunPod is that volumes can be attached to multiple containers simultaneously. This capability is particularly useful for autoscalable services or distributed tasks.

Using [volumes](../../docs/concepts/volumes.md) not only optimizes inference cold start times but also enhances the efficiency of data and model checkpoint loading during training and fine-tuning. Whether you're running [tasks](../../docs/tasks.md) or [dev environments](../../docs/dev-environments.md), leveraging volumes can significantly streamline your workflow and improve overall performance.
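
For instance, a fine-tuning task could reuse the same volume to persist checkpoints across runs. Here's a minimal sketch; the editor title, the `finetune.py` script, and the `/data/checkpoints` output path are illustrative, not part of the examples above:

<div editor-title="examples/llms/llama31/finetune/task.dstack.yml">

```yaml
type: task
name: llama31-finetune

# Reuse the network volume so weights and checkpoints persist across runs
volumes:
  - name: llama31-volume
    path: /data

python: "3.11"
env:
  - HUGGING_FACE_HUB_TOKEN
commands:
  # finetune.py is a hypothetical training script
  - pip install -r requirements.txt
  - python finetune.py --output-dir /data/checkpoints

resources:
  gpu: 24GB
```

</div>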