From 1d931dcb8662a47881109a138198a3bf9c55f501 Mon Sep 17 00:00:00 2001
From: Andrey Cheptsov <54148038+peterschmidt85@users.noreply.github.com>
Date: Tue, 13 Aug 2024 11:58:15 +0200
Subject: [PATCH] =?UTF-8?q?[Docs]=20Document=20how=20to=20reduce=20cold=20?=
=?UTF-8?q?start=20times=20for=20services=20using=20fle=E2=80=A6=20(#1545)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
* [Docs] Document how to reduce cold start times for services using fleets and volumes #1459
---
docs/blog/posts/volumes-on-runpod.md | 141 +++++++++++++++++++++++++++
1 file changed, 141 insertions(+)
create mode 100644 docs/blog/posts/volumes-on-runpod.md
diff --git a/docs/blog/posts/volumes-on-runpod.md b/docs/blog/posts/volumes-on-runpod.md
new file mode 100644
index 000000000..dc45e7c2f
--- /dev/null
+++ b/docs/blog/posts/volumes-on-runpod.md
@@ -0,0 +1,141 @@
+---
+title: Optimizing inference cold starts on RunPod with volumes
+date: 2024-08-13
+description: "Learn how to use volumes with dstack to optimize model inference cold start times on RunPod."
+slug: volumes-on-runpod
+---
+
+# Optimizing inference cold starts on RunPod with volumes
+
+Deploying custom models in the cloud often comes with the challenge of cold start times: before a replica can serve
+traffic, a new instance must be provisioned and the model downloaded. This is especially relevant for autoscaling
+services, where new model replicas need to come online quickly.
+
+Let's explore how `dstack` optimizes this process using volumes, with an example of
+deploying a model on RunPod.
+
+
+Suppose you want to deploy Llama 3.1 on RunPod as a [service](../../docs/services.md):
+
+
+```yaml
+type: service
+name: llama31-service-tgi
+
+replicas: 1..2
+scaling:
+ metric: rps
+ target: 30
+
+image: ghcr.io/huggingface/text-generation-inference:latest
+env:
+ - HUGGING_FACE_HUB_TOKEN
+ - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
+ - MAX_INPUT_LENGTH=4000
+ - MAX_TOTAL_TOKENS=4096
+commands:
+ - text-generation-launcher
+port: 80
+
+spot_policy: auto
+
+resources:
+ gpu: 24GB
+
+model:
+ format: openai
+ type: chat
+ name: meta-llama/Meta-Llama-3.1-8B-Instruct
+```
+
+
+When you run `dstack apply`, it creates a public endpoint with one service replica. `dstack` will then automatically scale
+the service by adjusting the number of replicas based on traffic.
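+
+For example, once the service is up, you can query the model via the gateway's OpenAI-compatible endpoint. Here's a
+minimal sketch, assuming the configuration above is saved as `service.dstack.yml` (a hypothetical filename) and your
+gateway is configured at `example.com`; the exact URL and token depend on your setup:
+
+```shell
+$ dstack apply -f service.dstack.yml
+
+$ curl https://gateway.example.com/v1/chat/completions \
+    -H 'Content-Type: application/json' \
+    -H 'Authorization: Bearer <dstack token>' \
+    -d '{
+          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
+          "messages": [{"role": "user", "content": "Hello!"}]
+        }'
+```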
+
+When starting each replica, `text-generation-launcher` downloads the model to the `/data` folder. For Llama 3.1 8B,
+this usually takes under a minute, but larger models take longer. Because every new replica pays this download cost
+again, repeated downloads significantly reduce autoscaling efficiency.
+
+Great news: RunPod supports network volumes, which we can use for caching models across multiple replicas.
+
+With `dstack`, you can create a RunPod volume using the following configuration:
+
+
+```yaml
+type: volume
+name: llama31-volume
+
+backend: runpod
+region: EU-SE-1
+
+# Required size
+size: 100GB
+```
+
+
+Go ahead and create it via `dstack apply`:
+
+
+```shell
+$ dstack apply -f examples/misc/volumes/runpod.dstack.yml
+```
+
+
+Once the volume is created, attach it to your service by updating the configuration file and mapping the
+volume name to the `/data` path.
+
+
+```yaml
+type: service
+name: llama31-service-tgi
+
+replicas: 1..2
+scaling:
+ metric: rps
+ target: 30
+
+volumes:
+ - name: llama31-volume
+ path: /data
+
+image: ghcr.io/huggingface/text-generation-inference:latest
+env:
+ - HUGGING_FACE_HUB_TOKEN
+ - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
+ - MAX_INPUT_LENGTH=4000
+ - MAX_TOTAL_TOKENS=4096
+commands:
+ - text-generation-launcher
+port: 80
+
+spot_policy: auto
+
+resources:
+ gpu: 24GB
+
+model:
+ format: openai
+ type: chat
+ name: meta-llama/Meta-Llama-3.1-8B-Instruct
+```
+
+
+In this case, `dstack` attaches the specified volume to each new replica. The model is downloaded only once; every
+subsequent replica reuses the cached files, so the larger the model, the more cold start time you save.
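+
+To roll out the change, re-run `dstack apply` on the updated configuration (again assuming the hypothetical
+`service.dstack.yml` filename from above):
+
+```shell
+$ dstack apply -f service.dstack.yml
+```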
+
+A notable feature of RunPod is that volumes can be attached to multiple containers simultaneously. This makes them a
+good fit for autoscaling services and distributed tasks, and it also lets you warm the cache ahead of time, as shown
+below.
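+
+For example, since the volume can be shared, you can pre-download the model with a one-off task before traffic
+arrives. Here's a minimal sketch, assuming a hypothetical `llama31-warmup` name and relying on TGI reading the
+standard Hugging Face cache layout from `/data`:
+
+```yaml
+type: task
+name: llama31-warmup
+
+volumes:
+  - name: llama31-volume
+    path: /data
+
+image: python:3.11
+env:
+  - HUGGING_FACE_HUB_TOKEN
+  # Point the Hugging Face cache at the volume (assumption: TGI reads this layout from /data)
+  - HF_HUB_CACHE=/data
+commands:
+  - pip install -U huggingface_hub
+  - huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
+```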
+
+Using [volumes](../../docs/concepts/volumes.md) not only optimizes inference cold start times but also speeds up the
+loading of data and model checkpoints during training and fine-tuning. Whether you're running
+[tasks](../../docs/tasks.md) or [dev environments](../../docs/dev-environments.md), volumes save you from re-fetching
+the same files on every run.
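+
+For instance, a dev environment can mount the same volume to inspect or reuse the cached model files. A minimal
+sketch, assuming a hypothetical `llama31-dev` name:
+
+```yaml
+type: dev-environment
+name: llama31-dev
+
+ide: vscode
+
+volumes:
+  - name: llama31-volume
+    path: /data
+```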
\ No newline at end of file