From 1d931dcb8662a47881109a138198a3bf9c55f501 Mon Sep 17 00:00:00 2001
From: Andrey Cheptsov <54148038+peterschmidt85@users.noreply.github.com>
Date: Tue, 13 Aug 2024 11:58:15 +0200
Subject: [PATCH] =?UTF-8?q?[Docs]=20Document=20how=20to=20reduce=20cold=20?=
 =?UTF-8?q?start=20times=20for=20services=20using=20fle=E2=80=A6=20(#1545)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* [Docs] Document how to reduce cold start times for services using fleets and volumes #1459
---
 docs/blog/posts/volumes-on-runpod.md | 141 +++++++++++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 docs/blog/posts/volumes-on-runpod.md

diff --git a/docs/blog/posts/volumes-on-runpod.md b/docs/blog/posts/volumes-on-runpod.md
new file mode 100644
index 000000000..dc45e7c2f
--- /dev/null
+++ b/docs/blog/posts/volumes-on-runpod.md
@@ -0,0 +1,141 @@
---
title: Optimizing inference cold starts on RunPod with volumes
date: 2024-08-13
description: "Learn how to use volumes with dstack to optimize model inference cold start times on RunPod."
slug: volumes-on-runpod
---

# Optimizing inference cold starts on RunPod with volumes

Deploying custom models in the cloud often comes with long cold starts: before a replica can serve traffic, a new instance has to be provisioned and the model downloaded. This is especially relevant for services with autoscaling, where new model replicas must come online quickly.

Let's explore how `dstack` optimizes this process using volumes, with an example of deploying a model on RunPod.

Suppose you want to deploy Llama 3.1 on RunPod as a [service](../../docs/services.md):
```yaml
type: service
name: llama31-service-tgi

# Autoscale between 1 and 2 replicas based on requests per second
replicas: 1..2
scaling:
  metric: rps
  target: 30

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

# Use spot instances when available, fall back to on-demand
spot_policy: auto

resources:
  gpu: 24GB

# Map the model so it can be accessed via an OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```
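To deploy the service, pass the configuration to `dstack apply`. A minimal sketch, assuming the configuration is saved as `llama31-service-tgi.dstack.yml` (the file name is illustrative); since `HUGGING_FACE_HUB_TOKEN` is listed under `env` without a value, it has to be provided at apply time, for example from your shell environment:

```shell
$ export HUGGING_FACE_HUB_TOKEN=<your token>
$ dstack apply -f llama31-service-tgi.dstack.yml
```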
When you run `dstack apply`, it creates a public endpoint with one service replica. `dstack` then automatically scales the service by adjusting the number of replicas based on traffic.

Each time a replica starts, `text-generation-launcher` downloads the model to the `/data` folder. For Llama 3.1 8B this usually takes under a minute, but larger models take longer. Downloading the model again for every new replica significantly hurts autoscaling efficiency.

The good news is that RunPod supports network volumes, which we can use to cache the model across replicas.

With `dstack`, you can create a RunPod volume using the following configuration:
```yaml
type: volume
name: llama31-volume

backend: runpod
region: EU-SE-1

# Required size
size: 100GB
```
Go ahead and create it via `dstack apply`:
```shell
$ dstack apply -f examples/misc/volumes/runpod.dstack.yml
```
Once the volume is created, attach it to your service by updating the configuration file and mapping the volume name to the `/data` path:
```yaml
type: service
name: llama31-service-tgi

replicas: 1..2
scaling:
  metric: rps
  target: 30

# Mount the network volume so the model is downloaded to it only once
volumes:
  - name: llama31-volume
    path: /data

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_INPUT_LENGTH=4000
  - MAX_TOTAL_TOKENS=4096
commands:
  - text-generation-launcher
port: 80

spot_policy: auto

resources:
  gpu: 24GB

model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```
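Re-run `dstack apply` with the updated file to roll out the change. Once replicas are up, the mapped model can be queried through the gateway's OpenAI-compatible endpoint. The snippet below is only a sketch: the gateway hostname and the `DSTACK_TOKEN` variable are placeholders for your own deployment.

```shell
$ curl https://gateway.example.com/v1/chat/completions \
    -H "Authorization: Bearer $DSTACK_TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
          "messages": [{"role": "user", "content": "What is a cold start?"}],
          "max_tokens": 128
        }'
```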
With the volume attached, `dstack` mounts it to each new replica as the service scales. This ensures the model is downloaded only once, cutting cold start time by an amount that grows with the model size.

A notable feature of RunPod is that volumes can be attached to multiple containers simultaneously. This capability is particularly useful for autoscaled services and distributed tasks.

Using [volumes](../../docs/concepts/volumes.md) not only cuts inference cold start times but also speeds up loading data and model checkpoints during training and fine-tuning. Whether you're running [tasks](../../docs/tasks.md) or [dev environments](../../docs/dev-environments.md), volumes can significantly streamline your workflow and improve overall performance.
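To illustrate the last point, the same volume can be mounted in a [task](../../docs/tasks.md), for example to persist fine-tuning checkpoints. Below is a rough sketch rather than a tested configuration: the task name, the dependencies, and the `train.py` script are placeholders, and the `volumes` section reuses the volume created above.

```yaml
type: task
name: llama31-finetune

# Reuse the RunPod network volume for datasets and checkpoints
volumes:
  - name: llama31-volume
    path: /data

python: "3.11"
commands:
  - pip install transformers trl peft
  - python train.py --output_dir /data/checkpoints

resources:
  gpu: 24GB
```

Since a network volume is tied to its backend and region, this task would be provisioned on RunPod in `EU-SE-1`, next to the data it reads and writes.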