Add TPU examples with optimum-tpu and vLLM #1663

Merged · 1 commit · Sep 5, 2024
4 changes: 2 additions & 2 deletions docs/docs/concepts/fleets.md
@@ -223,8 +223,8 @@ you can set the [`termination_idle_time`](../reference/dstack.yml/fleet.md#termi

## What's next?

-1. Read about [dev environments](dev-environments.md), [tasks](tasks.md), and
-   [services](services.md)
+1. Read about [dev environments](../dev-environments.md), [tasks](../tasks.md), and
+   [services](../services.md)
2. Join the community via [Discord :material-arrow-top-right-thin:{ .external }](https://discord.gg/u8SmfwPpMd)

!!! info "Reference"
199 changes: 199 additions & 0 deletions examples/accelerators/tpu/README.md
@@ -0,0 +1,199 @@
# TPU

If you're using the `gcp` backend, you can use TPUs. Just specify the TPU version and the number of cores
(separated by a dash) in the `gpu` property under `resources`.

> Currently, a maximum of 8 TPU cores can be specified, so the largest supported values are `v2-8`, `v3-8`, `v4-8`, `v5litepod-8`,
> and `v5e-8`. Multi-host TPU support, which allows a larger number of cores, is coming soon.
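
For example, a task that requests a single-host v5e TPU with 8 cores could look like the following (a minimal sketch; the task name and command are placeholders):

```yaml
type: task
# Placeholder name for illustration
name: tpu-test

commands:
  - echo "TPU instance is up"

resources:
  # TPU version and number of cores, separated by a dash
  gpu: v5litepod-8
```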

Below are a few examples of using TPUs for deployment and fine-tuning.

## Deployment

### Running as a service

You can use any serving framework, such as vLLM or TGI. Here's an example of a [service](https://dstack.ai/docs/services) that deploys
Llama 3.1 8B using
[Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
and [vLLM :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm){:target="_blank"}.

=== "Optimum TPU"

<div editor-title="examples/deployment/optimum-tpu/service.dstack.yml">

```yaml
type: service
name: llama31-service-optimum-tpu

image: dstackai/optimum-tpu:llama31
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - MAX_TOTAL_TOKENS=4096
  - MAX_BATCH_PREFILL_TOKENS=4095
commands:
  - text-generation-launcher --port 8000
port: 8000

spot_policy: auto
resources:
  gpu: v5litepod-4

model:
  format: tgi
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```
</div>

Note that Optimum TPU sets `MAX_INPUT_TOKEN` to 4095 by default, so `MAX_BATCH_PREFILL_TOKENS` must be set to 4095 as well.

??? info "Docker image"
The official Docker image `huggingface/optimum-tpu:latest` doesn’t support Llama 3.1 8B.
We’ve created a custom image with the fix: `dstackai/optimum-tpu:llama31`.
Once the [pull request :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/pull/87){:target="_blank"} is merged,
the official Docker image can be used.

=== "vLLM"
<div editor-title="examples/deployment/vllm/service-tpu.dstack.yml">

```yaml
type: service
name: llama31-service-vllm-tpu

env:
  - MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
  - HUGGING_FACE_HUB_TOKEN
  - DATE=20240828
  - TORCH_VERSION=2.5.0
  - VLLM_TARGET_DEVICE=tpu
  - MAX_MODEL_LEN=4096
commands:
  - pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
  - pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
  - pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
  - pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
  - git clone https://github.com/vllm-project/vllm.git
  - cd vllm
  - pip install -r requirements-tpu.txt
  - apt-get install -y libopenblas-base libopenmpi-dev libomp-dev
  - python setup.py develop
  - vllm serve $MODEL_ID
    --tensor-parallel-size 4
    --max-model-len $MAX_MODEL_LEN
    --port 8000
port:
  - 8000

spot_policy: auto
resources:
  gpu: v5litepod-4

model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
```
</div>

Note that when using Llama 3.1 8B with a `v5litepod`, which has 16GB of memory per core, the context size must be limited to 4096 tokens to fit into memory.

### Memory requirements

Below are the approximate memory requirements for serving LLMs with their corresponding TPUs.

| Model size | bfloat16 | TPU | int8 | TPU |
|------------|----------|--------------|-------|----------------|
| **8B** | 16GB | v5litepod-4 | 8GB | v5litepod-4 |
| **70B** | 140GB | v5litepod-16 | 70GB | v5litepod-16 |
| **405B** | 810GB | v5litepod-64 | 405GB | v5litepod-64 |

Note that `v5litepod` is optimized for serving transformer-based models. Each core is equipped with 16GB of memory.
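
These figures are roughly the weight footprint alone: parameter count times bytes per parameter, matched against the pod's aggregate memory of 16GB per core (a rough estimate; the KV cache and activations need additional headroom):

```latex
\begin{aligned}
8\,\mathrm{B}   \times 2\ \mathrm{bytes} &\approx 16\ \mathrm{GB}  &&\le 4  \times 16\ \mathrm{GB}\ \text{(v5litepod-4)} \\
70\,\mathrm{B}  \times 2\ \mathrm{bytes} &\approx 140\ \mathrm{GB} &&\le 16 \times 16\ \mathrm{GB}\ \text{(v5litepod-16)} \\
405\,\mathrm{B} \times 2\ \mathrm{bytes} &\approx 810\ \mathrm{GB} &&\le 64 \times 16\ \mathrm{GB}\ \text{(v5litepod-64)}
\end{aligned}
```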

### Supported frameworks

| Framework | Quantization | Note |
|-----------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **TGI** | bfloat16 | To deploy with TGI, Optimum TPU must be used. |
| **vLLM** | int8, bfloat16 | int8 quantization still requires the same memory because the weights are first moved to the TPU in bfloat16, and then converted to int8. See the [pull request :material-arrow-top-right-thin:{ .external }](https://github.com/vllm-project/vllm/pull/7005){:target="_blank"} for more details. |

### Running a configuration

Once the configuration is ready, run `dstack apply -f <configuration file>`, and `dstack` will automatically provision the
cloud resources and run the configuration.
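
For example, to deploy the Optimum TPU service defined above, run the following (using the configuration path from this example):

```shell
dstack apply -f examples/deployment/optimum-tpu/service.dstack.yml
```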

## Fine-tuning with Optimum TPU

Below is an example of fine-tuning Llama 3.1 8B using [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu){:target="_blank"}
and the [Abirate/english_quotes :material-arrow-top-right-thin:{ .external }](https://huggingface.co/datasets/Abirate/english_quotes){:target="_blank"}
dataset.

<div editor-title="examples/fine-tuning/optimum-tpu/llama31/train.dstack.yml">

```yaml
type: task
name: optimum-tpu-llama-train

python: "3.11"

env:
  - HUGGING_FACE_HUB_TOKEN
commands:
  - git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
  - mkdir -p optimum-tpu/examples/custom/
  - cp examples/fine-tuning/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
  - cp examples/fine-tuning/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
  - cd optimum-tpu
  - pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
  - pip install datasets evaluate
  - pip install accelerate -U
  - pip install peft
  - python examples/custom/train.py examples/custom/config.yaml

resources:
  gpu: v5litepod-8
```

</div>

[//]: # (### Fine-Tuning with TRL)
[//]: # (Use the example `examples/fine-tuning/optimum-tpu/gemma/train.dstack.yml` to Finetune `Gemma-2B` model using `trl` with `dstack` and `optimum-tpu`. )

### Memory requirements

Below are the approximate memory requirements for fine-tuning LLMs with their corresponding TPUs.

| Model size | LoRA | TPU |
|------------|-------|--------------|
| **8B** | 16GB | v5litepod-8 |
| **70B** | 160GB | v5litepod-16 |
| **405B** | 950GB | v5litepod-64 |

Note that `v5litepod` is optimized for fine-tuning transformer-based models. Each core is equipped with 16GB of memory.

### Supported frameworks

| Framework | Quantization | Note |
|-----------------|--------------|---------------------------------------------------------------------------------------------------|
| **TRL** | bfloat16 | To fine-tune using TRL, Optimum TPU is recommended. TRL doesn't support Llama 3.1 out of the box. |
| **Pytorch XLA** | bfloat16 | |

## Dev environments

Before running a task or service, it's recommended that you first start with
a [dev environment](https://dstack.ai/docs/dev-environments). Dev environments
allow you to run commands interactively.
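
A minimal dev environment with a TPU mirrors the `examples/deployment/optimum-tpu/.dstack.yml` configuration included in this example:

```yaml
type: dev-environment
name: vscode-optimum-tpu

# Docker image with the Llama 3.1 fix (see the Docker image note above)
image: dstackai/optimum-tpu:llama31
env:
  - HUGGING_FACE_HUB_TOKEN
ide: vscode

resources:
  gpu: v5litepod-4

# Use either spot or on-demand instances
spot_policy: auto
```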

## Source code

The source code of this example can be found in
[examples/deployment/optimum-tpu :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/llms/llama31){:target="_blank"}
and [examples/fine-tuning/optimum-tpu :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/blob/master/examples/fine-tuning/trl){:target="_blank"}.

## What's next?

1. Browse [Optimum TPU :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu),
[Optimum TPU TGI :material-arrow-top-right-thin:{ .external }](https://github.com/huggingface/optimum-tpu/tree/main/text-generation-inference) and
[vLLM :material-arrow-top-right-thin:{ .external }](https://docs.vllm.ai/en/latest/getting_started/tpu-installation.html).
2. Check [dev environments](https://dstack.ai/docs/dev-environments), [tasks](https://dstack.ai/docs/tasks),
[services](https://dstack.ai/docs/services), and [fleets](https://dstack.ai/docs/fleets).
18 changes: 18 additions & 0 deletions examples/deployment/optimum-tpu/.dstack.yml
@@ -0,0 +1,18 @@
type: dev-environment
# The name is optional, if not specified, generated randomly
name: vscode-optimum-tpu

# Using a Docker image with a fix instead of the official one
# More details at https://github.com/huggingface/optimum-tpu/pull/87
image: dstackai/optimum-tpu:llama31
# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN
ide: vscode

resources:
  # Required resources
  gpu: v5litepod-4

# Use either spot or on-demand instances
spot_policy: auto
28 changes: 28 additions & 0 deletions examples/deployment/optimum-tpu/service.dstack.yml
@@ -0,0 +1,28 @@
type: service
# The name is optional, if not specified, generated randomly
name: llama31-service-optimum-tpu

# Using a Docker image with a fix instead of the official one
# More details at https://github.com/huggingface/optimum-tpu/pull/87
image: dstackai/optimum-tpu:llama31
# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- MAX_TOTAL_TOKENS=4096
- MAX_BATCH_PREFILL_TOKENS=4095
commands:
- text-generation-launcher --port 8000
port: 8000

resources:
  # Required resources
  gpu: v5litepod-4

# Use either spot or on-demand instances
spot_policy: auto

model:
  format: tgi
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
23 changes: 23 additions & 0 deletions examples/deployment/optimum-tpu/task.dstack.yml
@@ -0,0 +1,23 @@
type: task
# The name is optional, if not specified, generated randomly
name: llama31-task-optimum-tpu

# Using a Docker image with a fix instead of the official one
# More details at https://github.com/huggingface/optimum-tpu/pull/87
image: dstackai/optimum-tpu:llama31
# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- MAX_TOTAL_TOKENS=4096
- MAX_BATCH_PREFILL_TOKENS=4095
commands:
- text-generation-launcher --port 8000
ports: [8000]

resources:
  # Required resources
  gpu: v5litepod-4

# Use either spot or on-demand instances
spot_policy: auto
40 changes: 40 additions & 0 deletions examples/deployment/vllm/service-tpu.dstack.yml
@@ -0,0 +1,40 @@
type: service
# The name is optional, if not specified, generated randomly
name: llama31-service-vllm-tpu

env:
- HUGGING_FACE_HUB_TOKEN
- MODEL_ID=meta-llama/Meta-Llama-3.1-8B-Instruct
- DATE=20240828
- TORCH_VERSION=2.5.0
- VLLM_TARGET_DEVICE=tpu
- MAX_MODEL_LEN=4096

commands:
- pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
- pip3 install https://storage.googleapis.com/pytorch-xla-releases/wheels/tpuvm/torch_xla-${TORCH_VERSION}.dev${DATE}-cp311-cp311-linux_x86_64.whl
- pip install torch_xla[tpu] -f https://storage.googleapis.com/libtpu-releases/index.html
- pip install torch_xla[pallas] -f https://storage.googleapis.com/jax-releases/jax_nightly_releases.html -f https://storage.googleapis.com/jax-releases/jaxlib_nightly_releases.html
- git clone https://github.com/vllm-project/vllm.git
- cd vllm
- pip install -r requirements-tpu.txt
- apt-get install -y libopenblas-base libopenmpi-dev libomp-dev
- python setup.py develop
- vllm serve $MODEL_ID
  --tensor-parallel-size 4
  --max-model-len $MAX_MODEL_LEN
  --port 8000

# Expose the vllm server port
port: 8000

spot_policy: auto

resources:
  gpu: v5litepod-4

# (Optional) Enable the OpenAI-compatible endpoint
model:
  format: openai
  type: chat
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
31 changes: 31 additions & 0 deletions examples/fine-tuning/optimum-tpu/llama31/.dstack.yml
@@ -0,0 +1,31 @@
type: dev-environment
# The name is optional, if not specified, generated randomly
name: optimum-tpu-vscode

# If `image` is not specified, dstack uses its default image
python: "3.11"

# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN

# Refer to Note section in examples/gpus/tpu/README.md for more information about the optimum-tpu repository.
# Uncomment if you want the environment to be pre-installed
#init:
# - git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
# - mkdir -p optimum-tpu/examples/custom/
# - cp examples/fine-tuning/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
# - cp examples/fine-tuning/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
# - cd optimum-tpu
# - pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
# - pip install datasets evaluate
# - pip install accelerate -U
# - pip install peft

ide: vscode

# Use either spot or on-demand instances
spot_policy: auto

resources:
  gpu: v5litepod-8
10 changes: 10 additions & 0 deletions examples/fine-tuning/optimum-tpu/llama31/config.yaml
@@ -0,0 +1,10 @@
per_device_train_batch_size: 24
per_device_eval_batch_size: 8
num_train_epochs: 1
max_steps: -1
output_dir: "./finetuned_models/llama3_fine_tuned"
optim: "adafactor"
dataset_name: "Abirate/english_quotes"
model_name: "meta-llama/Meta-Llama-3.1-8B"
lora_r: 4
push_to_hub: True
25 changes: 25 additions & 0 deletions examples/fine-tuning/optimum-tpu/llama31/train.dstack.yml
@@ -0,0 +1,25 @@
type: task
# The name is optional, if not specified, generated randomly
name: optimum-tpu-llama-train

python: "3.11"

# Required environment variables
env:
- HUGGING_FACE_HUB_TOKEN

# Commands of the task
commands:
- git clone -b add_llama_31_support https://github.com/dstackai/optimum-tpu.git
- mkdir -p optimum-tpu/examples/custom/
- cp examples/fine-tuning/optimum-tpu/llama31/train.py optimum-tpu/examples/custom/train.py
- cp examples/fine-tuning/optimum-tpu/llama31/config.yaml optimum-tpu/examples/custom/config.yaml
- cd optimum-tpu
- pip install -e . -f https://storage.googleapis.com/libtpu-releases/index.html
- pip install datasets evaluate
- pip install accelerate -U
- pip install peft
- python examples/custom/train.py examples/custom/config.yaml

resources:
  gpu: v5litepod-8