diff --git a/README.md b/README.md index 9c2880ac2..c443a0f9b 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ Supported providers: AWS, GCP, Azure, Lambda, TensorDock, Vast.ai, and DataCrunc ## Latest news ✨ -- [2024/01] [dstack 0.14.0rc1: OpenAI-compatible endpoints](https://dstack.ai/blog/2024/01/19/openai-endpoints-preview/) (Preview) +- [2024/01] [dstack 0.14.0: OpenAI-compatible endpoints preview](https://dstack.ai/blog/2024/01/19/openai-endpoints-preview/) (Release) - [2023/12] [dstack 0.13.0: Disk size, CUDA 12.1, Mixtral, and more](https://dstack.ai/blog/2023/12/22/disk-size-cuda-12-1-mixtral-and-more/) (Release) - [2023/11] [dstack 0.12.3: Vast.ai integration](https://dstack.ai/blog/2023/11/21/vastai/) (Release) - [2023/10] [dstack 0.12.2: TensorDock integration](https://dstack.ai/blog/2023/10/31/tensordock/) (Release) diff --git a/docs/blog/posts/openai-endpoints-preview.md b/docs/blog/posts/openai-endpoints-preview.md index 30a46f88f..f0b509995 100644 --- a/docs/blog/posts/openai-endpoints-preview.md +++ b/docs/blog/posts/openai-endpoints-preview.md @@ -1,24 +1,23 @@ --- -title: "dstack 0.14.0rc1: OpenAI-compatible endpoints preview" +title: "dstack 0.14.0: OpenAI-compatible endpoints preview" date: 2024-01-19 description: "Making it easier to deploy custom LLMs as OpenAI-compatible endpoints." slug: "openai-endpoints-preview" categories: - - Previews + - Releases --- -# dstack 0.14.0rc1: OpenAI-compatible endpoints preview +# dstack 0.14.0: OpenAI-compatible endpoints preview __Making it easier to deploy custom LLMs as OpenAI-compatible endpoints.__ The `service` configuration deploys any application as a public endpoint. For instance, you can use HuggingFace's -[TGI](https://github.com/huggingface/text-generation-inference) or frameworks to deploy -custom LLMs. While this is simple and customizable, using different frameworks and LLMs complicates -the integration of LLMs. +[TGI](https://github.com/huggingface/text-generation-inference) or other frameworks to deploy custom LLMs. +While this is simple and customizable, using different frameworks and LLMs complicates the integration of LLMs. -With the upcoming `dstack 0.14.0`, we are extending the `service` configuration in `dstack` to enable you to optionally map your +With `dstack 0.14.0`, we are extending the `service` configuration in `dstack` to enable you to optionally map your custom LLM to an OpenAI-compatible endpoint. Here's how it works: you define a `service` (as before) and include the `model` property with @@ -42,7 +41,7 @@ model: format: tgi ``` -When you deploy this service using `dstack run`, `dstack` will automatically publish the OpenAI-compatible endpoint, +When you deploy the service using `dstack run`, `dstack` will automatically publish the OpenAI-compatible endpoint, converting the prompt and response format between your LLM and OpenAI interface. ```python @@ -63,32 +62,20 @@ completion = client.chat.completions.create( print(completion.choices[0].message) ``` -!!! info "NOTE:" - By default, dstack loads the model's `chat_template` and `eos_token` from Hugging Face. However, you can override them using - the corresponding properties under `model`. - Here's a live demo of how it works: -## Try the preview - -To try the preview of this new upcoming feature, make sure to install `0.14.0rc1` and restart your server. 
- -```shell -pip install "dstack[all]==0.14.0rc1" -``` +For more details on how to use the new feature, be sure to check the updated documentation on [services](../../docs/concepts/services.md), +and the [TGI](../../examples/tgi.md) example. ## Migration guide -Note: In order to use the new feature, it's important to delete your existing gateway (if any) -using `dstack gateway delete` and then create it again with `dstack gateway create`. - -## Why does this matter? - -With `dstack`, you can train and deploy models using any cloud providers, easily leveraging GPU availability across -providers, spot instances, multiple regions, and more. +Note: After you update to `0.14.0`, it's important to delete your existing gateways (if any) +using `dstack gateway delete` and create them again with `dstack gateway create`. ## Feedback -Do you have any questions or need assistance? Feel free to join our [Discord server](https://discord.gg/u8SmfwPpMd). \ No newline at end of file +In case you have any questions, experience bugs, or need help, +drop us a message on our [Discord server](https://discord.gg/u8SmfwPpMd) or submit it as a +[GitHub issue](https://github.com/dstackai/dstack/issues/new/choose). \ No newline at end of file diff --git a/docs/blog/posts/simplified-cloud-setup.md b/docs/blog/posts/simplified-cloud-setup.md index 82ae2f32d..09fb5cf3d 100644 --- a/docs/blog/posts/simplified-cloud-setup.md +++ b/docs/blog/posts/simplified-cloud-setup.md @@ -39,7 +39,7 @@ projects: Regions and other settings are optional. Learn more on what credential types are supported -via [Clouds](../../docs/config/server.md). +via [Clouds](../../docs/installation/index.md). ## Enhanced API @@ -97,7 +97,7 @@ This means you'll need to delete `~/.dstack` and configure `dstack` from scratch 1. `pip install "dstack[all]==0.12.0"` 2. Delete `~/.dstack` -3. Configure clouds via `~/.dstack/server/config.yml` (see the [new guide](../../docs/config/server.md)) +3. Configure clouds via `~/.dstack/server/config.yml` (see the [new guide](../../docs/installation/index.md)) 4. Run `dstack server` The [documentation](../../docs/index.md) and [examples](../../examples/index.md) are updated. diff --git a/docs/docs/concepts/services.md b/docs/docs/concepts/services.md index 62fd046c6..d9d87962a 100644 --- a/docs/docs/concepts/services.md +++ b/docs/docs/concepts/services.md @@ -1,9 +1,7 @@ # Services -With `dstack`, you can use the CLI or API to deploy models or web apps. -Provide the commands, port, and choose the Python version or a Docker image. - -`dstack` handles the deployment on configured cloud GPU provider(s) with the necessary resources. +Services make it easy to deploy models and apps as public endpoints, allowing you to use any +frameworks. ??? info "Prerequisites" @@ -31,14 +29,15 @@ Provide the commands, port, and choose the Python version or a Docker image. Afterward, in your domain's DNS settings, add an `A` DNS record for `*.example.com` pointing to the IP address of the gateway. - This way, if you run a service, `dstack` will make its endpoint available at - `https://.example.com`. + Now, if you run a service, `dstack` will make its endpoint available at + `https://.`. -If you're using the cloud version of `dstack`, the gateway is set up for you. + In case your service has the [model mapping](#model-mapping) configured, `dstack` will + automatically make your model available at `https://gateway.` via the OpenAI-compatible interface. 
-## Using the CLI +If you're using the cloud version of `dstack`, the gateway is set up for you. -### Define a configuration +## Define a configuration First, create a YAML file in your project folder. Its name must end with `.dstack.yml` (e.g. `.dstack.yml` or `train.dstack.yml` are both acceptable). @@ -49,27 +48,86 @@ are both acceptable). type: service image: ghcr.io/huggingface/text-generation-inference:latest +env: + - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1 +port: 80 +commands: + - text-generation-launcher --port 80 --trust-remote-code +``` -env: - - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ + -port: 80 +The `image` property is optional. If not specified, `dstack` uses its own Docker image, +pre-configured with Python, Conda, and essential CUDA drivers. + +If you run such a configuration, once the service is up, you'll be able to +access it at `https://.` (see how to [set up a gateway](#set-up-a-gateway)). + +!!! info "Configuration options" + Configuration file allows you to specify a custom Docker image, environment variables, and many other + options. For more details, refer to the [Reference](../reference/dstack.yml.md#service). + +### Model mapping + +If your service is running a model, you can configure the model mapping to be able to access it via the +OpenAI-compatible interface. +
+ +```yaml +type: service + +image: ghcr.io/huggingface/text-generation-inference:latest +env: + - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1 +port: 80 commands: - - text-generation-launcher --hostname 0.0.0.0 --port 80 --trust-remote-code + - text-generation-launcher --port 80 --trust-remote-code + +model: + type: chat + name: mistralai/Mistral-7B-Instruct-v0.1 + format: tgi ```
-By default, `dstack` uses its own Docker images to run dev environments, -which are pre-configured with Python, Conda, and essential CUDA drivers. +In this case, with such a configuration, once the service is up, you'll be able to access the model at +`https://gateway.` via the OpenAI-compatible interface. -!!! info "Configuration options" - Configuration file allows you to specify a custom Docker image, environment variables, and many other - options. - For more details, refer to the [Reference](../reference/dstack.yml.md#service). +#### Chat template + +By default, `dstack` loads the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating) +from the model's repository. If it is not present there, manual configuration is required. + +```yaml +type: service -### Run the configuration +image: ghcr.io/huggingface/text-generation-inference:latest +env: + - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ +port: 80 +commands: + - text-generation-launcher --port 80 --trust-remote-code --quantize gptq + +model: + type: chat + name: TheBloke/Llama-2-13B-chat-GPTQ + format: tgi + chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' }}{% endif %}{% endfor %}" + eos_token: "" +``` + +??? info "Limitations" + Note that model mapping is an experimental feature, and it has the following limitations: + + 1. Doesn't work if your `chat_template` uses `bos_token`. As a workaround, replace `bos_token` inside `chat_template` with the token content itself. + 2. Doesn't work if `eos_token` is defined in the model repository as a dictionary. As a workaround, set `eos_token` manually, as shown in the example above (see Chat template). + 3. Only works if you're using Text Generation Inference. Support for vLLM and other serving frameworks is coming later. + + If you encounter any other issues, please make sure to file a [GitHub issue](https://github.com/dstackai/dstack/issues/new/choose). + +## Run the configuration To run a configuration, use the `dstack run` command followed by the working directory path, configuration file path, and any other options (e.g., for requesting hardware resources). @@ -89,22 +147,57 @@ Continue? [y/n]: y Provisioning... ---> 100% -Serving HTTP on https://yellow-cat-1.example.com ... +Service is published at https://yellow-cat-1.example.com ``` -Once the service is deployed, its endpoint will be available at -`https://.` (using the domain [set up for the gateway](#set-up-a-gateway)). - !!! info "Run options" The `dstack run` command allows you to use `--gpu` to request GPUs (e.g. `--gpu A100` or `--gpu 80GB` or `--gpu A100:4`, etc.), and many other options (incl. spot instances, disk size, max price, max duration, retry policy, etc.). For more details, refer to the [Reference](../reference/cli/index.md#dstack-run). 
-[//]: # (TODO: Example)
+### Service endpoint
+
+Once the service is up, you'll be able to
+access it at `https://<run name>.<gateway domain>`.
+
+ +```shell +$ curl https://yellow-cat-1.example.com/generate \ + -X POST \ + -d '{"inputs":"<s>[INST] What is your favourite condiment?[/INST]"}' \ + -H 'Content-Type: application/json' +``` + +
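A minimal Python sketch of the same request using `requests` — the run name and domain below are placeholders for your own deployment:

```python
import requests

# Same request as the curl example above; substitute the URL of your own service.
response = requests.post(
    "https://yellow-cat-1.example.com/generate",
    json={"inputs": "<s>[INST] What is your favourite condiment?[/INST]"},
    timeout=60,
)
print(response.json())
```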
+ +#### OpenAI interface + +In case the service has the [model mapping](#model-mapping) configured, you will also be able +to access the model at `https://gateway.` via the OpenAI-compatible interface. + +```python +from openai import OpenAI + + +client = OpenAI( + base_url="https://gateway.example.com", + api_key="none" +) + +completion = client.chat.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.1", + messages=[ + {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} + ] +) + +print(completion.choices[0].message) +``` -What's next? +## What's next? 1. Check the [Text Generation Inference](../../examples/tgi.md) and [vLLM](../../examples/vllm.md) examples 2. Read about [dev environments](../concepts/dev-environments.md) diff --git a/docs/docs/quickstart.md b/docs/docs/quickstart.md index b307e40a2..9884487e0 100644 --- a/docs/docs/quickstart.md +++ b/docs/docs/quickstart.md @@ -29,8 +29,7 @@ or `train.dstack.yml` are both acceptable). ```yaml type: dev-environment - python: "3.11" # (Optional) If not specified, your local version is used - + python: "3.11" ide: vscode ``` @@ -45,16 +44,17 @@ or `train.dstack.yml` are both acceptable). ```yaml type: task - python: "3.11" # (Optional) If not specified, your local version is used - + python: "3.11" + env: + - HF_HUB_ENABLE_HF_TRANSFER=1 commands: - - pip install -r requirements.txt - - python train.py + - pip install -r fine-tuning/qlora/requirements.txt + - python fine-tuning/qlora/train.py ``` - Ensure `requirements.txt` and `train.py` are in your folder; you can take them from our [`examples`](https://github.com/dstackai/dstack-examples/tree/main/fine-tuning/qlora). + Ensure `requirements.txt` and `train.py` are in your folder. You can take them from [`dstack-examples`](https://github.com/dstackai/dstack-examples/tree/main/fine-tuning/qlora). === "Service" @@ -66,22 +66,19 @@ or `train.dstack.yml` are both acceptable). type: service image: ghcr.io/huggingface/text-generation-inference:latest - - env: - - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ - + env: + - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1 port: 80 - commands: - - text-generation-launcher --hostname 0.0.0.0 --port 80 --trust-remote-code + - text-generation-launcher --port 80 --trust-remote-code ``` - + ## Run configuration Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`), the path to the -configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.) +configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.)
diff --git a/docs/examples/deploy-python.md b/docs/examples/deploy-python.md index ffb32f279..d0720f26d 100644 --- a/docs/examples/deploy-python.md +++ b/docs/examples/deploy-python.md @@ -139,7 +139,7 @@ run.refresh() ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ```shell git clone https://github.com/dstackai/dstack-examples diff --git a/docs/examples/llama-index.md b/docs/examples/llama-index.md index ace8dcff2..24ebc928a 100644 --- a/docs/examples/llama-index.md +++ b/docs/examples/llama-index.md @@ -98,7 +98,7 @@ The data is in the vector database! Now we can proceed with the part where we in This example assumes we're using an LLM deployed using [TGI](tgi.md). Once you deployed the model, make sure to set the `TGI_ENDPOINT_URL` environment variable -to its URL, e.g. `https://.` (or `http://localhost:` if it's deployed +to its URL, e.g. `https://.` (or `http://localhost:` if it's deployed as a task). We'll use this environment variable below.
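For example, it could be set from Python before running the snippets below — the URL here is only a placeholder for your own deployment:

```python
import os

# Placeholder endpoint: use the URL printed by `dstack run` for your TGI service,
# or a localhost URL if TGI is running locally as a task.
os.environ.setdefault("TGI_ENDPOINT_URL", "https://yellow-cat-1.example.com")
```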
@@ -214,7 +214,7 @@ using `dstack`. For more in-depth information, we encourage you to explore the d ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ## What's next? diff --git a/docs/examples/mixtral.md b/docs/examples/mixtral.md index 4868e3c11..9ade1e0ba 100644 --- a/docs/examples/mixtral.md +++ b/docs/examples/mixtral.md @@ -8,79 +8,85 @@ with `dstack`'s [services](../docs/concepts/services.md). To deploy Mixtral as a service, you have to define the corresponding configuration file. Below are multiple variants: via vLLM (`fp16`), TGI (`fp16`), or TGI (`int4`). -=== "vLLM `fp16`" +=== "TGI `fp16`" -
+
```yaml type: service - # This configuration deploys Mixtral in fp16 using vLLM - - python: "3.11" + image: ghcr.io/huggingface/text-generation-inference:latest + env: + - MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1 commands: - - pip install vllm - - python -m vllm.entrypoints.openai.api_server - --model mistralai/Mixtral-8X7B-Instruct-v0.1 - --host 0.0.0.0 - --tensor-parallel-size 2 # Should match the number of GPUs - - port: 8000 + - text-generation-launcher + --port 80 + --trust-remote-code + --num-shard 2 # Should match the number of GPUs + port: 80 + + # Optional mapping for OpenAI-compatible endpoint + model: + type: chat + name: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ + format: tgi ```
-=== "TGI `fp16`" +=== "TGI `int4`" -
+
```yaml type: service - # This configuration deploys Mixtral in fp16 using TGI - - image: ghcr.io/huggingface/text-generation-inference:latest - + + image: ghcr.io/huggingface/text-generation-inference:latest env: - - MODEL_ID=mistralai/Mixtral-8x7B-Instruct-v0.1 - + - MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ commands: - - text-generation-launcher - --hostname 0.0.0.0 - --port 8000 + - text-generation-launcher + --port 80 --trust-remote-code - --num-shard 2 # Should match the number of GPUs - - port: 8000 + --quantize gptq + port: 80 + + # Optional mapping for OpenAI-compatible endpoint + model: + type: chat + name: TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ + format: tgi ```
-=== "TGI `int4`" +=== "vLLM `fp16`" -
+
```yaml type: service - # This configuration deploys Mixtral in int4 using TGI - - image: ghcr.io/huggingface/text-generation-inference:latest + # This configuration deploys Mixtral in fp16 using vLLM - env: - - MODEL_ID=TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ + python: "3.11" commands: - - text-generation-launcher - --hostname 0.0.0.0 - --port 8000 - --trust-remote-code - --quantize gptq + - pip install vllm + - python -m vllm.entrypoints.openai.api_server + --model mistralai/Mixtral-8X7B-Instruct-v0.1 + --host 0.0.0.0 + --tensor-parallel-size 2 # Should match the number of GPUs port: 8000 ```
-> vLLM's support for quantized Mixtral is not yet stable. + !!! info "NOTE:" + The [model mapping](../docs/concepts/services.md#model-mapping) to access the model via the + gateway's OpenAI-compatible endpoint is not yet supported for vLLM. + + Also, support for quantized Mixtral in vLLM is not yet stable. ## Run the configuration @@ -88,69 +94,72 @@ Below are multiple variants: via vLLM (`fp16`), TGI (`fp16`), or TGI (`int4`). Before running a service, make sure to set up a [gateway](../docs/concepts/services.md#set-up-a-gateway). However, it's not required when using dstack Cloud, as it's set up automatically. -!!! info "Resources" - For `fp16`, deployment of Mixtral, ensure a minimum total GPU memory of `100GB` and disk size of `200GB`. - Also, make sure to adjust the `--tensor-parallel-size` and `--num-shard` parameters in the YAML configuration to align - with the number of GPUs used. - For `int4`, request at least `25GB` of GPU memory. +For `fp16`, deployment of Mixtral, ensure a minimum total GPU memory of `100GB` and disk size of `200GB`. +For `int4`, request at least `25GB` of GPU memory. -=== "vLLM `fp16`" +[//]: # ( Also, make sure to adjust the `--tensor-parallel-size` and `--num-shard` parameters in the YAML configuration to align) +[//]: # ( with the number of GPUs used.) + + +=== "TGI `fp16`"
```shell - $ dstack run . -f llms/mixtral/vllm.dstack.yml --gpu "80GB:2" --disk 200GB + $ dstack run . -f llms/mixtral/tgi.dstack.yml --gpu "80GB:2" --disk 200GB ```
-=== "TGI `fp16`" +=== "TGI `int4`"
```shell - $ dstack run . -f llms/mixtral/tgi.dstack.yml --gpu "80GB:2" --disk 200GB + $ dstack run . -f llms/mixtral/tgi-gptq.dstack.yml --gpu 25GB ```
-=== "TGI `int4`" +=== "vLLM `fp16`"
```shell - $ dstack run . -f llms/mixtral/tgi-gptq.dstack.yml --gpu 25GB + $ dstack run . -f llms/mixtral/vllm.dstack.yml --gpu "80GB:2" --disk 200GB ```
-!!! info "Endpoint URL" - Once the service is deployed, its endpoint will be available at - `https://.` (using the domain set up for the gateway). - - If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command. - -[//]: # (Once the service is up, you can query it via it's OpenAI compatible endpoint:) -[//]: # (
) -[//]: # () -[//]: # (```shell) -[//]: # ($ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \) -[//]: # ( -H "Content-Type: application/json" \) -[//]: # ( -d '{) -[//]: # ( "model": "mistralai/Mixtral-8X7B-Instruct-v0.1",) -[//]: # ( "prompt": "Hello!",) -[//]: # ( "max_tokens": 25,) -[//]: # ( }') -[//]: # (```) -[//]: # () -[//]: # (
) - -[//]: # (!!! info "OpenAI-compatible API") -[//]: # ( Since vLLM provides an OpenAI-compatible endpoint, feel free to access it using various OpenAI-compatible tools like) -[//]: # ( Chat UI, LangChain, Llama Index, etc. ) +## Access the endpoint + +Once the service is up, you'll be able to access it at `https://.`. + +#### OpenAI interface + +In case the service has the [model mapping](../docs/concepts/services.md#model-mapping) configured, you will also be able +to access the model at `https://gateway.` via the OpenAI-compatible interface. + +```python +from openai import OpenAI -??? info "Hugging Face Hub token" +client = OpenAI( + base_url="https://gateway.example.com", + api_key="none" +) + +completion = client.chat.completions.create( + model="mistralai/Mixtral-8x7B-Instruct-v0.1", + messages=[ + {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} + ] +) + +print(completion.choices[0].message) +``` + +??? info "Hugging Face Hub token" To use a model with gated access, ensure configuring the `HUGGING_FACE_HUB_TOKEN` environment variable (with [`--env`](../docs/reference/cli/index.md#dstack-run) in `dstack run` or using [`env`](../docs/reference/dstack.yml.md#service) in the configuration file). @@ -164,11 +173,11 @@ Below are multiple variants: via vLLM (`fp16`), TGI (`fp16`), or TGI (`int4`). ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ## What's next? -1. Check the [vLLM](tgi.md) and [Text Generation Inference](tgi.md) examples +1. Check the [Text Generation Inference](tgi.md) and [vLLM](vllm.md) examples 2. Read about [services](../docs/concepts/services.md) 3. Browse [examples](index.md) 4. Join the [Discord server](https://discord.gg/u8SmfwPpMd) \ No newline at end of file diff --git a/docs/examples/sdxl.md b/docs/examples/sdxl.md index bf3991662..acc55b9e0 100644 --- a/docs/examples/sdxl.md +++ b/docs/examples/sdxl.md @@ -204,13 +204,9 @@ $ dstack run . -f stable-diffusion-xl/api.dstack.yml
-!!! info "Endpoint URL"
-    Once the service is deployed, its endpoint will be available at
-    `https://<run name>.<gateway domain>` (using the domain set up for the gateway).
-
-    If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command.
-
-Once the service is up, you can query the endpoint:
+## Access the endpoint
+
+Once the service is up, you can query it at
+`https://<run name>.<gateway domain>` (using the domain set up for the gateway):
@@ -224,7 +220,7 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/generate \ ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ## What's next? diff --git a/docs/examples/spot.md b/docs/examples/spot.md index 05b433664..57d9320b8 100644 --- a/docs/examples/spot.md +++ b/docs/examples/spot.md @@ -4,7 +4,7 @@ Cloud instances come in three types: `reserved` (for long-term commitments at a more expensive), and `spot` (cheapest, provided when available, but can be taken away when requested by someone else). There are three cloud providers that offer spot instances: AWS, GCP, and Azure. -Once you've [configured](../docs/config/server.md) any of these, you can use spot instances +Once you've [configured](../docs/installation/index.md) any of these, you can use spot instances for [dev environments](../docs/concepts/dev-environments.md), [tasks](../docs/concepts/tasks.md), and [services](../docs/concepts/services.md). diff --git a/docs/examples/tei.md b/docs/examples/tei.md index a306a20e9..55a48da6c 100644 --- a/docs/examples/tei.md +++ b/docs/examples/tei.md @@ -7,21 +7,17 @@ This example demonstrates how to use [TEI](https://github.com/huggingface/text-e To deploy a text embeddings model as a service using TEI, define the following configuration file: -
+
```yaml type: service image: ghcr.io/huggingface/text-embeddings-inference:latest - env: - MODEL_ID=thenlper/gte-base - -port: 8000 - commands: - - text-embeddings-router --hostname 0.0.0.0 --port 8000 - + - text-embeddings-router --port 80 +port: 80 ```
@@ -35,23 +31,20 @@ commands:
```shell -$ dstack run . -f text-embeddings-inference/embeddings.dstack.yml --gpu 24GB +$ dstack run . -f deployment/tae/serve.dstack.yml --gpu 24GB ```
-!!! info "Endpoint URL"
-    Once the service is deployed, its endpoint will be available at
-    `https://<run name>.<gateway domain>` (using the domain set up for the gateway).
-
-    If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command.
-
-Once the service is up, you can query it:
+## Access the endpoint
+
+Once the service is up, you can query it at
+`https://<run name>.<gateway domain>` (using the domain set up for the gateway):
```shell -$ curl https://yellow-cat-1.mydomain.com \ +$ curl https://yellow-cat-1.example.com \ -X POST \ -H 'Content-Type: application/json' \ -d '{"inputs":"What is Deep Learning?"}' @@ -100,7 +93,7 @@ $ curl https://yellow-cat-1.mydomain.com \ [//]: # () [//]: # (# Specify your service url) -[//]: # (EMBEDDINGS_URL = "https://tall-octopus-1.mydomain.com") +[//]: # (EMBEDDINGS_URL = "https://tall-octopus-1.example.com") [//]: # () [//]: # (embedding=HuggingFaceInferenceAPIEmbeddings() @@ -231,5 +224,13 @@ $ curl https://yellow-cat-1.mydomain.com \ [//]: # ( you'll have to split your texts into batches and add them to vector store via `vectorstore.add_texts()`.) -!!! info "Source code" - The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). \ No newline at end of file +## Source code + +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). + +## What's next? + +1. Check the [Text Generation Inference](tgi.md) and [vLLM](vllm.md) examples +2. Read about [services](../docs/concepts/services.md) +3. Browse all [examples](index.md) +4. Join the [Discord server](https://discord.gg/u8SmfwPpMd) \ No newline at end of file diff --git a/docs/examples/tgi.md b/docs/examples/tgi.md index d715470a6..854a485ce 100644 --- a/docs/examples/tgi.md +++ b/docs/examples/tgi.md @@ -12,18 +12,26 @@ To deploy an LLM as a service using TGI, you have to define the following config type: service image: ghcr.io/huggingface/text-generation-inference:latest - env: - - MODEL_ID=NousResearch/Llama-2-7b-hf - -port: 8000 - -commands: - - text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code + - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1 +port: 80 +commands: + - text-generation-launcher --port 80 --trust-remote-code + +# Optional mapping for OpenAI interface +model: + type: chat + name: mistralai/Mistral-7B-Instruct-v0.1 + format: tgi ```
+!!! info "Model mapping"
+    Note the `model` property is optional and is only required
+    if you're running a chat model and want to access it via an OpenAI-compatible endpoint.
+    For more details on how to use this feature, check the documentation on [services](../docs/concepts/services.md).
+
 ## Run the configuration
 
 !!! warning "Gateway"
@@ -38,29 +46,46 @@ $ dstack run . -f text-generation-inference/serve.dstack.yml --gpu 24GB
-!!! info "Endpoint URL" - Once the service is deployed, its endpoint will be available at - `https://.` (using the domain set up for the gateway). - - If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command. +### Access the endpoint -Once the service is up, you can query it: +Once the service is up, you'll be able to +access it at `https://.`.
```shell -$ curl -X POST --location https://yellow-cat-1.mydomain.com/generate \ - -H 'Content-Type: application/json' \ - -d '{ - "inputs": "What is Deep Learning?", - "parameters": { - "max_new_tokens": 20 - } - }' +$ curl https://yellow-cat-1.example.com/generate \ + -X POST \ + -d '{"inputs":"<s>[INST] What is your favourite condiment?[/INST]"}' \ + -H 'Content-Type: application/json' ```
+#### OpenAI interface + +Because we've configured the model mapping, it will also be possible +to access the model at `https://gateway.` via the OpenAI-compatible interface. + +```python +from openai import OpenAI + + +client = OpenAI( + base_url="https://gateway.example.com", + api_key="none" +) + +completion = client.chat.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.1", + messages=[ + {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."} + ] +) + +print(completion.choices[0].message) +``` + !!! info "Hugging Face Hub token" To use a model with gated access, ensure configuring the `HUGGING_FACE_HUB_TOKEN` environment variable @@ -76,13 +101,9 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/generate \ ```
-### Quantization +## Quantization -An LLM typically requires twice the GPU memory compared to its parameter count. For instance, a model with `13B` parameters -needs around `26GB` of GPU memory. To decrease memory usage and fit the model on a smaller GPU, consider using -quantization, which TGI offers as `bitsandbytes` and `gptq` methods. - -Here's an example of the Llama 2 13B model tailored for a `24GB` GPU (A10 or L4): +Here's an example of using TGI with quantization:
@@ -90,26 +111,25 @@ Here's an example of the Llama 2 13B model tailored for a `24GB` GPU (A10 or L4) type: service image: ghcr.io/huggingface/text-generation-inference:latest - env: - - MODEL_ID=TheBloke/Llama-2-13B-GPTQ - -port: 8000 - -commands: - - text-generation-launcher --hostname 0.0.0.0 --port 8000 --trust-remote-code --quantize gptq + - MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ +port: 80 +commands: + - text-generation-launcher --port 80 --trust-remote-code --quantize gptq + +model: + type: chat + name: TheBloke/Llama-2-13B-chat-GPTQ + format: tgi + chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<>\\n' + system_message + '\\n<>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' }}{% endif %}{% endfor %}" + eos_token: "" ```
-A similar approach allows running the Llama 2 70B model on an `40GB` GPU (A100). - -To calculate the exact GPU memory required for a specific model with different quantization methods, you can use the -[hf-accelerate/memory-model-usage](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) Space. - ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ## What's next? diff --git a/docs/examples/vllm.md b/docs/examples/vllm.md index ce3048f44..1806740ef 100644 --- a/docs/examples/vllm.md +++ b/docs/examples/vllm.md @@ -12,15 +12,12 @@ To deploy an LLM as a service using vLLM, you have to define the following confi type: service python: "3.11" - env: - MODEL=NousResearch/Llama-2-7b-hf - -port: 8000 - commands: - pip install vllm - python -m vllm.entrypoints.openai.api_server --model $MODEL --port 8000 +port: 8000 ```
@@ -39,18 +36,15 @@ $ dstack run . -f vllm/serve.dstack.yml --gpu 24GB
-!!! info "Endpoint URL"
-    Once the service is deployed, its endpoint will be available at
-    `https://<run name>.<gateway domain>` (using the domain set up for the gateway).
-
-    If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command.
+## Access the endpoint
 
-Once the service is up, you can query it:
+Once the service is up, you can query it at
+`https://<run name>.<gateway domain>` (using the domain set up for the gateway):
```shell -$ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \ +$ curl -X POST --location https://yellow-cat-1.example.com/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "NousResearch/Llama-2-7b-hf", @@ -77,7 +71,7 @@ $ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \ ## Source code -The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). +The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples). ## What's next? diff --git a/docs/overrides/home.html b/docs/overrides/home.html index 3dfd30549..11c5aeedd 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -253,7 +253,7 @@

Deployment

- +
diff --git a/docs/overrides/landing.html b/docs/overrides/landing.html index 6fe6f5b3e..b0d857b20 100644 --- a/docs/overrides/landing.html +++ b/docs/overrides/landing.html @@ -11,7 +11,7 @@ {% endblock %} {% block announce %} -🔥 dstack 0.14.0rc1 is here! Deploy custom LLMs via OpenAI-compatible endpoints! Learn more. +🔥 dstack 0.14.0 is here! Deploy custom LLMs via OpenAI-compatible endpoints! Learn more. {% endblock %} {% block footer %} diff --git a/docs/overrides/main.html b/docs/overrides/main.html index acba75d42..fd352c170 100644 --- a/docs/overrides/main.html +++ b/docs/overrides/main.html @@ -24,5 +24,5 @@ {% endblock %} {% block announce %} -🔥 dstack 0.14.0rc1 is here! Deploy custom LLMs via OpenAI-compatible endpoints! Learn more. +🔥 dstack 0.14.0 is here! Deploy custom LLMs via OpenAI-compatible endpoints! Learn more. {% endblock %} \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index 5b0dd717f..8073c72a8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -81,7 +81,6 @@ plugins: 'examples/finetuning-llama-2.md': 'examples/qlora.md' 'examples/text-generation-inference.md': 'examples/tgi.md' 'examples/stable-diffusion-xl.md': 'examples/sdxl.md' - 'examples/vllm.md': 'examples/vllm.md' 'learn/mixtral.md': 'examples/mixtral.md' 'learn/tei.md': 'examples/tei.md' 'learn/llama-index.md': 'examples/llama-index.md' @@ -189,14 +188,14 @@ nav: - Fine-tuning: - QLoRA: examples/qlora.md - Deployment: - - vLLM: examples/vllm.md - Text Generation Inference: examples/tgi.md + - vLLM: examples/vllm.md - Text Embedding Interface: examples/tei.md - SDXL: examples/sdxl.md - RAG: - Llama Index: examples/llama-index.md + - Discord: https://discord.gg/u8SmfwPpMd # - Blog: # - blog/index.md - - Discord: https://discord.gg/u8SmfwPpMd - Platform: platform.md - GitHub: https://github.com/dstackai/dstack \ No newline at end of file diff --git a/src/dstack/_internal/core/models/configurations.py b/src/dstack/_internal/core/models/configurations.py index 9db283b96..34e3dc06d 100644 --- a/src/dstack/_internal/core/models/configurations.py +++ b/src/dstack/_internal/core/models/configurations.py @@ -79,6 +79,17 @@ class Artifact(ForbidExtra): class ModelInfo(ForbidExtra): + """ + Mapping of the model for the OpenAI-compatible endpoint. + + Attributes: + type (str): The type of the model, e.g. "chat" + name (str): The name of the model. This name will be used both to load model configuration from the HuggingFace Hub and in the OpenAI-compatible endpoint. + format (str): The format of the model, e.g. "tgi" if the model is served with HuggingFace's Text Generation Inference. + chat_template (Optional[str]): The custom prompt template for the model. If not specified, the default prompt template the HuggingFace Hub configuration will be used. + eos_token (Optional[str]): The custom end of sentence token. If not specified, the default custom end of sentence token from the HuggingFace Hub configuration will be used. + """ + type: Annotated[Literal["chat"], Field(description="The type of the model")] name: Annotated[str, Field(description="The name of the model")] format: Annotated[Literal["tgi"], Field(description="The serving format")] @@ -184,6 +195,7 @@ class ServiceConfiguration(BaseConfiguration): registry_auth (Optional[RegistryAuth]): Credentials for pulling a private Docker image home_dir (str): The absolute path to the home directory inside the container. Defaults to `/root`. resources (Optional[Resources]): The requirements to run the configuration. 
+ model (Optional[ModelMapping]): Mapping of the model for the OpenAI-compatible endpoint. """ type: Literal["service"] = "service" @@ -193,7 +205,8 @@ class ServiceConfiguration(BaseConfiguration): Field(description="The port, that application listens to or the mapping"), ] model: Annotated[ - Optional[ModelInfo], Field(description="The model info for OpenAI interface") + Optional[ModelInfo], + Field(description="Mapping of the model for the OpenAI-compatible endpoint"), ] = None @validator("port") diff --git a/src/dstack/api/__init__.py b/src/dstack/api/__init__.py index 5a76f2c9d..50ee3b551 100644 --- a/src/dstack/api/__init__.py +++ b/src/dstack/api/__init__.py @@ -1,5 +1,6 @@ from dstack._internal.core.errors import ClientError from dstack._internal.core.models.backends.base import BackendType +from dstack._internal.core.models.configurations import ModelInfo as _ModelInfo from dstack._internal.core.models.configurations import RegistryAuth from dstack._internal.core.models.configurations import ( ServiceConfiguration as _ServiceConfiguration, @@ -16,5 +17,6 @@ from dstack.api._public.huggingface.finetuning.sft import FineTuningTask from dstack.api._public.runs import Run, RunStatus +ModelMapping = _ModelInfo Service = _ServiceConfiguration Task = _TaskConfiguration
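With `ModelMapping` re-exported from `dstack.api`, a service with model mapping can also be declared from Python. The sketch below mirrors the TGI configuration used in the documentation above; it is illustrative only — field names follow the configuration model in this change, and the exact accepted types may differ:

```python
from dstack.api import ModelMapping, Service

# Illustrative sketch: the same TGI service as in the YAML examples above,
# expressed with the Python API.
service = Service(
    image="ghcr.io/huggingface/text-generation-inference:latest",
    env=["MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1"],
    port=80,
    commands=["text-generation-launcher --port 80 --trust-remote-code"],
    model=ModelMapping(
        type="chat",
        name="mistralai/Mistral-7B-Instruct-v0.1",
        format="tgi",
    ),
)
```

The resulting configuration can then be submitted through the `dstack.api` client, as in the Python deployment example (`docs/examples/deploy-python.md`).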