- [Docs] Updated the 0.14.0rc1 blog post (changed it to 0.14.0); also updated the banner

- [Docs] Added the OpenAI model mapping instructions into the documentation and examples (WIP)
peterschmidt85 committed Jan 25, 2024
1 parent 8f0fd42 commit 5b48bdf
Showing 19 changed files with 349 additions and 238 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -29,7 +29,7 @@ Supported providers: AWS, GCP, Azure, Lambda, TensorDock, Vast.ai, and DataCrunch

## Latest news ✨

- [2024/01] [dstack 0.14.0rc1: OpenAI-compatible endpoints](https://dstack.ai/blog/2024/01/19/openai-endpoints-preview/) (Preview)
- [2024/01] [dstack 0.14.0: OpenAI-compatible endpoints preview](https://dstack.ai/blog/2024/01/19/openai-endpoints-preview/) (Release)
- [2023/12] [dstack 0.13.0: Disk size, CUDA 12.1, Mixtral, and more](https://dstack.ai/blog/2023/12/22/disk-size-cuda-12-1-mixtral-and-more/) (Release)
- [2023/11] [dstack 0.12.3: Vast.ai integration](https://dstack.ai/blog/2023/11/21/vastai/) (Release)
- [2023/10] [dstack 0.12.2: TensorDock integration](https://dstack.ai/blog/2023/10/31/tensordock/) (Release)
41 changes: 14 additions & 27 deletions docs/blog/posts/openai-endpoints-preview.md
@@ -1,24 +1,23 @@
---
title: "dstack 0.14.0rc1: OpenAI-compatible endpoints preview"
title: "dstack 0.14.0: OpenAI-compatible endpoints preview"
date: 2024-01-19
description: "Making it easier to deploy custom LLMs as OpenAI-compatible endpoints."
slug: "openai-endpoints-preview"
categories:
- Previews
- Releases
---

# dstack 0.14.0rc1: OpenAI-compatible endpoints preview
# dstack 0.14.0: OpenAI-compatible endpoints preview

__Making it easier to deploy custom LLMs as OpenAI-compatible endpoints.__

The `service` configuration deploys any application as a public endpoint. For instance, you can use HuggingFace's
[TGI](https://github.com/huggingface/text-generation-inference) or frameworks to deploy
custom LLMs. While this is simple and customizable, using different frameworks and LLMs complicates
the integration of LLMs.
[TGI](https://github.com/huggingface/text-generation-inference) or other frameworks to deploy custom LLMs.
While this is simple and customizable, using different frameworks and LLMs complicates the integration of LLMs.

<!-- more -->

With the upcoming `dstack 0.14.0`, we are extending the `service` configuration in `dstack` to enable you to optionally map your
With `dstack 0.14.0`, we are extending the `service` configuration in `dstack` to enable you to optionally map your
custom LLM to an OpenAI-compatible endpoint.

Here's how it works: you define a `service` (as before) and include the `model` property with
@@ -42,7 +41,7 @@ model:
format: tgi
```
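
For context, a complete configuration of this kind (the example from the updated [services](../../docs/concepts/services.md) documentation) looks like this:

```yaml
type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
port: 80
commands:
  - text-generation-launcher --port 80 --trust-remote-code
model:
  type: chat
  name: mistralai/Mistral-7B-Instruct-v0.1
  format: tgi
```
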
When you deploy this service using `dstack run`, `dstack` will automatically publish the OpenAI-compatible endpoint,
When you deploy the service using `dstack run`, `dstack` will automatically publish the OpenAI-compatible endpoint,
converting the prompt and response format between your LLM and OpenAI interface.

@@ -63,32 +62,20 @@

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com",
    api_key="none"
)

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
    ]
)

print(completion.choices[0].message)
```

!!! info "NOTE:"
By default, dstack loads the model's `chat_template` and `eos_token` from Hugging Face. However, you can override them using
the corresponding properties under `model`.
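
Overriding them can look roughly like this (the values below are illustrative; the [services](../../docs/concepts/services.md) documentation shows a full working template):

```yaml
model:
  type: chat
  name: TheBloke/Llama-2-13B-chat-GPTQ
  format: tgi
  # Both overrides are optional; the template here is shortened for brevity
  chat_template: "{% for message in messages %}{{ message['content'] }}{% endfor %}"
  eos_token: "</s>"
```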

Here's a live demo of how it works:

<img src="https://raw.githubusercontent.com/dstackai/static-assets/main/static-assets/images/dstack-openai-python.gif" style="width:100%; max-width: 800px;" />

## Try the preview

To try the preview of this new upcoming feature, make sure to install `0.14.0rc1` and restart your server.

```shell
pip install "dstack[all]==0.14.0rc1"
```
For more details on how to use the new feature, be sure to check the updated documentation on [services](../../docs/concepts/services.md),
and the [TGI](../../examples/tgi.md) example.

## Migration guide

Note: In order to use the new feature, it's important to delete your existing gateway (if any)
using `dstack gateway delete` and then create it again with `dstack gateway create`.

## Why does this matter?

With `dstack`, you can train and deploy models using any cloud providers, easily leveraging GPU availability across
providers, spot instances, multiple regions, and more.

Note: After you update to `0.14.0`, it's important to delete your existing gateways (if any)
using `dstack gateway delete` and create them again with `dstack gateway create`.
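
A minimal sketch of the migration steps (your gateway may require the same options, such as backend and domain, that you used when you first created it):

```shell
dstack gateway delete   # remove the existing gateway(s) created with the previous version
dstack gateway create   # re-create the gateway with the options you originally used
```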

## Feedback

Do you have any questions or need assistance? Feel free to join our [Discord server](https://discord.gg/u8SmfwPpMd).
In case you have any questions, experience bugs, or need help,
drop us a message on our [Discord server](https://discord.gg/u8SmfwPpMd) or submit it as a
[GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).
4 changes: 2 additions & 2 deletions docs/blog/posts/simplified-cloud-setup.md
@@ -39,7 +39,7 @@ projects:
</div>
Regions and other settings are optional. Learn more on what credential types are supported
via [Clouds](../../docs/config/server.md).
via [Clouds](../../docs/installation/index.md).
## Enhanced API
@@ -97,7 +97,7 @@ This means you'll need to delete `~/.dstack` and configure `dstack` from scratch

1. `pip install "dstack[all]==0.12.0"`
2. Delete `~/.dstack`
3. Configure clouds via `~/.dstack/server/config.yml` (see the [new guide](../../docs/config/server.md))
3. Configure clouds via `~/.dstack/server/config.yml` (see the [new guide](../../docs/installation/index.md))
4. Run `dstack server`

The [documentation](../../docs/index.md) and [examples](../../examples/index.md) are updated.
145 changes: 119 additions & 26 deletions docs/docs/concepts/services.md
@@ -1,9 +1,7 @@
# Services

With `dstack`, you can use the CLI or API to deploy models or web apps.
Provide the commands, port, and choose the Python version or a Docker image.

`dstack` handles the deployment on configured cloud GPU provider(s) with the necessary resources.
Services make it easy to deploy models and apps as public endpoints, allowing you to use any
framework.

??? info "Prerequisites"

@@ -31,14 +29,15 @@ Provide the commands, port, and choose the Python version or a Docker image.
Afterward, in your domain's DNS settings, add an `A` DNS record for `*.example.com`
pointing to the IP address of the gateway.

This way, if you run a service, `dstack` will make its endpoint available at
`https://<run-name>.example.com`.
Now, if you run a service, `dstack` will make its endpoint available at
`https://<run name>.<gateway domain>`.

If you're using the cloud version of `dstack`, the gateway is set up for you.
In case your service has the [model mapping](#model-mapping) configured, `dstack` will
automatically make your model available at `https://gateway.<gateway domain>` via the OpenAI-compatible interface.

## Using the CLI
If you're using the cloud version of `dstack`, the gateway is set up for you.

### Define a configuration
## Define a configuration

First, create a YAML file in your project folder. Its name must end with `.dstack.yml` (e.g. `.dstack.yml` or `train.dstack.yml`
are both acceptable).
@@ -49,27 +48,86 @@

<div editor-title="serve.dstack.yml">

```yaml
type: service

image: ghcr.io/huggingface/text-generation-inference:latest
env:
  - MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
port: 80
commands:
  - text-generation-launcher --port 80 --trust-remote-code
```

</div>

The `image` property is optional. If not specified, `dstack` uses its own Docker image,
pre-configured with Python, Conda, and essential CUDA drivers.

If you run such a configuration, once the service is up, you'll be able to
access it at `https://<run name>.<gateway domain>` (see how to [set up a gateway](#set-up-a-gateway)).

!!! info "Configuration options"
    Configuration file allows you to specify a custom Docker image, environment variables, and many other
    options. For more details, refer to the [Reference](../reference/dstack.yml.md#service).

### Model mapping

If your service is running a model, you can configure the model mapping to be able to access it via the
OpenAI-compatible interface.

<div editor-title="serve.dstack.yml">

```yaml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
- MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
port: 80
commands:
- text-generation-launcher --hostname 0.0.0.0 --port 80 --trust-remote-code
- text-generation-launcher --port 80 --trust-remote-code
model:
type: chat
name: mistralai/Mistral-7B-Instruct-v0.1
format: tgi
```

</div>

By default, `dstack` uses its own Docker images to run dev environments,
which are pre-configured with Python, Conda, and essential CUDA drivers.
In this case, with such a configuration, once the service is up, you'll be able to access the model at
`https://gateway.<gateway domain>` via the OpenAI-compatible interface.

!!! info "Configuration options"
Configuration file allows you to specify a custom Docker image, environment variables, and many other
options.
For more details, refer to the [Reference](../reference/dstack.yml.md#service).
#### Chat template

By default, `dstack` loads the [chat template](https://huggingface.co/docs/transformers/main/en/chat_templating)
from the model's repository. If it is not present there, manual configuration is required.

```yaml
type: service
image: ghcr.io/huggingface/text-generation-inference:latest
env:
- MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ
port: 80
commands:
- text-generation-launcher --port 80 --trust-remote-code --quantize gptq
model:
type: chat
name: TheBloke/Llama-2-13B-chat-GPTQ
format: tgi
chat_template: "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ '<s>[INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' </s>' }}{% endif %}{% endfor %}"
eos_token: "</s>"
```

??? info "Limitations"
Note that model mapping is an experimental feature, and it has the following limitations:

1. Doesn't work if your `chat_template` uses `bos_token`. As a workaround, replace `bos_token` inside `chat_template` with the token content itself.
2. Doesn't work if `eos_token` is defined in the model repository as a dictionary. As a workaround, set `eos_token` manually, as shown in the example above (see Chat template).
3. Only works if you're using Text Generation Inference. Support for vLLM and other serving frameworks is coming later.

If you encounter any other issues, please make sure to file a [GitHub issue](https://github.com/dstackai/dstack/issues/new/choose).
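
As an illustration of the workaround for the first limitation above (the template fragment is hypothetical, not taken from a real model repository): inline the token content, for example `<s>` for Llama 2, in place of `bos_token`:

```yaml
# Before (not supported by model mapping, since the template references bos_token):
#   chat_template: "{{ bos_token }}{% for message in messages %}{{ message['content'] }}{% endfor %}"
# After (the token content is inlined):
chat_template: "<s>{% for message in messages %}{{ message['content'] }}{% endfor %}"
```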

## Run the configuration

To run a configuration, use the `dstack run` command followed by the working directory path,
configuration file path, and any other options (e.g., for requesting hardware resources).
@@ -89,22 +147,57 @@ Continue? [y/n]: y
Provisioning...
---> 100%
Serving HTTP on https://yellow-cat-1.example.com ...
Service is published at https://yellow-cat-1.example.com
```

</div>

Once the service is deployed, its endpoint will be available at
`https://<run-name>.<domain-name>` (using the domain [set up for the gateway](#set-up-a-gateway)).

!!! info "Run options"
The `dstack run` command allows you to use `--gpu` to request GPUs (e.g. `--gpu A100` or `--gpu 80GB` or `--gpu A100:4`, etc.),
and many other options (incl. spot instances, disk size, max price, max duration, retry policy, etc.).
For more details, refer to the [Reference](../reference/cli/index.md#dstack-run).

[//]: # (TODO: Example)
### Service endpoint

Once the service is up, you'll be able to
access it at `https://<run name>.<gateway domain>`.

<div class="termy">

```shell
$ curl https://yellow-cat-1.example.com/generate \
-X POST \
-d '{"inputs":"<s>[INST] What is your favourite condiment?[/INST]"}' \
-H 'Content-Type: application/json'
```

</div>

#### OpenAI interface

In case the service has the [model mapping](#model-mapping) configured, you will also be able
to access the model at `https://gateway.<gateway domain>` via the OpenAI-compatible interface.

```python
from openai import OpenAI
client = OpenAI(
base_url="https://gateway.example.com",
api_key="none"
)
completion = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.1",
messages=[
{"role": "user", "content": "Compose a poem that explains the concept of recursion in programming."}
]
)
print(completion.choices[0].message)
```

What's next?
## What's next?

1. Check the [Text Generation Inference](../../examples/tgi.md) and [vLLM](../../examples/vllm.md) examples
2. Read about [dev environments](../concepts/dev-environments.md)
27 changes: 12 additions & 15 deletions docs/docs/quickstart.md
@@ -29,8 +29,7 @@ or `train.dstack.yml` are both acceptable).
```yaml
type: dev-environment

python: "3.11" # (Optional) If not specified, your local version is used

python: "3.11"
ide: vscode
```

@@ -45,16 +44,17 @@ or `train.dstack.yml` are both acceptable).
```yaml
type: task

python: "3.11" # (Optional) If not specified, your local version is used

python: "3.11"
env:
- HF_HUB_ENABLE_HF_TRANSFER=1
commands:
- pip install -r requirements.txt
- python train.py
- pip install -r fine-tuning/qlora/requirements.txt
- python fine-tuning/qlora/train.py
```

</div>

Ensure `requirements.txt` and `train.py` are in your folder; you can take them from our [`examples`](https://github.com/dstackai/dstack-examples/tree/main/fine-tuning/qlora).
Ensure `requirements.txt` and `train.py` are in your folder. You can take them from [`dstack-examples`](https://github.com/dstackai/dstack-examples/tree/main/fine-tuning/qlora).

=== "Service"

@@ -66,22 +66,19 @@ or `train.dstack.yml` are both acceptable).
type: service

image: ghcr.io/huggingface/text-generation-inference:latest

env:
- MODEL_ID=TheBloke/Llama-2-13B-chat-GPTQ

env:
- MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1
port: 80

commands:
- text-generation-launcher --hostname 0.0.0.0 --port 80 --trust-remote-code
- text-generation-launcher --port 80 --trust-remote-code
```

</div>

## Run configuration

Run a configuration using the [`dstack run`](reference/cli/index.md#dstack-run) command, followed by the working directory path (e.g., `.`), the path to the
configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.)
configuration file, and run options (e.g., configuring hardware resources, spot policy, etc.)

<div class="termy">

2 changes: 1 addition & 1 deletion docs/examples/deploy-python.md
@@ -139,7 +139,7 @@ run.refresh()

## Source code

The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples).
The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples).

```shell
git clone https://github.com/dstackai/dstack-examples
4 changes: 2 additions & 2 deletions docs/examples/llama-index.md
@@ -98,7 +98,7 @@ The data is in the vector database! Now we can proceed with the part where we in
This example assumes we're using an LLM deployed using [TGI](tgi.md).

Once you've deployed the model, make sure to set the `TGI_ENDPOINT_URL` environment variable
to its URL, e.g. `https://<run-name>.<domain-name>` (or `http://localhost:<port>` if it's deployed
to its URL, e.g. `https://<run name>.<gateway domain>` (or `http://localhost:<port>` if it's deployed
as a task). We'll use this environment variable below.
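
For instance (the run name and domain below are placeholders):

```shell
export TGI_ENDPOINT_URL=https://yellow-cat-1.example.com
```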

<div class="termy">
@@ -214,7 +214,7 @@ using `dstack`. For more in-depth information, we encourage you to explore the d

## Source code

The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples).
The complete, ready-to-run code is available in [`dstackai/dstack-examples`](https://github.com/dstackai/dstack-examples).

## What's next?
