diff --git a/docs/learn/mixtral.md b/docs/learn/mixtral.md new file mode 100644 index 000000000..6ccc29b42 --- /dev/null +++ b/docs/learn/mixtral.md @@ -0,0 +1,102 @@ +# Mixtral + +This example demonstrates how to deploy `mistralai/Mixtral-8x7B-Instruct-v0.1` +with `dstack`'s [services](../docs/guides/services.md) and [vLLM](https://vllm.ai/). + +## Define the configuration + +To deploy Mixtral as a service using vLLM, define the following configuration file: + +<div
+ +```yaml +type: service + +python: "3.11" + +commands: +  - conda install cuda # (required by megablocks) +  - pip install torch # (required by megablocks) +  - pip install vllm megablocks +  - python -m vllm.entrypoints.openai.api_server +    --model mistralai/Mixtral-8x7B-Instruct-v0.1 +    --host 0.0.0.0 +    --tensor-parallel-size 2 # should match the number of GPUs + +port: 8000 +``` + +
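+
+The `--tensor-parallel-size` value should match the number of GPUs the service runs on. For example, if you run the
+service on four GPUs instead of two, the `commands` section would look like this (a sketch; only the last flag changes):
+
+```yaml
+commands:
+  - conda install cuda # (required by megablocks)
+  - pip install torch # (required by megablocks)
+  - pip install vllm megablocks
+  - python -m vllm.entrypoints.openai.api_server
+    --model mistralai/Mixtral-8x7B-Instruct-v0.1
+    --host 0.0.0.0
+    --tensor-parallel-size 4 # should match the number of GPUs
+```
+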
+ +## Run the configuration + +!!! warning "Prerequisites" + Before running a service, make sure to set up a [gateway](../docs/guides/services.md#set-up-a-gateway). + However, it's not required when using dstack Cloud, as it's set up automatically. + +
+ +```shell +$ dstack run . -f llms/mixtral.dstack.yml --gpu "80GB:2" --disk 200GB +``` + +
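+
+If you want the endpoint URL to stay the same across deployments, you can optionally pass a custom run name with the
+`-n` argument (a sketch; the name `mixtral` here is only an example):
+
+```shell
+$ dstack run . -f llms/mixtral.dstack.yml -n mixtral --gpu "80GB:2" --disk 200GB
+```
+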
+ +!!! info "GPU memory" +    To deploy Mixtral in `fp16`, ensure a minimum of `100GB` total GPU memory, +    and adjust the `--tensor-parallel-size` parameter in the YAML configuration +    to match the number of GPUs. + +!!! info "Disk size" +    To deploy Mixtral, ensure a minimum of `200GB` of disk space. + +!!! info "Endpoint URL" +    Once the service is deployed, its endpoint will be available at +    `https://<run name>.<gateway domain>` (using the domain set up for the gateway). + +    If you wish to customize the run name, you can use the `-n` argument with the `dstack run` command. + +Once the service is up, you can query it via its OpenAI-compatible endpoint: + +<div
+ +```shell +$ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/completions \ +    -H "Content-Type: application/json" \ +    -d '{ +        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1", +        "prompt": "San Francisco is a", +        "max_tokens": 7, +        "temperature": 0 +      }' +``` + +
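+
+Since `Mixtral-8x7B-Instruct-v0.1` is an instruction-tuned model, you may prefer the chat endpoint. A sketch, assuming
+the same run name and domain as above and a vLLM build that serves `/v1/chat/completions`:
+
+```shell
+$ curl -X POST --location https://yellow-cat-1.mydomain.com/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
+        "messages": [{"role": "user", "content": "What is San Francisco known for?"}],
+        "max_tokens": 128
+      }'
+```
+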
+ +!!! info "OpenAI-compatible API" +    Since vLLM provides an OpenAI-compatible endpoint, you can also access it with any OpenAI-compatible tool, +    such as Chat UI, LangChain, or LlamaIndex. + +??? info "Hugging Face Hub token" + +    To use a model with gated access, make sure to configure the `HUGGING_FACE_HUB_TOKEN` environment variable +    (with [`--env`](../docs/reference/cli/index.md#dstack-run) in `dstack run` or +    using [`env`](../docs/reference/dstack.yml.md#service) in the configuration file). + +
+ + ```shell + $ dstack run . --env HUGGING_FACE_HUB_TOKEN=<token> -f llms/mixtral.dstack.yml --gpu "80GB:2" --disk 200GB + ``` +
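+
+    Alternatively, you can set the token in the configuration file itself by adding an `env` section to the YAML above.
+    A minimal sketch; the exact list syntax is an assumption, so check the
+    [`env`](../docs/reference/dstack.yml.md#service) reference:
+
+    ```yaml
+    env:
+      - HUGGING_FACE_HUB_TOKEN=<token> # assumed syntax; the token can also be passed with --env at run time
+    ```
+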
+ +## Source code + +The complete, ready-to-run code is available in [dstackai/dstack-examples](https://github.com/dstackai/dstack-examples). + +## What's next? + +1. Check the [vLLM](vllm.md) and [Text Generation Inference](tgi.md) examples +2. Read about [services](../docs/guides/services.md) +3. See all [learning materials](index.md) +4. Join the [Discord server](https://discord.gg/u8SmfwPpMd) \ No newline at end of file diff --git a/docs/learn/tei.md b/docs/learn/tei.md index d55d90dcc..f86b15f37 100644 --- a/docs/learn/tei.md +++ b/docs/learn/tei.md @@ -1,7 +1,7 @@ # Text Embeddings Inference -This example demonstrates how to deploy a text embeddings model as an API using [Services](../docs/guides/services.md) -and [TEI](https://github.com/huggingface/text-embeddings-inference), an open-source framework by Hugging Face. +This example demonstrates how to use [TEI](https://github.com/huggingface/text-embeddings-inference) with `dstack`'s +[services](../docs/guides/services.md) to deploy text embedding models. ## Define the configuration diff --git a/docs/learn/tgi.md b/docs/learn/tgi.md index 18a4c5f85..dd3576090 100644 --- a/docs/learn/tgi.md +++ b/docs/learn/tgi.md @@ -1,6 +1,6 @@ # Text Generation Inference -This example demonstrates how to deploy an LLM using [TGI](https://github.com/huggingface/text-generation-inference), an open-source framework by Hugging Face. +This example demonstrates how to use [TGI](https://github.com/huggingface/text-generation-inference) with `dstack`'s [services](../docs/guides/services.md) to deploy LLMs. ## Define the configuration diff --git a/docs/learn/vllm.md b/docs/learn/vllm.md index ad19ff678..d9e13f745 100644 --- a/docs/learn/vllm.md +++ b/docs/learn/vllm.md @@ -1,6 +1,6 @@ # vLLM -This example demonstrates how to deploy an LLM using [Services](../docs/guides/services.md) and [vLLM](https://vllm.ai/), an open-source library. +This example demonstrates how to use [vLLM](https://vllm.ai/) with `dstack`'s [services](../docs/guides/services.md) to deploy LLMs. ## Define the configuration diff --git a/docs/overrides/home.html b/docs/overrides/home.html index 62de6c831..df6b1a628 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html

Featured examples