From 25e1c8cfd203099e61856f6e660d0e08fd480f56 Mon Sep 17 00:00:00 2001
From: peterschmidt85
Date: Fri, 6 Sep 2024 20:00:01 +0200
Subject: [PATCH] [Docs] Added the Troubleshooting guide

[Docs] Minor edits
---
 docs/docs/guides/troubleshooting.md | 114 ++++++++++++++++++++++++++++
 docs/docs/installation/index.md     |   4 +
 docs/docs/quickstart.md             |   4 +
 docs/overrides/examples.html        |   2 +-
 mkdocs.yml                          |   1 +
 5 files changed, 124 insertions(+), 1 deletion(-)
 create mode 100644 docs/docs/guides/troubleshooting.md

diff --git a/docs/docs/guides/troubleshooting.md b/docs/docs/guides/troubleshooting.md
new file mode 100644
index 000000000..8ac35d05d
--- /dev/null
+++ b/docs/docs/guides/troubleshooting.md
@@ -0,0 +1,114 @@
+# Troubleshooting
+
+## Reporting issues
+
+When you encounter a problem and need help, it's essential to report it as a [GitHub issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/new/choose){:target="_blank"}.
+Please avoid bringing up the issue on Discord before reporting it.
+
+!!! warning "Steps to reproduce"
+    Make sure to provide clear, detailed steps to reproduce the issue. This will allow us to troubleshoot it and request any
+    additional information. Include server logs, CLI outputs, and configuration samples.
+    Avoid using screenshots for logs or errors; use text instead.
+
+    See these examples for well-reported issues: [this :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1640){:target="_blank"}
+    and [this :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1551){:target="_blank"}.
+
+## Typical issues
+
+### Provisioning fails
+
+In certain cases, running `dstack apply` may produce the following output:
+
+```shell
+wet-mangust-1 provisioning completed (failed)
+All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details.
+```
+
+#### Backend configuration
+
+If runs consistently fail to provision due to insufficient capacity, it’s likely there is a backend configuration issue.
+Ensure that your backends are configured correctly and check the server logs for any errors.
+
+#### Service quotas
+
+If some runs fail to provision, it may be due to an insufficient service quota. For cloud providers like AWS, GCP,
+Azure, and OCI, you often need to request an increased [service quota](protips.md#service-quotas) before you can use
+specific instances.
+
+#### Resources
+
+Another possible cause of the insufficient capacity error is that `dstack` cannot find an instance that meets the
+requirements specified in `resources`.
+
+??? info "GPU"
+    The `gpu` property allows you to specify the GPU name, memory, and quantity. Examples include `A100` (one GPU), `A100:40GB`
+    (one GPU with exactly 40GB of memory), `A100:4` (four GPUs), etc. If you specify a GPU name without a quantity, it defaults to `1`.
+
+    If you request one GPU but only instances with eight GPUs are available, `dstack` won’t be able to provide it. Use range
+    syntax to specify a range, such as `A100:1..8` (one to eight GPUs) or `A100:1..` (one or more GPUs).
+
+??? info "Disk"
+    If you don't specify the `disk` property, `dstack` defaults it to `100GB`.
+    If no instance with that disk size is available, `dstack` won’t be able to provide it.
+    Use range syntax to specify a range, such as `50GB..100GB` (from 50 GB to 100 GB) or `50GB..`
+    (50 GB or more).
+
+### Run fails
+
+There could be several reasons for a run failing after successful provisioning.
+
+!!! info "Termination reason"
+    To find out why, use `-v` (stands for `--verbose`) with `dstack ps`.
+    This will show the run's status and any failure reasons.
+
+#### Spot interruption
+
+If a run fails after provisioning with the termination reason `INTERRUPTED_BY_NO_CAPACITY`, it is likely that the run
+was using spot instances and was interrupted. To address this, you can either set the
+[`spot_policy`](../reference/dstack.yml/task.md#spot_policy) to `on-demand` or specify the
+[`retry`](../reference/dstack.yml/task.md#retry) property.
+
+[//]: # (#### Other)
+[//]: # (TODO: Explain how to get the shim logs)
+
+### Can't run a service
+
+#### Gateway configuration
+
+The most common reasons a service fails to start are that you haven’t [created a gateway](../concepts/gateways.md) or haven’t set up the
+correct DNS record pointing to the gateway's hostname.
+
+### Service endpoint doesn't work
+
+#### Authorization
+
+If the service endpoint returns a 403 error, it is likely because the [`Authorization`](../services.md#access-the-endpoint)
+header with the correct `dstack` token was not provided.
+
+#### On-prem fleets
+
+If you attempt to run a service on an on-prem fleet, it won't work due to a [known issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1640){:target="_blank"} that is expected to be fixed soon.
+
+[//]: # (#### Other)
+[//]: # (TODO: Explain how to get the gateway logs)
+
+### Cannot access a dev environment or a task's ports
+
+When running a dev environment or task with configured ports, `dstack apply`
+automatically forwards remote ports to `localhost` via SSH for easy and secure access.
+If you interrupt the command, the port forwarding will be disconnected. To reattach, use `dstack logs --attach `dstack apply` automatically uploads the code from the current repo, including your local uncommitted changes.
+## Troubleshooting
+
+Something not working? Make sure to check out the [troubleshooting](guides/troubleshooting.md) guide.
+
 ## What's next?
 
 1. Read about [dev environments](dev-environments.md), [tasks](tasks.md),
diff --git a/docs/overrides/examples.html b/docs/overrides/examples.html
index 5b7640279..45821fb42 100644
--- a/docs/overrides/examples.html
+++ b/docs/overrides/examples.html
@@ -89,7 +89,7 @@

- Learn how to deploy and fine-tune LLMs on Google TPU.
+ Learn how to deploy and fine-tune LLMs on TPU.

diff --git a/mkdocs.yml b/mkdocs.yml
index 59b3b03ff..6395789a5 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -211,6 +211,7 @@ nav:
       - Volumes: docs/concepts/volumes.md
       - Projects: docs/concepts/projects.md
   - Guides:
+      - Troubleshooting: docs/guides/troubleshooting.md
       - Protips: docs/guides/protips.md
   - Reference:
       - server/config.yml: docs/reference/server/config.yml.md