[Docs] Added the Troubleshooting guide
[Docs] Minor edits
peterschmidt85 committed Sep 6, 2024
1 parent f654022 commit 25e1c8c
Showing 5 changed files with 124 additions and 1 deletion.
114 changes: 114 additions & 0 deletions docs/docs/guides/troubleshooting.md
@@ -0,0 +1,114 @@
# Troubleshooting

## Reporting issues

When you encounter a problem and need help, it's essential to report it as a [GitHub issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/new/choose){:target="_blank"}.
Please avoid bringing up the issue on Discord before reporting it.

!!! warning "Steps to reproduce"
    Make sure to provide clear, detailed steps to reproduce the issue. This will allow us to troubleshoot it and request any
    additional information. Include server logs, CLI outputs, and configuration samples.
    Avoid using screenshots for logs or errors; use text instead.

See these examples for well-reported issues: [this :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1640){:target="_blank"}
and [this :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1551){:target="_blank"}.

## Typical issues

### Provisioning fails

In certain cases, running `dstack apply` may produce the following output:

```shell
wet-mangust-1 provisioning completed (failed)
All provisioning attempts failed. This is likely due to cloud providers not having enough capacity. Check CLI and server logs for more details.
```

#### Backend configuration

If runs consistently fail to provision due to insufficient capacity, it’s likely there is a backend configuration issue.
Ensure that your backends are configured correctly and check the server logs for any errors.

#### Service quotas

If some runs fail to provision, it may be due to an insufficient service quota. For cloud providers like AWS, GCP,
Azure, and OCI, you often need to request an increased [service quota](protips.md#service-quotas) before you can use
specific instances.

#### Resources

Another possible cause of the insufficient capacity error is that `dstack` cannot find an instance that meets the
requirements specified in `resources`.

??? info "GPU"
    The `gpu` property allows you to specify the GPU name, memory, and quantity. Examples include `A100` (one GPU),
    `A100:40GB` (one GPU with exact memory), `A100:4` (four GPUs), etc. If you specify a GPU name without a quantity,
    it defaults to `1`.

    If you request one GPU but only instances with eight GPUs are available, `dstack` won’t be able to provide it. Use range
    syntax to specify a range, such as `A100:1..8` (one to eight GPUs) or `A100:1..` (one or more GPUs).

??? info "Disk"
    If you don't specify the `disk` property, `dstack` defaults it to `100GB`.
    If no instance with that disk size is available, `dstack` won’t be able to provision the run.
    Use range syntax to specify a range, such as `50GB..100GB` (from 50GB to 100GB) or `50GB..`
    (50GB or more).
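
As a rough illustration of the range syntax described above, a task configuration might relax both `gpu` and `disk` as follows. This is a minimal sketch: the `commands` entry is a placeholder, and only the `resources` values are taken from the examples in this guide.

```yaml
type: task

# Placeholder workload; replace with your own commands.
commands:
  - nvidia-smi

resources:
  # One to eight A100 GPUs instead of exactly one.
  gpu: A100:1..8
  # At least 50GB of disk instead of the 100GB default.
  disk: 50GB..
```

Wider ranges give `dstack` more instance types to choose from, which should make the insufficient capacity error less likely.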

### Run fails

There could be several reasons for a run failing after successful provisioning.

!!! info "Termination reason"
    To find out why a run failed, pass `-v` (short for `--verbose`) to `dstack ps`.
    This will show the run's status and any failure reasons.

#### Spot interruption

If a run fails after provisioning with the termination reason `INTERRUPTED_BY_NO_CAPACITY`, it is likely that the run
was using spot instances and was interrupted. To address this, you can either set the
[`spot_policy`](../reference/dstack.yml/task.md#spot_policy) to `on-demand` or specify the
[`retry`](../reference/dstack.yml/task.md#retry) property.
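
The two options might look like this in a task configuration. This is a sketch rather than a verified configuration: the workload is hypothetical, and the `on_events` and `duration` keys under `retry` are assumptions about its schema, so check the linked reference for the exact fields.

```yaml
type: task

# Hypothetical workload; replace with your own commands.
commands:
  - python train.py

# Option 1: avoid spot instances entirely.
spot_policy: on-demand

# Option 2 (alternative): keep spot instances and retry after an interruption.
# The keys below are assumed, not confirmed by this guide.
# retry:
#   on_events: [interruption]
#   duration: 1h
```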

[//]: # (#### Other)
[//]: # (TODO: Explain how to get the shim logs)

### Can't run a service

#### Gateway configuration

The most common reason a service fails to start is that you either haven’t [created a gateway](../concepts/gateways.md) or haven’t set up the
correct DNS record pointing to the gateway's hostname.
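
For orientation, a gateway configuration might look roughly like the sketch below. It assumes an AWS backend; the name, region, and domain are hypothetical placeholders, and the field names should be checked against the gateways documentation linked above.

```yaml
type: gateway
name: example-gateway   # hypothetical name
backend: aws            # assumed backend
region: eu-west-1       # assumed region
# The DNS record for this domain must point to the gateway's hostname.
domain: example.com
```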

### Service endpoint doesn't work

#### Authorization

If the service endpoint returns a 403 error, it is likely because the [`Authorization`](../services.md#access-the-endpoint)
header with the correct `dstack` token was not provided.

#### On-prem fleets

If you attempt to run a service on an on-prem fleet, it won't work due to a [known issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1640){:target="_blank"} that is expected to be fixed soon.

[//]: # (#### Other)
[//]: # (TODO: Explain how to get the gateway logs)

### Cannot access a dev environment or a task's ports

When running a dev environment or task with configured ports, `dstack apply`
automatically forwards remote ports to `localhost` via SSH for easy and secure access.
If you interrupt the command, the port forwarding will be disconnected. To reattach, use `dstack logs --attach <run name>`.
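
For context, a task with a configured port might look like the sketch below. The command and port number are placeholders chosen for illustration.

```yaml
type: task

# Placeholder server listening on port 8000.
commands:
  - python3 -m http.server 8000

ports:
  # Forwarded to localhost:8000 while the CLI is attached to the run.
  - 8000
```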

#### Windows

If you're using the CLI on Windows, make sure to run it through WSL by following [these instructions :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1644#issuecomment-2321559265){:target="_blank"}.
Native support will be available soon.

### An on-prem fleet doesn't provision

If you set up an on-prem fleet and it fails to provision after a long wait, first check the server logs.
Also, review the `/root/.dstack/shim.log` file on each host used to create the fleet.

## Questions

!!! info "Community"
    If you have a question, please feel free to ask it in our [Discord server](https://discord.gg/u8SmfwPpMd).
4 changes: 4 additions & 0 deletions docs/docs/installation/index.md
@@ -84,6 +84,10 @@ Configuration is updated at ~/.dstack/config.yml

This configuration is stored in `~/.dstack/config.yml`.

??? info "Windows"
    If you're using the CLI on Windows, make sure to run it through WSL by following [these instructions :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1644#issuecomment-2321559265){:target="_blank"}.
    Native support will be available soon.

## Create on-prem fleets

If you want the `dstack` server to run containers on your on-prem clusters,
4 changes: 4 additions & 0 deletions docs/docs/quickstart.md
@@ -224,6 +224,10 @@ Your folder can be a regular local folder or a Git repo.

> `dstack apply` automatically uploads the code from the current repo, including your local uncommitted changes.

## Troubleshooting

Something not working? Make sure to check out the [troubleshooting](guides/troubleshooting.md) guide.

## What's next?

1. Read about [dev environments](dev-environments.md), [tasks](tasks.md),
2 changes: 1 addition & 1 deletion docs/overrides/examples.html
@@ -89,7 +89,7 @@ <h3>
</h3>

<p>
Learn how to deploy and fine-tune LLMs on Google TPU.
Learn how to deploy and fine-tune LLMs on TPU.
</p>
</a>
</div>
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -211,6 +211,7 @@ nav:
- Volumes: docs/concepts/volumes.md
- Projects: docs/concepts/projects.md
- Guides:
- Troubleshooting: docs/guides/troubleshooting.md
- Protips: docs/guides/protips.md
- Reference:
- server/config.yml: docs/reference/server/config.yml.md