Releases: dstackai/dstack
0.18.17
On-prem AMD GPU support
`dstack` now supports SSH fleets with AMD GPUs. Hosts must be pre-installed with Docker and the AMDGPU-DKMS kernel driver (e.g. via the native package manager or the AMDGPU installer).
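An AMD SSH fleet is defined like any other SSH fleet; here is a minimal sketch (the host address, user, and key path are placeholders to adapt to your environment):

```yaml
type: fleet
name: my-amd-fleet

# Hosts must already have Docker and the AMDGPU-DKMS driver installed.
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 192.168.100.10
```

Apply it with `dstack apply -f fleet.yaml`.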
Elastic Fabric Adapter support
`dstack` now automatically enables AWS EFA if it is supported by the instance type; no extra configuration is needed. The following EFA-enabled instance types are supported: `p5.48xlarge`, `p4d.24xlarge`, `g4dn.12xlarge`, `g4dn.16xlarge`, `g4dn.8xlarge`, `g4dn.metal`, `g5.12xlarge`, `g5.16xlarge`, `g5.24xlarge`, `g5.48xlarge`, `g5.8xlarge`, `g6.12xlarge`, `g6.16xlarge`, `g6.24xlarge`, `g6.48xlarge`, `g6.8xlarge`, `gr6.8xlarge`.
Improved apply plan
Previously, `dstack apply` showed a plan only for run configurations. Now it shows a plan for all configuration types, including fleets, volumes, and gateways. Here's a fleet plan showing configuration parameters and the offers that will be tried for provisioning:
✗ dstack apply -f .dstack/confs/fleet.yaml
Project main
User admin
Configuration .dstack/confs/fleet.yaml
Type fleet
Fleet type cloud
Nodes 2
Placement cluster
Backends aws
Resources 2..xCPU, 8GB.., 100GB.. (disk)
Spot policy on-demand
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 aws eu-west-1 m5.large 2xCPU, 8GB, 100.0GB (disk) no $0.107
2 aws eu-central-1 m5.large 2xCPU, 8GB, 100.0GB (disk) no $0.115
3 aws eu-west-1 c5.xlarge 4xCPU, 8GB, 100.0GB (disk) no $0.192
...
Shown 3 of 82 offers, $40.9447 max
Fleet my-cluster-fleet does not exist yet.
Create the fleet? [y/n]:
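For reference, a plan like the one above could be produced by a fleet configuration along these lines (a sketch reconstructed from the plan output; the exact field spellings are assumptions):

```yaml
type: fleet
name: my-cluster-fleet

nodes: 2
placement: cluster
backends: [aws]

resources:
  cpu: 2..
  memory: 8GB..
  disk: 100GB..
```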
Volumes UI
Server administrators and regular users can now see volumes in the UI.
What's Changed
- Dstack version on UI by @olgenn in #1742
- Fix restarting gateway connections by @jvstme in #1746
- Fix Handle KeyboardInterrupt in CLI when getting run plan #1626 by @IshuSinghSE in #1756
- Add AMD support on on-prem fleets by @un-def in #1754
- Implement fleet apply plan by @r4victor in #1765
- chore: update provisioning.py by @eltociear in #1768
- Fix use all available runpod regions by default by @IshuSinghSE in #1757
- Implement apply plan for gateways and volumes by @r4victor in #1774
- Fix connection to ssh instance on non-standard ssh port by @un-def in #1766
- Fix docker SSH commands by @un-def in #1771
- Add Llama3.2 Vision Model Example by @Bihan in #1770
- Disable backend autoconfig via default creds by @r4victor in #1778
- Set backends requests timeouts by @r4victor in #1793
- Add UI for volumes #1683 by @olgenn in #1785
- UI for volumes 1683 by @olgenn in #1795
- [Docs] Add AMD GPU info to ssh fleets section by @un-def in #1779
- [shim] Use DockerRootDir to detect free disk space by @un-def in #1802
- Add AWS EFA support by @un-def in #1801
New Contributors
- @IshuSinghSE made their first contribution in #1756
- @eltociear made their first contribution in #1768
Full Changelog: 0.18.16...0.18.17
0.18.16
New versioning policy
Starting with this release, `dstack` adopts a new versioning policy to provide better server and client backward compatibility and improve the upgrade experience. `dstack` continues to follow the semver scheme (`{major}.{minor}.{patch}`) with the following principles:
- Server backward compatibility is maintained across all minor and patch releases. Specific features may be removed, but removal is preceded by deprecation warnings over several minor releases. This means you can use older client versions with newer server versions.
- Client backward compatibility is maintained across patch releases; a new minor release indicates a break in client backward compatibility. This means you don't need to update the server when updating the client to a new patch release, but upgrading the client to a new minor version requires upgrading the server too.
Previously, `dstack` never guaranteed client backward compatibility, so you always had to update the server when updating the client. The new versioning policy makes client and server upgrades more flexible.
Note: The new policy only takes effect once both the clients and the server are upgraded to 0.18.16. The 0.18.15 server still won't work with newer clients.
dstack attach
The CLI gets a new `dstack attach` command that allows attaching to a run. It establishes the SSH tunnel, forwards ports, and streams run logs in real time:
✗ dstack attach silent-panther-1
Attached to run silent-panther-1 (replica=0 job=0)
Forwarded ports (local -> remote):
- localhost:7860 -> 7860
To connect to the run via SSH, use `ssh silent-panther-1`.
Press Ctrl+C to detach...
This command is a replacement for `dstack logs --attach` with major improvements and bugfixes.
CloudWatch-related bugfixes
This release includes several important bugfixes for `CloudWatchLogStorage`. We strongly recommend upgrading the `dstack` server if it's configured to store logs in CloudWatch.
Deprecations
`dstack logs --attach` is deprecated in favor of `dstack attach` and may be removed in a following minor release.
What's Changed
- Check client-server compatibility according to new versioning policy by @r4victor in #1730
- [runner] fix MonotonicTimestamp by @un-def in #1728
- Gateway-in-server early prototype by @jvstme in #1718
- Implement dstack attach command by @r4victor in #1733
- Respect CloudWatch timestamp constraints by @un-def in #1732
- Add AMD examples with vLLM, Axolotl and Trl by @Bihan in #1693
- dstack-proxy naming tweaks by @jvstme in #1734
- Fix Failed to attach via Python API by @r4victor in #1739
- Support calling RunCollection.get_plan() without repo by @r4victor in #1741
Full Changelog: 0.18.15...0.18.16
0.18.15
Cluster placement groups
Instances of AWS cluster fleets are now provisioned into cluster placement groups for better connectivity. For example, when you create this fleet:
type: fleet
name: my-cluster-fleet
nodes: 4
placement: cluster
backends: [aws]
`dstack` will automatically create a cluster placement group and use it to provision the instances.
On-prem and VM-based fleets improvements
- All available Nvidia driver capabilities are now requested by default, which makes it possible to run GPU workloads requiring OpenGL/Vulkan/RT/Video Codec SDK libraries. (#1714)
- Automatic container cleanup. Previously, when the run completed, either successfully or due to an error, its container was not deleted, which led to ever-increasing storage consumption. Now, only the last stopped container is preserved and is available until the next run is completed. (#1706)
Major bug fixes
- Fixed a bug where under some conditions logs wouldn't be uploaded to CloudWatch Logs due to size limits. (#1712)
- Fixed a bug that prevented running services on on-prem instances. (#1716)
Changelog
- Fix cli connection issue with TPU by @Bihan in #1705
- Rename `--default` to `--yes` and `--no-default` to `--no` in `dstack config` and `dstack server` by @peterschmidt85 in #1709
- [CI] Fix shim/runner release versions by @un-def in #1704
- Document run diagnostic logs by @r4victor in #1710
- [shim] Add old container cleanup routine by @un-def in #1706
- Write events to CloudWatch in batches by @un-def in #1712
- [shim] Request all Nvidia driver capabilities by @un-def in #1714
- Added showing dstack version on the UI by @olgenn in #1717
- Add missing project SSH key to on-prem instances by @un-def in #1716
- Simplify handling missing `GatewayConfiguration` by @jvstme in #1724
- [shim] Fix container logs processing by @un-def in #1721
- Support AWS placement groups for cluster fleets by @r4victor in #1725
Full Changelog: 0.18.14...0.18.15
0.18.15rc1
On-prem and VM-based fleets improvements
- All available Nvidia driver capabilities are now requested by default, which makes it possible to run GPU workloads requiring OpenGL/Vulkan/RT/Video Codec SDK libraries.
- Automatic container cleanup. Previously, when the run completed, either successfully or due to an error, its container was not deleted, which led to ever-increasing storage consumption. Now, only the last stopped container is preserved and is available until the next run is completed.
Major bug fixes
- Fixed a bug where under some conditions logs wouldn't be uploaded to CloudWatch Logs due to size limits.
Changelog
- [UX] Rename `--default` to `--yes` and `--no-default` to `--no` in `dstack config` and `dstack server` by @peterschmidt85 in #1709
- Fix cli connection issue with TPU by @Bihan in #1705
- Fix `dstack-shim` and `dstack-runner` release versions by @un-def in #1704
- Request all Nvidia driver capabilities by @un-def in #1714
- Add old container cleanup routine by @un-def in #1706
- Write events to CloudWatch in batches by @un-def in #1712
- [Docs] Document run diagnostic logs by @r4victor in #1710
- [Docs] Added the server deployment guide, updated the `README.md` for Docker Hub, fixed the scrolling issue by @peterschmidt85
Full changelog: 0.18.14...0.18.15rc1
0.18.14
Multi-replica server deployment
Previously, the `dstack` server only supported deploying a single instance (replica). With 0.18.14, you can now deploy multiple replicas, enabling high availability and zero-downtime updates.
Note
Multi-replica server deployment requires using Postgres instead of the default SQLite. To configure Postgres, set the `DSTACK_DATABASE_URL` environment variable.
Make sure to update to version 0.18.14 before configuring multiple replicas.
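A minimal sketch of pointing the server at Postgres (the connection string values are placeholders; adjust host, credentials, and database name for your deployment):

```shell
export DSTACK_DATABASE_URL="postgresql+asyncpg://dstack:password@db.example.com:5432/dstack"
dstack server
```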
Major bug-fixes
- [Bugfix] `dstack init --git-identity` doesn't accept backslashes in path on Windows by @un-def in #1686
- [Bugfix] Use `--tmpfs /dev/shm:rw,nosuid,nodev,exec,size=X` instead of `--shm-size=X` by @un-def in #1690
- [Bugfix] `dstack-shim` is not updated when fleet is recreated by @un-def in #1698
Other
- [Bugfix] Fix `SSHAttach.reuse_ports_lock()` when no grep matches by @un-def in #1700
- [Bugfix] Fix logger exception on instance provisioning timeout by @un-def in #1697
- [Internal] Add `JobProvisioningData.base_backend` by @r4victor in #1682
- [Internal] Add `Run.error` by @r4victor in #1684
- [Internal] Return server_version in `/api/server/get_info` by @r4victor in #1685
- [Internal] Allow gateway to connect to replicated server by @jvstme in #1688
- [Internal] Adjust gateway management for multiple server replicas by @r4victor in #1691
- [Internal] Skip gateway update if gateway was updated recently by @r4victor in #1695
- [Internal] Remove redundant `logger.error` by @r4victor in #1702
Full changelog: 0.18.13...0.18.14
0.18.13
Windows
You can now use the CLI on Windows (WSL 2 is not required).
Ensure that Git and OpenSSH are installed via Git for Windows. During installation, select the `Git from the command line and also from 3rd-party software` (or `Use Git and optional Unix tools from the Command Prompt`) and `Use bundled OpenSSH` checkboxes.
Spot policy
Previously, dev environments used the `on-demand` spot policy, while tasks and services used `auto`. With this update, the default spot policy is `on-demand` for all configuration types. To use spot instances, you now need to specify the spot policy explicitly.
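If a configuration relied on the old default, the previous behavior can be restored by setting the policy explicitly; a minimal sketch (the task name and command are placeholders):

```yaml
type: task
name: train

commands:
  - python train.py

spot_policy: auto  # or `spot` to require spot instances
```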
Troubleshooting
The documentation now includes a Troubleshooting guide with instructions on how to report issues.
Changelog
- [UX] Add Windows support by @un-def in #1675
- [UX] Changed the default `spot_policy` to `on-demand` by @r4victor in #1657 and #1660
- [UI] Minor UI improvements by @olgenn in #1658
- [UX] Check SSH keys during SSH fleet creation, before submission by @r4victor in #1661
- [Docs] Add TPU examples with Optimum TPU and vLLM by @Bihan in #1663
- [Troubleshooting] Do not auto-delete failed instances by @r4victor in #1665
- [Docs] Document SQLite to Postgres migration by @r4victor in #1678
- [Internal] Implement Postgres locking by @r4victor in #1651
- [Internal] Refactor `SSHTunnel` by @jvstme in #1669
- [Internal] Replace `String` with `Text` for long database columns by @r4victor in #1677
- [Internal] Take advisory lock on server init by @r4victor in #1674
All commits: 0.18.12...0.18.13
0.18.12
Major bugfixes
- Fixed the order of CloudWatch log events in the web interface by @un-def in #1613
- Fixed a bug where CloudWatch log events might not be displayed in the web interface for old runs by @un-def in #1652
- Prevent possible server freeze on SSH connections by @jvstme in #1627
Other changes
- [CLI] Show run name before detaching by @jvstme in #1607
- Increase time waiting for OCI Bare Metal instances by @jvstme in #1630
- Update lambda regions by @r4victor in #1634
- Change CloudWatch group check method by @un-def in #1615
- Add Postgres tests by @r4victor in #1628
- Fix lambda tests by @r4victor in #1635
- [Docs] Fixed a bug where search included non-existing pages that led to 404s by @peterschmidt85 in #1646
- [Docs] Introduce the Providers page by @peterschmidt85 in #1653
- [Docs] Update RunPod & DataCrunch setup guides by @jvstme in #1608
- [Docs] Add information about run log storage by @un-def in #1621
- [Internal] Update packer templates docs by @jvstme in #1619
Full changelog: 0.18.11...0.18.12
0.18.12rc1
Major bugfixes
- Fixed the order of CloudWatch log events in the web interface by @un-def in #1613
- Fixed a bug where CloudWatch log events might not be displayed in the web interface for old runs by @un-def in #1652
- Prevent possible server freeze on SSH connections by @jvstme in #1627
Other changes
- [CLI] Show run name before detaching by @jvstme in #1607
- Increase time waiting for OCI Bare Metal instances by @jvstme in #1630
- Update lambda regions by @r4victor in #1634
- Change CloudWatch group check method by @un-def in #1615
- Add Postgres tests by @r4victor in #1628
- Fix lambda tests by @r4victor in #1635
- [Docs] Fixed a bug where search included non-existing pages that led to 404s by @peterschmidt85 in #1646
- [Docs] Introduce the Providers page by @peterschmidt85 in #1653
- [Docs] Update RunPod & DataCrunch setup guides by @jvstme in #1608
- [Docs] Add information about run log storage by @un-def in #1621
- [Internal] Update packer templates docs by @jvstme in #1619
Full changelog: 0.18.11...0.18.12rc1
0.18.11
AMD
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example:
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - TRUST_REMOTE_CODE=true
  - ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
  - text-generation-launcher --port 8000
port: 8000
resources:
  gpu: MI300X
  disk: 150GB
spot_policy: auto
model:
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
  format: openai
Note
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
GPU vendors
The `gpu` property now accepts a `vendor` attribute, with supported values `nvidia`, `tpu`, and `amd`.

Alternatively, you can prefix the GPU name with the vendor name followed by a colon, for example `tpu:v2-8` or `amd:192GB`. This change keeps GPU requirement configuration consistent across vendors.
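The two spellings express the same requirement; a sketch of both forms (the memory size is an arbitrary example):

```yaml
resources:
  gpu:
    vendor: amd
    memory: 192GB
```

or, equivalently, the shorthand:

```yaml
resources:
  gpu: amd:192GB
```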
Encryption
`dstack` now supports encryption of sensitive data, such as backend credentials and user tokens. Learn more on the reference page.
Storing logs in AWS CloudWatch
By default, the `dstack` server stores run logs in `~/.dstack/server/projects/<project name>/logs`. To store logs in AWS CloudWatch, set the `DSTACK_SERVER_CLOUDWATCH_LOG_GROUP` environment variable.
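A minimal sketch (the log group name is a placeholder; the group must already exist, and the server needs CloudWatch Logs permissions):

```shell
export DSTACK_SERVER_CLOUDWATCH_LOG_GROUP=/dstack/server-logs
dstack server
```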
Project manager role
With this update, it's now possible to assign any user as a project manager. This role grants permission to manage project users but does not allow management of backends or resources.
Default permissions
By default, all users can create and manage their own projects. If you want only global admins to create projects, add the following to `~/.dstack/server/config.yml`:
default_permissions:
  allow_non_admins_create_projects: false
Other
- [Feature] Allow to store logs in AWS CloudWatch by @un-def in #1597
- [Feature] Introduce default permissions #1559 by @olgenn in #1567
- [Feature] Support the `vendor` property under `resources.gpu` by @un-def in #1558
- [Feature] Implement configurable default permissions by @r4victor in #1591
- [Bugfix] Provision AWS instances in all eligible availability zones by @r4victor in #1585
- [Bugfix] Support users without projects @olgenn in #1578
- [UI] Support the `manager` project role by @olgenn in #1566
- [Docs] Mention AMD GPUs, describe the `gpu.vendor` property by @un-def in #1570
- [Bugfix] Fix global admin restricted by manager role by @r4victor in #1592
- [Bugfix] Fixed defect with incorrect setting project role in the UI by @olgenn in #1593
- [Bugfix] Abort provisioning fleet when parsing ssh key fails (#1442) by @swsvc in #1589
- [UI] Ensure users can create projects #191 by @olgenn in #1554
- [UI] Use a toggle button switching themes #190 by @olgenn in #1556
- [UI] Fix the Logs component appearance for the dark theme by @olgenn in #1579
- [UI] Minor restyle of the side navigation by @olgenn in #1580
- [Bugfix] Avoid TGI error `logit_bias: invalid type` by @jvstme in #1557
- [Docs] Document projects #1547 by @peterschmidt85 in #1548
- [Docs] Document AMD support on RunPod by @peterschmidt85 in #1598
- [Internal] Approximate on-prem GPU memory size by @jvstme in #1588
- [Docs] Fix some of the broken links by @jvstme in #1602
- [Docs] Fix broken links in README.md by @jvstme in #1604
- [Docs] Document configuring logs storage in AWS CloudWatch @un-def in #1606
- [Docs] Publish the blog post and examples about AMD on RunPod by @peterschmidt85 in #1598
- [Internal] Force `root` in Kubernetes runs by @jvstme in #1555
- [Internal] Improve gateway auth issues troubleshooting by @jvstme in #1569
- [Feature] Implement "encryption at rest" by @r4victor in #1561
- [Feature] Implement the project `manager` role by @r4victor in #1572
- [Feature] Implement user activation/deactivation by @r4victor in #1575
- [Internal] Reintroduce the `tpu-` prefix; add `tpu` vendor alias by @un-def in #1587
New contributors
Full changelog: 0.18.10...0.18.11
0.18.11rc1
AMD
With the latest update, you can now specify an AMD GPU under `resources`. Below is an example:
type: service
name: amd-service-tgi
image: ghcr.io/huggingface/text-generation-inference:sha-a379d55-rocm
env:
  - HUGGING_FACE_HUB_TOKEN
  - MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct
  - TRUST_REMOTE_CODE=true
  - ROCM_USE_FLASH_ATTN_V2_TRITON=true
commands:
  - text-generation-launcher --port 8000
port: 8000
resources:
  gpu: MI300X
  disk: 150GB
spot_policy: auto
model:
  type: chat
  name: meta-llama/Meta-Llama-3.1-70B-Instruct
  format: openai
Note
AMD accelerators are currently supported only with the `runpod` backend. Support for on-prem fleets and more backends is coming soon.
Other
- [Docs] Document projects #1547 by @peterschmidt85 in #1548
- [UI] Ensure users can create projects #191 by @olgenn in #1554
- [UI] Use a toggle button switching themes #190 by @olgenn in #1556
- [Bugfix] Force `root` in Kubernetes runs by @jvstme in #1555
- [Bugfix] Avoid TGI error `logit_bias: invalid type` by @jvstme in #1557
- Support the `vendor` property under `gpu` by @un-def in #1558
- [Internal] Improve gateway auth issues troubleshooting by @jvstme in #1569
- [Feature] Implement "encryption at rest" by @r4victor in #1561
- [Feature] Implement the project `manager` role by @r4victor in #1572
- [Feature] Implement user activation/deactivation by @r4victor in #1575
- [Bugfix] Support users without projects @olgenn in #1578
- [UI] Fix the Logs component appearance for the dark theme by @olgenn in #1579
- [UI] Minor restyle of the side navigation by @olgenn in #1580
- [Internal] Replace `pkg_resources` with `importlib.resources` by @r4victor in #1582
- [UI] Support the `manager` project role by @olgenn in #1566
- [Bugfix] Provision AWS instances in all eligible availability zones by @r4victor in #1585
- [Feature] Implement configurable default permissions by @r4victor in #1591
- [Internal] Reintroduce the `tpu-` prefix; add `tpu` vendor alias by @un-def in #1587
- [Docs] Mention AMD GPUs, describe the `gpu.vendor` property by @un-def in #1570
- [Bugfix] Fix global admin restricted by manager role by @r4victor in #1592
- [Bugfix] Fixed defect with incorrect setting project role in the UI by @olgenn in #1593
- [Internal] Order project members by @r4victor in #1594
- [Feature] Introduce default permissions #1559 by @olgenn in #1567
- [Bugfix] Abort provisioning fleet when parsing ssh key fails (#1442) by @swsvc in #1589
- [Feature] Add LogStorage interface, CloudWatch Logs impl by @un-def in #1597
- [Docs] Document AMD support on RunPod by @peterschmidt85 in #1598
New contributors
Full changelog: 0.18.10...0.18.11rc1