Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mitigate concurrent dstack attach issues #1816

Merged
merged 1 commit into from
Oct 11, 2024

Conversation

un-def
Copy link
Collaborator

@un-def un-def commented Oct 11, 2024

  1. Don't use SO_REUSEADDR. Although it might be helpful, as it allows to bind a socket in TIME_WAIT state (this is the reason it was added in the first place), unfortunately it has some other (unwanted) effects:

    • On Linux, it allows to bind to the same address-port pair (including wildcard), as long as this option is set before each bind().
    • On BSD, it allows to bind to a wildcard address and a specific address, or vice versa (but not the same specific address, as on Linux).

    These other effects increase discrepancy between Linux and BSD (incl. macOS).

  2. Run ssh with ExitOnForwardFailure=yes. With this option, ssh exits with error if it cannot bind() to requested ports.

This is not an ideal solution. the way Run.attach() works, it's still possible to PortsLock.acquire() the same local port due to race condition (if one client just released the lock, but ssh hasn't yet established the tunnel, another client can acquire the same port in between, but with the second fix applied it will eventually fail, as it will not be able to establish ssh tunnel in 10 attempts.

It would be better to acquire PortsLock as soon as possible (don't wait for RunStatus.RUNNING), but it requires refactoring, including the public Python API.

Fixes: #1814

1. Don't use SO_REUSEADDR. Although it might be helpful, as it allows to
   bind a socket in TIME_WAIT state (this is the reason it was added in
   the first place), unfortunately it has some other (unwanted) effects:
    * On Linux, it allows to bind to the same address-port pair
      (including wildcard), as long as this option is set before each
      bind().
    * On BSD, it allows to bind to a wildcard address and a specific
      address, or vice versa (but not the same specific address, as on
      Linux).
These other effects increase discrepancy between Linux and BSD (incl.
macOS).

2. Run ssh with ExitOnForwardFailure=yes. With this option, ssh exits
   with error if it cannot bind() to requested ports.

This is not an ideal solution. the way Run.attach() works, it's still
possible to PortsLock.acquire() the same local port due to race condition
(if one client just released the lock, but ssh hasn't yet established
the tunnel, another client can acquire the same port in between, but
with the second fix applied it will eventually fail, as it will not be
able to establish ssh tunnel in 10 attempts.

It would be better to acquire PortsLock as soon as possible (don't wait
for RunStatus.RUNNING), but it requires refactoring, including the
public Python API.

Fixes: #1814
@un-def un-def requested a review from r4victor October 11, 2024 07:15
@un-def
Copy link
Collaborator Author

un-def commented Oct 11, 2024

Here is a great post explaining all these SO_REUSEADDR quirks and features: https://stackoverflow.com/a/14388707

@un-def un-def merged commit 4972c3e into master Oct 11, 2024
23 checks passed
@un-def un-def deleted the issue_1814_mitigate_concurrent_attach_issues branch October 11, 2024 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: dstack attach --logs sometimes displays logs from another attach, and vice versa
2 participants