Mitigate concurrent dstack attach issues
#1816
Merged
Don't use SO_REUSEADDR. Although it can be helpful, since it allows binding a socket to a port that is still in the TIME_WAIT state (this is the reason it was added in the first place), it unfortunately has other, unwanted effects.
These other effects increase the discrepancy between Linux and BSD (incl. macOS).
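As a rough illustration of the first fix, the sketch below claims a local port by binding a plain TCP socket without setting SO_REUSEADDR, so a second bind to the same port fails. The helper name `try_lock_port` and the loopback default are assumptions made for this example; this is not the actual PortsLock implementation.

```python
import socket
from typing import Optional


def try_lock_port(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Hypothetical helper: bind to the port and return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR is deliberately not set, so bind() raises OSError
        # when another socket already holds this address and port.
        sock.bind((host, port))
    except OSError:
        sock.close()
        return None
    return sock
```

The caller keeps the returned socket open for as long as it wants to hold the port, and closes it to release the port.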
Run ssh with ExitOnForwardFailure=yes. With this option, ssh exits with an error if it cannot bind() to the requested ports.
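For illustration, an ssh invocation with this option could be assembled as in the sketch below. The function name `build_tunnel_command`, the use of `-N`, and the `localhost` forwarding target are assumptions for the example, not necessarily the command dstack builds.

```python
from typing import List


def build_tunnel_command(host: str, ports: List[int]) -> List[str]:
    """Hypothetical helper: build an ssh argv that forwards each local port to the same remote port."""
    # -N: do not run a remote command, only forward ports.
    # ExitOnForwardFailure=yes: exit instead of continuing when a forward cannot be set up.
    cmd = ["ssh", "-N", "-o", "ExitOnForwardFailure=yes"]
    for port in ports:
        cmd += ["-L", f"{port}:localhost:{port}"]
    cmd.append(host)
    return cmd
```

Passing the resulting list to `subprocess.run` would then surface a failed local bind as a non-zero exit code rather than a tunnel that silently lacks a forward.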
This is not an ideal solution. Because of the way Run.attach() works, it's still possible for two clients to PortsLock.acquire() the same local port due to a race condition: if one client has just released the lock but ssh hasn't yet established the tunnel, another client can acquire the same port in between. With the second fix applied, however, that client will eventually fail, as it will not be able to establish the ssh tunnel in 10 attempts.
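The bounded-retry behaviour described above can be sketched roughly as follows. The names `open_tunnel_with_retries` and `start_tunnel` are hypothetical, and the callable stands in for whatever actually launches ssh; only the limit of 10 attempts comes from the description.

```python
import time
from typing import Callable


def open_tunnel_with_retries(
    start_tunnel: Callable[[], bool], attempts: int = 10, delay: float = 0.0
) -> bool:
    """Hypothetical helper: call start_tunnel up to `attempts` times, stopping at the first success."""
    for attempt in range(attempts):
        if start_tunnel():
            return True
        # Wait between attempts, but not after the final one.
        if delay and attempt < attempts - 1:
            time.sleep(delay)
    # Every attempt failed, e.g. because another client's ssh already holds the port.
    return False
```

With ExitOnForwardFailure=yes, each failed bind makes ssh exit, so a client that lost the race exhausts its attempts and reports an error instead of appearing attached.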
It would be better to acquire PortsLock as soon as possible (without waiting for RunStatus.RUNNING), but that requires refactoring, including of the public Python API.
Fixes: #1814