-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip injection of nvidia-persistenced socket by default #694
Conversation
e1cfb49
to
f44e31f
Compare
|
||
// Opt out of the fabricmanager and persistenced sockets if these are not | ||
// explicitly enabled. | ||
if !hook.Features.AllowFabricmanager.IsEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, containers depending on this socket (e.g. NCCL library) will error out after this update by default? This will impact multi-node training Jobs etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the scope of this PR to only nvidia-persistenced
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Let's do just persistenced for now (not fabricmanager)
- Let's change the feature-gate name to
include-persistenced
notallow-persistenced
d517584
to
fa0ec9d
Compare
This changes skips the injection of the nvidia-persistenced socket by default. An include-persistenced-socket feature flag is added to allow the injection of this socket to be explicitly requested. Signed-off-by: Evan Lezar <elezar@nvidia.com>
fa0ec9d
to
d2deb77
Compare
This change ensures that the internal CDI representation includes the persistenced socket if the include-persistenced-socket feature flag is enabled. Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an include-persistenced-socket flag to the nvidia-ctk cdi generate command that ensures that a generated specification includes the nvidia-persistenced socket if present on the host. Note that for mangement mode, these sockets are always included if detected. Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables opt-in (off-by-default) features to be opted into. These features can be toggled by name by specifying the (repeated) --opt-in-feature command line argument or as a comma-separated list in the NVIDIA_CONTAINER_TOOLKIT_OPT_IN_FEATURES environment variable. Signed-off-by: Evan Lezar <elezar@nvidia.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not remove the fabric-manager socket as part of this PR.
baaaab3
to
9c2476c
Compare
addressed and re-reviewed by @cdesiniotis
This changes skips the injection of the nvidia-persistenced socket by default.
An include-persistenced-socket feature flag is added to allow the
injection of this socket to be explicitly requested.
See NVIDIA/libnvidia-container#283