Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip injection of nvidia-persistenced socket by default #694

Merged
merged 4 commits into from
Sep 18, 2024

Conversation

elezar
Copy link
Member

@elezar elezar commented Sep 16, 2024

This changes skips the injection of the nvidia-persistenced socket by default.

An include-persistenced-socket feature flag is added to allow the
injection of this socket to be explicitly requested.

See NVIDIA/libnvidia-container#283

@elezar elezar changed the title Add opt in to sockets Skip injection of sockets by default Sep 16, 2024
@elezar elezar marked this pull request as ready for review September 16, 2024 13:38

// Opt out of the fabricmanager and persistenced sockets if these are not
// explicitly enabled.
if !hook.Features.AllowFabricmanager.IsEnabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, containers depending on this socket (e.g. NCCL library) will error out after this update by default? This will impact multi-node training Jobs etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the scope of this PR to only nvidia-persistenced.

Copy link
Contributor

@klueska klueska left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Let's do just persistenced for now (not fabricmanager)
  2. Let's change the feature-gate name to include-persistenced not allow-persistenced

@elezar elezar force-pushed the add-opt-in-to-sockets branch 2 times, most recently from d517584 to fa0ec9d Compare September 18, 2024 15:08
@elezar elezar changed the title Skip injection of sockets by default Skip injection of nvidia-persistenced socket by default Sep 18, 2024
@elezar elezar self-assigned this Sep 18, 2024
@elezar elezar requested a review from klueska September 18, 2024 15:09
This changes skips the injection of the nvidia-persistenced socket by
default.

An include-persistenced-socket feature flag is added to allow the
injection of this socket to be explicitly requested.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change ensures that the internal CDI representation includes
the persistenced socket if the include-persistenced-socket feature
flag is enabled.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change adds an include-persistenced-socket flag to the
nvidia-ctk cdi generate command that ensures that a generated
specification includes the nvidia-persistenced socket if present on
the host.

Note that for mangement mode, these sockets are always included
if detected.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
This change enables opt-in (off-by-default) features to be opted into.
These features can be toggled by name by specifying the (repeated)
--opt-in-feature command line argument or as a comma-separated list
in the NVIDIA_CONTAINER_TOOLKIT_OPT_IN_FEATURES environment variable.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not remove the fabric-manager socket as part of this PR.

@elezar elezar dismissed klueska’s stale review September 18, 2024 20:36

addressed and re-reviewed by @cdesiniotis

@cdesiniotis cdesiniotis merged commit b061446 into NVIDIA:main Sep 18, 2024
8 checks passed
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Sep 20, 2024
This reverts commit b061446, reversing
changes made to c490baa.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
@elezar elezar mentioned this pull request Sep 20, 2024
elezar added a commit that referenced this pull request Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants