bump NCCL floor to 2.18.1.1, include nccl.h where it's needed #4661

Merged
merged 2 commits into from
Sep 25, 2024

Conversation

jameslamb
Member

Description

Contributes to rapidsai/build-planning#102

Some RAPIDS libraries use ncclCommSplit(), which was introduced in nccl==2.18.1.1. This is one of a series of PRs across RAPIDS raising libraries' pins to nccl>=2.18.1.1 so that they always get a version new enough to provide that function.
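
For illustration only (not part of this PR): a minimal C++ sketch of how a version floor like this could be expressed as a compile-time check. It assumes the integer encoding that nccl.h uses for NCCL_VERSION_CODE (major * 10000 + minor * 100 + patch for releases after 2.8), reimplemented here so the snippet compiles without NCCL installed. The names nccl_version_code, kCommSplitMinVersion and supports_comm_split are hypothetical. The fourth component of the package version (the trailing ".1") is, as far as I can tell, not part of that code, so it is left out.

```cpp
// Assumption: mirrors the NCCL_VERSION(X, Y, Z) encoding from nccl.h.
// Releases up to 2.8 used major * 1000; later ones use major * 10000.
constexpr int nccl_version_code(int major, int minor, int patch) {
  return (major <= 2 && minor <= 8) ? major * 1000 + minor * 100 + patch
                                    : major * 10000 + minor * 100 + patch;
}

// Hypothetical constant: the floor this PR pins to (ncclCommSplit() needs 2.18.1).
constexpr int kCommSplitMinVersion = nccl_version_code(2, 18, 1);

// Hypothetical helper: true when a given version code is at or above the floor.
constexpr bool supports_comm_split(int version_code) {
  return version_code >= kCommSplitMinVersion;
}

// Compile-time sanity checks on the encoding and the comparison.
static_assert(kCommSplitMinVersion == 21801, "2.18.1 encodes as 21801");
static_assert(supports_comm_split(nccl_version_code(2, 19, 3)), "newer is fine");
static_assert(!supports_comm_split(nccl_version_code(2, 17, 1)), "older is rejected");
```

In a file that really includes nccl.h, the same idea would presumably reduce to comparing NCCL_VERSION_CODE against NCCL_VERSION(2, 18, 1) in a preprocessor check, which turns a too-old NCCL into a build failure rather than a missing-symbol error later.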

@jameslamb jameslamb added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 20, 2024
@github-actions github-actions bot added the conda label Sep 20, 2024
@jameslamb jameslamb changed the title WIP: bump NCCL floor to 2.18.1.1 bump NCCL floor to 2.18.1.1 Sep 20, 2024
@jameslamb jameslamb marked this pull request as ready for review September 20, 2024 21:57
@jameslamb jameslamb requested a review from a team as a code owner September 20, 2024 21:57
Contributor

@vyasr vyasr left a comment

cugraph also ought to be able to rely on raft for its NCCL handling rather than doing anything on its own AFAICT. It does use some raft APIs that accept NCCL types directly as arguments, but I still think that's OK.

@jameslamb
Member Author

I'm not convinced that cugraph could/should drop its NCCL dependency and only get it transitively through raft.

It has some direct uses, like these:

RAFT_NCCL_TRY(ncclCommDestroy(*nccl_comms_[i]));

ncclCommInitRank(nccl_comms[i].get(), ranks_to_include.size(), instance_manager_id, rank));

As well as others in tests and examples:

#include <nccl.h>

include(../../../cmake/thirdparty/get_nccl.cmake)

target_link_libraries(graph_operations PRIVATE cugraph::cugraph NCCL::NCCL MPI::MPI_CXX)

Those look like direct-enough uses that cugraph should have its own pinnings, in my opinion.

@vyasr
Contributor

vyasr commented Sep 24, 2024

I agree, those are direct uses. In that case, though, the files should be including nccl.h rather than relying on it being transitively included by raft's std_comms.h.
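
For illustration only: a small C++ sketch of the "include what you use" principle described above, written with standard headers so that it compiles without NCCL or raft. The function total_length is hypothetical. The point is that the file includes a header for every name it uses itself, instead of hoping another header pulls it in; for the cugraph files in question, that would mean adding #include <nccl.h> wherever ncclComm* functions are called directly.

```cpp
// Every name used below has its own direct include. Nothing here depends on
// one of these headers happening to include another.
#include <cstddef>  // std::size_t
#include <string>   // std::string
#include <vector>   // std::vector

// Hypothetical helper: sums the lengths of all strings in the list.
inline std::size_t total_length(const std::vector<std::string>& names) {
  std::size_t total = 0;
  for (const auto& name : names) {
    total += name.size();
  }
  return total;
}
```

If one of the included headers later stops including another internally, this file still compiles, which is the robustness the direct nccl.h include is meant to provide.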

@jameslamb jameslamb requested a review from a team as a code owner September 24, 2024 18:54
@jameslamb
Member Author

> the files should be including nccl.h rather than relying on it being transitively included by raft's std_comms.h.

Agreed! Pushed 7fca13f adding that.

Maybe in the future we can explore using include-what-you-use (https://github.com/include-what-you-use/include-what-you-use) in CI. A search turned up that you tried it last year (rapidsai/cudf#581 (comment)), so maybe at some point we can look into it again.

@jameslamb jameslamb changed the title bump NCCL floor to 2.18.1.1 bump NCCL floor to 2.18.1.1, include nccl.h where it's needed Sep 24, 2024
@jameslamb
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 27f2256 into rapidsai:branch-24.10 Sep 25, 2024
131 checks passed
@jameslamb jameslamb deleted the update-nccl branch September 25, 2024 04:25
@jakirkham jakirkham mentioned this pull request Sep 26, 2024
Labels
conda cuGraph improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

4 participants