Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade to nccl>=2.19 across RAPIDS #102

Closed
6 tasks done
jameslamb opened this issue Sep 20, 2024 · 12 comments
Closed
6 tasks done

upgrade to nccl>=2.19 across RAPIDS #102

jameslamb opened this issue Sep 20, 2024 · 12 comments
Assignees
Milestone

Comments

@jameslamb
Copy link
Member

jameslamb commented Sep 20, 2024

Description

wholegraph nightly conda-cpp-tests (amd64, rockylinux8, CUDA 11.4) have been failing with errors like this:

./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST 

(build link)

ncclCommSplit was first introduced in nccl==2.18.1-1 (commit link) and is used unconditionally in wholegraph, but wholegraph (and other RAPIDS libraries) have a floor of nccl>=2.9.9.

In this particular job, nccl=2.10.3.1 is getting installed.

 + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB

We haven't noticed this in other jobs or other projects' CI because in all of those, nccl>=2.18.1 is getting installed.

In addition, pytorch 2.3 requires nccl>=2.19: #102 (comment)

Across RAPIDS, we should raise the floor on nccl to >=2.19.

Benefits of this work

  • prevents a source of runtime errors when using RAPIDS ML libraries
  • allows testing wholegraph in a CUDA 11.4 environment again

Acceptance Criteria

  • all RAPIDS libraries with explicit nccl dependencies are pinned to nccl>=2.19

Approach

  1. create a branch in https://github.com/rapidsai/shared-workflows adding the test job that's failing in wholegraph nightlies to the PR matrices
  2. update wholegraph's pinnings, open a PR with CI pointed at that branch, and confirm that CI passes
  3. point wholegraph CI back at branch-24.10 of shared-workflows, delete that shared-workflows branch
  4. put up PRs with similar pins across all of the RAPIDS libraries with direct nccl dependencies
  5. merge all those PRs (can be in any order)

Changes

@jameslamb jameslamb self-assigned this Sep 20, 2024
@jameslamb jameslamb changed the title upgrade to nccl>2.18.1 across RAPIDS upgrade to nccl>=2.18.1.1 across RAPIDS Sep 20, 2024
@jakirkham jakirkham added this to the v24.10 milestone Sep 24, 2024
rapids-bot bot pushed a commit to rapidsai/integration that referenced this issue Sep 24, 2024
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #723
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this issue Sep 25, 2024
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/jakirkham

URL: #4661
rapids-bot bot pushed a commit to rapidsai/wholegraph that referenced this issue Sep 25, 2024
Contributes to rapidsai/build-planning#102

Fixes #217

## Notes for Reviewers

### How I tested this

Temporarily added a CUDA 11.4.3 test job to CI here (the same specs as the failing nightly), by pointing at the branch from rapidsai/shared-workflows#246.

Observed the exact same failures with CUDA 11.4 reported in rapidsai/build-planning#102.

```text
...
  + nccl                     2.10.3.1  hcad2f07_0                  rapidsai-nightly     125MB
...
./WHOLEGRAPH_CSR_WEIGHTED_SAMPLE_WITHOUT_REPLACEMENT_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./WHOLEMEMORY_HANDLE_TEST 
./WHOLEMEMORY_HANDLE_TEST: symbol lookup error: /opt/conda/envs/test/bin/gtests/libwholegraph/../../../lib/libwholegraph.so: undefined symbol: ncclCommSplit
sh -c exec "$0" ./GRAPH_APPEND_UNIQUE_TEST 
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966022370/job/30453393224?pr=218))

Pushed a commit adding a floor of `nccl>=2.18.1.1`. Saw all tests pass with CUDA 11.4 😁 

```text
...
  + nccl                     2.22.3.1  hee583db_1                  conda-forge          131MB
...
(various log messages showing all tests passed)
```

([build link](https://github.com/rapidsai/wholegraph/actions/runs/10966210441/job/30454147250?pr=218))

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - https://github.com/linhu-nv
  - https://github.com/jakirkham

URL: #218
rapids-bot bot pushed a commit to rapidsai/cuvs that referenced this issue Sep 25, 2024
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

`cuvs` doesn't have any *direct* uses of NCCL... it only uses it via raft. This PR proposes removing `cuvs`'s dependency pinnings on NCCL, in favor of just using whatever it gets transitively via raft.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #341
@jakirkham
Copy link
Member

In cuGraph, as part of the NumPy 2 upgrade ( rapidsai/cugraph#4615 (review) ), we noticed the following logic

# Starting from 2.2, PyTorch wheels depend on nvidia-nccl-cuxx>=2.19 wheel and
# dynamically link to NCCL. RAPIDS CUDA 11 CI images have an older NCCL version that
# might shadow the newer NCCL required by PyTorch during import (when importing
# `cupy` before `torch`).
if [[ "${NCCL_VERSION}" < "2.19" ]]; then
  PYTORCH_VER="2.1.0"
else
  PYTORCH_VER="2.3.0"
fi

Given this, am wondering if the minimum NCCL version should be 2.19?

cc @hcho3 (for awareness)

rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Sep 26, 2024
Contributes to rapidsai/build-planning#102

Some RAPIDS libraries are using `ncclCommSplit()`, which was introduced in `nccl==2.18.1.1`. This is part of a series of PRs across RAPIDS updating libraries' pins to `nccl>=2.18.1.1` to ensure they get a new-enough version that supports that.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #2443
@jameslamb
Copy link
Member Author

jameslamb commented Sep 26, 2024

There are some competing requirements here:

  • we want to pin pytorch>=2.3 to support numpy>=2
  • pytorch>=2.3 requires NCCL >= 2.19
  • raft's floor was just set to NCCL >=2.18.1.1, in bump NCCL floor to 2.18.1.1 raft#2443
  • raft goes into code freeze for 24.10 in 4 hours
  • raft is used widely, including in places that don't also use pytorch (like cuml)
  • we generally want the floor for NCCL to be the same across RAPIDS libraries if possible

Given how close it is to 24.10 code freeze for raft, I don't think we should raise raft's floor further to 2.19 at this point. I think it's ok for cugraph to have a slightly higher floor than raft for the 24.10 release... in practice, both libraries have been building against much newer versions of NCCL for a while anyway. For example, look at the most recent successful nightly build of cugraph on 24.10... it used NCCL 2.22.3.1:

    nccl:                             2.22.3.1-hd961488_1                        conda-forge   

(build link)

Here's my proposal:

  • 24.10:
    • raft stays pinned to nccl>=2.18.1.1 (because it's about to enter code freeze)
    • cugraph pins to nccl>=2.19 (to support pytorch>=2.3 and therefore numpy>2)
  • 24.12
    • all RAPIDS libraries with NCCL pins to nccl>=2.19

@jakirkham @hcho3 what do you think?

@hcho3
Copy link

hcho3 commented Sep 26, 2024

Sounds good to me

@jakirkham
Copy link
Member

Think if we are planning to bump the minimum across the board, we should go ahead and get it done

Yes we are at code freeze and yes it is not great we are finding this change before then. However I think having the right intended behavior trumps that

Would add that a user not constraining NCCL likely gets a much newer version of NCCL anyways. This is also true of our own testing. So the lower bound is there to prevent unexpected solves and incompatible environments

One last note wholegraph also has a PyTorch dependency. So would need similar treatment to cuGraph

@jameslamb
Copy link
Member Author

However I think having the right intended behavior trumps that

The issue you've found is at the intersection of PyTorch and NCCL. That's why I support bumping the floor up further to >=2.19 for the libraries that have PyTorch dependencies (and which, conveniently, are not in code freeze yet).

I thought that raft did not have a torch dependency... but was wrong about that 🙃

https://github.com/rapidsai/raft/blob/f37c41c54fc64a4e3689e5a61851ba3821800fee/python/pylibraft/pylibraft/common/outputs.py#L29-L36

So I guess we should do this there too.

@jameslamb
Copy link
Member Author

@jakirkham in the interest of time, I started an internal thread about this.

@jakirkham
Copy link
Member

jakirkham commented Sep 26, 2024

Ok let's start with RAFT since that is going into code freeze

Then we can work on the rest

Edit: Didn't see your comment above when posting this. Will follow the internal thread

@jakirkham
Copy link
Member

Thanks James! 🙏

Now that we have a PR for RAFT ( rapidsai/raft#2458 ), should we work on other cases (like Wholegraph)?

@jameslamb
Copy link
Member Author

Yes, I'll do that right now.

@jameslamb
Copy link
Member Author

jameslamb commented Sep 26, 2024

Think all the updates we want now have associated PRs:

@jakirkham
Copy link
Member

Thanks James! 🙏

Also thank you for organizing that nice list 🙂

rapids-bot bot pushed a commit to rapidsai/wholegraph that referenced this issue Sep 26, 2024
Follow-up to #218 

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

cc @linhu-nv for awareness

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham

URL: #223
rapids-bot bot pushed a commit to rapidsai/raft that referenced this issue Sep 26, 2024
Follow-up to #2443

As part of the work to support NumPy 2 across RAPIDS, we found reason to upgrade some libraries like `cugraph` to slightly newer NCCL (`>=2.19`). Context: rapidsai/build-planning#102 (comment)

This applies that same bump here, to keep the range of NCCL versions consistent across RAPIDS.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - https://github.com/jakirkham
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2458
@jakirkham jakirkham changed the title upgrade to nccl>=2.18.1.1 across RAPIDS upgrade to nccl>=2.19 across RAPIDS Sep 27, 2024
rapids-bot bot pushed a commit to rapidsai/integration that referenced this issue Sep 27, 2024
Follow-up to #723

This bumps the NCCL floor here slightly higher, to `>=2.19`. Part of a RAPIDS-wide update of that floor for the 24.10 release. See rapidsai/build-planning#102 (comment) for context.

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

URL: #726
@jameslamb
Copy link
Member Author

This is complete! Thanks very much for the help @jakirkham @hcho3 @alexbarghi-nv @cjnolet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants