example on single GPU #103

Can I use NCCL on a single GPU?
If so, can you give me an example?

Comments
I'm not sure what you have in mind. NCCL is intended for communication between GPUs. Within a single GPU, you'd be better off with simple CUDA kernels and cudaMemcpy calls.
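For context, the plain-CUDA alternative on one device is just a device-to-device copy. A minimal sketch (not from this thread; buffer names and size are made up):

// Sketch: copying between two buffers on the same GPU without NCCL.
// Error checking is omitted and the element count is illustrative.
#include <cuda_runtime.h>

int main() {
  const size_t count = 1024;
  float *src, *dst;
  cudaSetDevice(0);
  cudaMalloc(&src, count * sizeof(float));
  cudaMalloc(&dst, count * sizeof(float));
  // Device-to-device copy: the single-GPU equivalent of a point-to-point transfer.
  cudaMemcpy(dst, src, count * sizeof(float), cudaMemcpyDeviceToDevice);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}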
I just want to do a test, without considering performance.
In principle it may be possible. You'll need to control all NCCL ranks from within a single process. Then just repeat your device index in the device list passed to ncclCommInitAll(), and use a different CUDA stream for each rank.
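As a rough illustration of that setup (not from this thread), a single process driving two NCCL ranks on the same device could look like the following in C, assuming an NCCL release older than 2.5 where this was still allowed; the buffer size and the choice of ncclAllReduce are illustrative:

// Sketch only: two NCCL ranks sharing GPU 0 within one process
// (works only on NCCL releases older than 2.5; error checking omitted).
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const int nranks = 2;
  int devs[2] = {0, 0};                 // repeat the same device index
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  float* sendbuf[2];
  float* recvbuf[2];
  const size_t count = 1024;

  cudaSetDevice(0);
  for (int i = 0; i < nranks; ++i) {
    cudaStreamCreate(&streams[i]);      // one stream per rank
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
  }

  // One process controls all ranks.
  ncclCommInitAll(comms, nranks, devs);

  // Issue the collective for every rank inside a single group call.
  ncclGroupStart();
  for (int i = 0; i < nranks; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nranks; ++i) cudaStreamSynchronize(streams[i]);
  for (int i = 0; i < nranks; ++i) ncclCommDestroy(comms[i]);
  return 0;
}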
All the NCCL collectives do support being called with a single-rank communicator, in which case they simply call cudaMemcpyAsync() with the device-to-device flag.
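Similarly, a minimal single-rank sketch (again illustrative, error checking omitted); with one rank the collective boils down to a device-to-device copy:

// Sketch: a one-rank NCCL communicator on one GPU. With a single rank,
// the collective is effectively a device-to-device cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  int dev = 0;
  ncclComm_t comm;
  cudaStream_t stream;
  float *sendbuf, *recvbuf;
  const size_t count = 1024;

  cudaSetDevice(dev);
  cudaStreamCreate(&stream);
  cudaMalloc(&sendbuf, count * sizeof(float));
  cudaMalloc(&recvbuf, count * sizeof(float));

  ncclCommInitAll(&comm, 1, &dev);   // single-rank communicator
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDestroy(comm);
  cudaFree(sendbuf);
  cudaFree(recvbuf);
  cudaStreamDestroy(stream);
  return 0;
}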
Thanks so much!
How about multiple processes on a single GPU? ^_^
Since NCCL 2.5, it is no longer possible to have multiple ranks use the same GPU; it will return an error.
Hi @sjeaugey, I have an all-gather test using NCCL on the same GPU:

import torch
import torch.distributed as torch_dist
import torch.multiprocessing as mp

def _torch_dist_fn(rank, world_size):
    torch_dist.init_process_group(
        backend='gloo',
        init_method=f'tcp://127.0.0.1:2345',
        world_size=world_size,
        rank=rank)
    # With a single visible GPU, every rank maps to device 0.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.Tensor([rank]).cuda()
    tensor_list = [torch.Tensor([0]).cuda() for _ in range(world_size)]
    torch_dist.all_gather(tensor_list, tensor)
    print(f'[rank {rank}]: tensor: {tensor}, tensor_list: {tensor_list}')

def launch_torch_distributed(process_num):
    # mp.spawn passes the rank as the first argument; args supplies the rest.
    mp.spawn(_torch_dist_fn, nprocs=process_num, args=(process_num, ))

if __name__ == "__main__":
    print('CUDA DEVICE COUNT:', torch.cuda.device_count())
    print('PYTORCH VERSION:', torch.__version__)
    print('NCCL VERSION:', torch.cuda.nccl.version())
    launch_torch_distributed(2)

Got:
Does that mean multiple ranks on the same GPU work? Edited: I wrongly set the backend in init_process_group to 'gloo' rather than 'nccl'.
If you set the backend to 'nccl', you should get an error, since NCCL no longer supports multiple ranks on the same GPU (as noted above).
Thanks for the reply!
@sjeaugey I am running into that warning. Is there some configuration that might work where we could have two processes per GPU?
Unfortunately not. A large part of the optimization of rings and trees is based on designating the order in which we go through the GPUs, and those are identified by their GPU index. So, on top of being prone to deadlocks, running two ranks on the same GPU is not supported.
Got it, thanks!
It's my understanding there is some way to split an NVIDIA GPU such that each VM sees a different "slice" of the GPU; would such a thing potentially work with NCCL?
I guess you are referring to MIG (Multi-Instance GPU). Unfortunately, in MIG mode we do not define new GPU indexes, so NCCL cannot differentiate the sub-GPUs using cudaGetDevice (or some other call).
If two VMs (two GPUs passed through to two VMs) are on one host, how does NCCL communicate between the VMs: P2P, SHM, or NET? Is there any solution to accelerate this communication?
It will probably be NET, as we don't see other GPUs, and we currently have no way to know they're on the same physical node and accessible through SHM/P2P.
Hi @sjeaugey, thank you for your comprehensive explanation. Is there any plan for NCCL to support MIG or MPS in the near future?