
example on single GPU #103

Closed · fastalgo opened this issue Jun 30, 2017 · 18 comments

@fastalgo

Can I use NCCL on a single GPU?
If so, can you give me an example?

@nluehr (Contributor) commented Jun 30, 2017

I'm not sure what you have in mind. NCCL is intended for communication between GPUs. Within a single GPU, you'd be better off with simple CUDA kernels and cudaMemcpy calls.

@fastalgo (Author)

I just want to run a test, without worrying about performance.
Is it possible to use NCCL on a single GPU?

@nluehr (Contributor) commented Jun 30, 2017

In principle it may be possible. You'd need to control all NCCL ranks from within a single process: repeat your device index in the device list passed to ncclCommInitAll(), and use a different cudaStream for each rank.
But this is not recommended, because of the possibility of creating deadlocks when not all kernels get scheduled on the GPU simultaneously.
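A minimal sketch of that setup, assuming a pre-2.5 NCCL build that still accepts a duplicated device index (newer releases reject it, see below), and keeping in mind the deadlock caveat above:

```c
// Sketch only: two NCCL ranks sharing GPU 0 inside one process, each with
// its own stream. Not recommended (deadlock-prone); NCCL >= 2.5 rejects it.
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
    printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
    printf("NCCL error: %s\n", ncclGetErrorString(r)); return 1; } } while (0)

int main() {
  const int nranks = 2;
  const size_t count = 1024;
  int devs[2] = {0, 0};              // same device index repeated
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  float* buf[2];

  CUDACHECK(cudaSetDevice(0));
  for (int i = 0; i < nranks; i++) {
    CUDACHECK(cudaStreamCreate(&streams[i]));   // one stream per rank
    CUDACHECK(cudaMalloc(&buf[i], count * sizeof(float)));
  }
  NCCLCHECK(ncclCommInitAll(comms, nranks, devs));

  // Issue both ranks' collectives as one group so neither blocks the other.
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nranks; i++)
    NCCLCHECK(ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                            comms[i], streams[i]));
  NCCLCHECK(ncclGroupEnd());

  for (int i = 0; i < nranks; i++) {
    CUDACHECK(cudaStreamSynchronize(streams[i]));
    ncclCommDestroy(comms[i]);
    CUDACHECK(cudaFree(buf[i]));
    CUDACHECK(cudaStreamDestroy(streams[i]));
  }
  return 0;
}
```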

@AddyLaddy (Collaborator)

All the NCCL collectives do support being called with a single-rank communicator, in which case they simply call cudaMemcpyAsync() with the device-to-device flag.
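For illustration, here is a minimal single-rank sketch (an assumption-laden example, not taken from this thread): pass a one-entry device list to ncclCommInitAll() and the collective amounts to a device-to-device copy.

```c
// Sketch only: a single-rank NCCL communicator on GPU 0. With one rank, the
// allreduce below effectively degenerates into a device-to-device copy.
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const size_t count = 1024;
  int dev = 0;
  ncclComm_t comm;
  cudaStream_t stream;
  float *send, *recv;

  cudaSetDevice(dev);
  cudaStreamCreate(&stream);
  cudaMalloc(&send, count * sizeof(float));
  cudaMalloc(&recv, count * sizeof(float));

  ncclCommInitAll(&comm, 1, &dev);   // one rank, one device

  // Single-rank "reduction": copies send -> recv on the same device.
  ncclAllReduce(send, recv, count, ncclFloat, ncclSum, comm, stream);

  cudaStreamSynchronize(stream);
  ncclCommDestroy(comm);
  cudaFree(send);
  cudaFree(recv);
  cudaStreamDestroy(stream);
  return 0;
}
```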

@fastalgo (Author) commented Jul 1, 2017

Thanks so much!

@ice-tong

> All the NCCL collectives do support being called with a single-rank communicator, in which case they simply call cudaMemcpyAsync() with the device-to-device flag.

How about multiple processes on a single GPU? ^_^

@sjeaugey (Member) commented Aug 23, 2022

> How about multiple processes on a single GPU? ^_^

It is no longer possible to have multiple ranks use the same GPU since NCCL 2.5. It will return an error.

@ice-tong commented Aug 24, 2022

Hi @sjeaugey, I have an all-gather test using NCCL on the same GPU:

import torch
import torch.distributed as torch_dist
import torch.multiprocessing as mp


def _torch_dist_fn(rank, world_size):
    torch_dist.init_process_group(
        backend='gloo',
        init_method=f'tcp://127.0.0.1:2345',
        world_size=world_size,
        rank=rank)
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.Tensor([rank]).cuda()
    tensor_list = [torch.Tensor([0]).cuda() for _ in range(world_size)]
    torch_dist.all_gather(tensor_list, tensor)
    print(f'[rank {rank}]: tensor: {tensor}, tensor_list: {tensor_list}')


def launch_torch_distributed(process_num):
    mp.spawn(_torch_dist_fn, nprocs=process_num, args=(process_num, ))


if __name__=="__main__":
    print('CUDA DEVICE COUNT:', torch.cuda.device_count())
    print('PYTORCH VERSION:', torch.__version__)
    print('NCCL VERSION:', torch.cuda.nccl.version())
    launch_torch_distributed(2)

Got:

CUDA DEVICE COUNT: 1
PYTORCH VERSION: 1.9.0+cu111
NCCL VERSION: 2708
[rank 1]: tensor: tensor([1.], device='cuda:0'), tensor_list: [tensor([0.], device='cuda:0'), tensor([1.], device='cuda:0')]
[rank 0]: tensor: tensor([0.], device='cuda:0'), tensor_list: [tensor([0.], device='cuda:0'), tensor([1.], device='cuda:0')]

Does that mean multiple ranks on the same GPU work?

Edited: I mistakenly set the backend in init_process_group to 'gloo'. Using 'nccl' gives an error. 😢

ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

@sjeaugey (Member) commented Aug 24, 2022

If you set NCCL_DEBUG=WARN, NCCL should print something like:

NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device XX

@ice-tong

Thanks for the reply!

@christopherhesse (Contributor)

@sjeaugey I am running into that warning (`[0] init.cc:545 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 100000`), but it would be very handy to be able to run our multi-GPU tests on a smaller number of GPUs.

Is there some configuration that might work where we could have two processes per GPU?

@sjeaugey (Member)

Unfortunately not. A large part of the ring and tree optimization is based on choosing the order in which we go through the GPUs, and those are identified by their GPU index.

So, on top of being prone to deadlocks, running two ranks on the same GPU is not supported.

@christopherhesse (Contributor)

Got it, thanks!

@christopherhesse (Contributor)

It's my understanding that there is some way to split an NVIDIA GPU so that each VM sees a different "slice" of the GPU. Would such a thing potentially work with NCCL?

@sjeaugey (Member) commented Sep 13, 2022

I guess you are referring to MIG (Multi-Instance GPU). Unfortunately, in MIG mode we do not define new GPU indexes, so NCCL cannot differentiate the sub-GPUs using cudaGetDevice (or some other call).
Maybe if you were to split the GPU and run a different VM on each slice, so that NCCL considers each slice as a different node (with a single GPU), it would work. You'd be using the network for all communication, but I don't see a reason for it not to work. NCCL+MIG is not supported though, so no guarantees.

@jiangxiaobin96
Copy link

> I guess you are referring to MIG (Multi-Instance GPU). Unfortunately, in MIG mode we do not define new GPU indexes, so NCCL cannot differentiate the sub-GPUs using cudaGetDevice (or some other call). Maybe if you were to split the GPU and run a different VM on each slice, so that NCCL considers each slice as a different node (with a single GPU), it would work. You'd be using the network for all communication, but I don't see a reason for it not to work. NCCL+MIG is not supported though, so no guarantees.

If there are two VMs on one host (two GPUs, each passed through to its own VM), how does NCCL communicate between the VMs: P2P, SHM, or NET? Is there any way to accelerate this communication?

@sjeaugey (Member) commented Feb 3, 2023

It will probably be NET, as we don't see other GPUs, and we currently have no way to know they're on the same physical node and accessible through SHM/P2P.

@zxgx commented Apr 16, 2024

Hi @sjeaugey, thank you for your comprehensive explanation. Are there any plans for NCCL to support MIG or MPS in the near future?
