example on single GPU #103

Can I use NCCL on a single GPU?
If so, can you give me an example?

Comments
I'm not sure what you have in mind. NCCL is intended for communication between GPUs. Within a single GPU, you'd be better off with simple CUDA kernels and cudaMemcpy calls.
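For context, the plain-CUDA alternative on one device is just a device-to-device copy. A minimal sketch (not from this thread; buffer names and size are made up):

// Sketch: copying between two buffers on the same GPU without NCCL.
// Error checking is omitted and the element count is illustrative.
#include <cuda_runtime.h>

int main() {
  const size_t count = 1024;
  float *src, *dst;
  cudaSetDevice(0);
  cudaMalloc(&src, count * sizeof(float));
  cudaMalloc(&dst, count * sizeof(float));
  // Device-to-device copy: the single-GPU equivalent of a point-to-point transfer.
  cudaMemcpy(dst, src, count * sizeof(float), cudaMemcpyDeviceToDevice);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}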
I just want to do a test, without considering performance.
In principle it may be possible. You'll need to control all NCCL ranks from within a single process. Then just repeat your device index in the device list passed to ncclCommInitAll(), and use a different CUDA stream for each rank.
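As a rough illustration of that setup (not from this thread), a single process driving two NCCL ranks on the same device could look like the following in C, assuming an NCCL release older than 2.5 where this was still allowed; the buffer size and the choice of ncclAllReduce are illustrative:

// Sketch only: two NCCL ranks sharing GPU 0 within one process
// (works only on NCCL releases older than 2.5; error checking omitted).
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  const int nranks = 2;
  int devs[2] = {0, 0};                 // repeat the same device index
  ncclComm_t comms[2];
  cudaStream_t streams[2];
  float* sendbuf[2];
  float* recvbuf[2];
  const size_t count = 1024;

  cudaSetDevice(0);
  for (int i = 0; i < nranks; ++i) {
    cudaStreamCreate(&streams[i]);      // one stream per rank
    cudaMalloc(&sendbuf[i], count * sizeof(float));
    cudaMalloc(&recvbuf[i], count * sizeof(float));
  }

  // One process controls all ranks.
  ncclCommInitAll(comms, nranks, devs);

  // Issue the collective for every rank inside a single group call.
  ncclGroupStart();
  for (int i = 0; i < nranks; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < nranks; ++i) cudaStreamSynchronize(streams[i]);
  for (int i = 0; i < nranks; ++i) ncclCommDestroy(comms[i]);
  return 0;
}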
All the NCCL collectives do support being called with a single-rank communicator, in which case they simply call cudaMemcpyAsync() with the device-to-device flag.
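Similarly, a minimal single-rank sketch (again illustrative, error checking omitted); with one rank the collective boils down to a device-to-device copy:

// Sketch: a one-rank NCCL communicator on one GPU. With a single rank,
// the collective is effectively a device-to-device cudaMemcpyAsync.
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
  int dev = 0;
  ncclComm_t comm;
  cudaStream_t stream;
  float *sendbuf, *recvbuf;
  const size_t count = 1024;

  cudaSetDevice(dev);
  cudaStreamCreate(&stream);
  cudaMalloc(&sendbuf, count * sizeof(float));
  cudaMalloc(&recvbuf, count * sizeof(float));

  ncclCommInitAll(&comm, 1, &dev);   // single-rank communicator
  ncclAllReduce(sendbuf, recvbuf, count, ncclFloat, ncclSum, comm, stream);
  cudaStreamSynchronize(stream);

  ncclCommDestroy(comm);
  cudaFree(sendbuf);
  cudaFree(recvbuf);
  cudaStreamDestroy(stream);
  return 0;
}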
Thanks so much!
How about multiple processes on a single GPU? ^_^
Since NCCL 2.5, it is no longer possible to have multiple ranks use the same GPU; it will return an error.
Hi @sjeaugey, I have an all-gather test using NCCL on the same GPU:

import torch
import torch.distributed as torch_dist
import torch.multiprocessing as mp

def _torch_dist_fn(rank, world_size):
    torch_dist.init_process_group(
        backend='gloo',
        init_method=f'tcp://127.0.0.1:2345',
        world_size=world_size,
        rank=rank)
    # With a single visible GPU, every rank maps to device 0.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.Tensor([rank]).cuda()
    tensor_list = [torch.Tensor([0]).cuda() for _ in range(world_size)]
    torch_dist.all_gather(tensor_list, tensor)
    print(f'[rank {rank}]: tensor: {tensor}, tensor_list: {tensor_list}')

def launch_torch_distributed(process_num):
    # mp.spawn passes the rank as the first argument; args supplies the rest.
    mp.spawn(_torch_dist_fn, nprocs=process_num, args=(process_num, ))

if __name__ == "__main__":
    print('CUDA DEVICE COUNT:', torch.cuda.device_count())
    print('PYTORCH VERSION:', torch.__version__)
    print('NCCL VERSION:', torch.cuda.nccl.version())
    launch_torch_distributed(2)

Got:
Does that mean multiple ranks on the same GPU work? Edited: I wrongly set the backend in init_process_group to 'gloo' rather than 'nccl'.
If you set the backend to 'nccl', you should get an error, since NCCL no longer supports multiple ranks on the same GPU (as noted above).
Thanks for the reply!
@sjeaugey I am running into that warning. Is there some configuration that might work where we could have two processes per GPU?
Unfortunately not. A large part of the optimization of rings and trees is based on designating the order in which we go through the GPUs, and those are identified by their GPU index. So, on top of being prone to deadlocks, running two ranks on the same GPU is not supported.
Got it, thanks!
It's my understanding there is some way to split an NVIDIA GPU such that each VM sees a different "slice" of the GPU; would such a thing potentially work with NCCL?
I guess you are referring to MIG (Multi-Instance GPU). Unfortunately, in MIG mode we do not define new GPU indexes, so NCCL cannot differentiate the sub-GPUs using cudaGetDevice (or some other call).
If two VMs (two GPUs passed through to two VMs) are on one host, how does NCCL communicate between the VMs: P2P, SHM, or NET? Is there any solution to accelerate this communication?
It will probably be NET, as we don't see other GPUs, and we currently have no way to know they're on the same physical node and accessible through SHM/P2P.
Hi @sjeaugey, thank you for your comprehensive explanation. Is there any plan for NCCL to support MIG or MPS in the near future?