Using CUDA shared memory does not appear to work. #9293
@phalexo thank you for your submission. Please update your example to fix the syntax formatting and add the appropriate imports, so that the example can be copied, pasted, and run without further code modifications. Thank you.
```python
from numba import cuda
import multiprocessing
from multiprocessing import shared_memory
from functools import partial
import numpy as np
import sys, os

N = 50304
M = N

def allocate_work(queue):
    for i in range(30):
        queue.put(i)

def process_queue_items(args, i):
    (queue, dct) = args
    # Get the shared memory and set it up as a numpy array
    shm = shared_memory.SharedMemory("SharedVectors")
    data = np.ndarray((N, M), dtype=np.float32, buffer=shm.buf)
    # Set up d_data buffer(s) on the GPUs
    d_data = {}
    gpus = len(cuda.list_devices())
    PARTN = N // gpus
    for cid in range(gpus):
        cuda.select_device(cid)  # bind device to thread
        device = cuda.get_current_device()  # get current device
        try:
            # We want to take the CUDA handles that the parent process
            # prepared with get_ipc_handle() and convert them to arrays
            # we can use normally.
            # with dct[cid] as ipc_array:  # One possible option, but does not work.
            with cuda.open_ipc_array(dct[cid], shape=(5,), dtype='float32') as ipc_array:  # Second option; still does not work.
                # I expect to get an array that I can use in a numba/cuda kernel
                d_data[cid] = ipc_array
                print(d_data[cid])
                # dt = d_data[cid].args[0].copy_to_host(stream=cuda.stream())
        except Exception as e:
            print(f"thread/process ids, cid = {cid}, i = {i}, {e}")

if __name__ == "__main__":
    slaves = 10
    shm = shared_memory.SharedMemory(name="SharedVectors", create=True, size=4 * N * M)
    data = np.ndarray((N, M), dtype=np.float32, buffer=shm.buf)
    ctx = multiprocessing.get_context(method='forkserver')
    with ctx.Pool(processes=slaves) as pool:
        manager = ctx.Manager()
        queue = manager.Queue()
        dct = manager.dict()
        allocate_work(queue)
        # Set up arrays on GPUs, one array per GPU.
        # We want 10 pool processes to be able to access the 4 GPU buffers.
        gpus = len(cuda.list_devices())
        PARTN = N // gpus
        for cid in range(gpus):
            cuda.select_device(cid)  # bind device to thread
            # NOTE: the original used np.ndarray([1, 2, 3, 4, 5], ...), which
            # allocates an uninitialized array of shape (1, 2, 3, 4, 5);
            # np.array is almost certainly what was intended.
            dct[cid] = cuda.to_device(np.array([1, 2, 3, 4, 5], dtype='float32')).get_ipc_handle()
        results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))
    shm.close()
    shm.unlink()
```
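The host-side `shared_memory` lifecycle in the repro works as expected once CUDA is taken out of the picture. A minimal, runnable two-process sketch of just that part (the sizes and structure are illustrative, not taken from the report):

```python
import numpy as np
from functools import partial
from multiprocessing import get_context, shared_memory

N = 16

def reader(shm_name, _):
    # Attach to the block the parent created; close but never unlink here.
    shm = shared_memory.SharedMemory(name=shm_name)
    view = np.ndarray((N,), dtype=np.float32, buffer=shm.buf)
    total = float(view.sum())
    del view      # drop the buffer export before closing, else BufferError
    shm.close()
    return total

def run_demo():
    nbytes = np.dtype(np.float32).itemsize * N
    shm = shared_memory.SharedMemory(create=True, size=nbytes)  # auto-named
    data = np.ndarray((N,), dtype=np.float32, buffer=shm.buf)
    data[:] = 1.0
    ctx = get_context("fork")  # fork: workers inherit state, no __main__ re-import
    with ctx.Pool(processes=2) as pool:
        sums = pool.map(partial(reader, shm.name), range(2))
    del data
    shm.close()
    shm.unlink()  # only the creator removes the segment, once everyone is done
    return sums

if __name__ == "__main__":
    print(run_demo())  # [16.0, 16.0]
```

Note the two gotchas this sketch works around: a numpy view over `shm.buf` must be dropped before `close()`, and only one process should `unlink()`.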
@phalexo I just edited your comment - you need to use three backticks for multi-line code. I'm just looking into this now.
I have made some progress with this simple code. It is possible it is interacting with something else.
OK. I'd like to leave this issue open, because:
I partially agree with this: the documentation is bad. However, the API is also bad! The APIs that this area of Numba wraps are known as "Legacy IPC", so I'm a little disinclined to try to improve it. There is another, more modern API for IPC, which might be usable directly with the NVIDIA CUDA Python bindings. I'll check into whether it's usable without any further modification, and post back with an update.
Not quite sure what to label this, because we don't have an "API is bad" label... so I'll just go with "bug" for now.
@gmarkall There is some kind of strange dependence on the size of the array that needs to be shared. It is well below the GPU VRAM (12.2 GiB), but the code still throws an exception. I have zero idea what opening a handle has to do with the size of an array. The array is created successfully to begin with, and a handle is created too; it is only the opening of the handles that is a problem with the larger underlying array.

thread/process ids, cid = 0, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
Are you sure the original arrays you're sharing are not getting garbage collected due to going out of scope everywhere? |
@gmarkall The arrays are created in the second-to-last line. I don't see why they would go out of scope, since the pool should still be running until the results are returned.
@gmarkall I am totally confused as to what is going on. It appears that I can allocate a buffer (the data I compute) of about 2.5 GiB (on GPUs with 12.2 GiB), but when I try to allocate a second buffer (the output) of the same size, for a total of 4.9 GiB (well below the VRAM size), something breaks and I can no longer open those handles. It does have something to do with garbage collection: when I put gc.disable() at the beginning, I can allocate both buffers, and I don't get the exception when opening the array handles. This does appear to be a genuine bug, not just a documentation problem.
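One consistent explanation for the gc.disable() observation: if a device array in the exporting process becomes unreachable, Numba's deallocation machinery may free the underlying device buffer, and an IPC handle is only an opaque token that does not keep the allocation alive. The pitfall can be illustrated with plain Python references (the `Buffer` class below is a stand-in for a device allocation, not any Numba API):

```python
import gc
import weakref

class Buffer:
    """Stand-in for a device allocation; freed when no strong references remain."""
    def __init__(self, nbytes):
        self.nbytes = nbytes

def make_handle(buf):
    # An IPC-style handle identifies the buffer but does not own it.
    return id(buf)

buf = Buffer(1024)
probe = weakref.ref(buf)   # observe the allocation without keeping it alive
handle = make_handle(buf)

del buf        # exporting side drops its only strong reference...
gc.collect()
print(probe() is None)   # True: the buffer is gone, the handle now dangles
```

If this is indeed what is happening, keeping a strong reference to each device array for the lifetime of its handle, or wrapping the critical region in Numba's documented `numba.cuda.defer_cleanup()` context manager, should avoid it.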
This appears to be the only documentation that I can find on how to share the same CUDA memory block between two or more Linux processes: https://numba.readthedocs.io/en/stable/cuda/ipc.html

Although I can do something like

```python
for cid in range(gpus):
    cuda.select_device(cid)  # bind device to thread
    dct[cid] = pickle.dumps(cuda.to_device(np.ascontiguousarray(data[0:M, cid * PARTN: cid * PARTN + PARTN])).get_ipc_handle())
```

I am unable to create an array over the same memory block in another process; I get an exception. I have also tried to run "ipc_array" through pickle.loads(ipc_array), without more success. Since the documentation is really bad, it is not clear what is going on.
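For reference, the pattern the linked page describes reduces to: export with get_ipc_handle() in the owning process, pickle the handle across, and open it as a context manager in the consumer, while the owner keeps the device array alive. A sketch under those assumptions (it needs a CUDA GPU at runtime, so it is guarded to merely define the functions where none is present):

```python
import pickle
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except Exception:
    HAVE_GPU = False

def export_array(host_arr):
    """Owning process: copy to device and export a picklable IPC handle.
    The returned device array must stay referenced while peers use the handle."""
    d_arr = cuda.to_device(host_arr)
    return d_arr, pickle.dumps(d_arr.get_ipc_handle())

def child_use(handle_bytes):
    """Consumer process: reopen the handle as a device array and read it back.
    Note: CUDA refuses to open an IPC handle in the exporting process itself."""
    handle = pickle.loads(handle_bytes)
    with handle as ipc_arr:  # the IpcHandle works as a context manager
        return ipc_arr.copy_to_host()

if __name__ == "__main__" and HAVE_GPU:
    keep_alive, blob = export_array(np.arange(5, dtype=np.float32))
    # Send `blob` to another process (e.g. via a Queue) and call
    # child_use(blob) there, keeping `keep_alive` referenced meanwhile.
```

Because the handle itself pickles, it can cross a Queue or manager dict directly; the extra pickle.dumps here only makes that explicit.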