Using CUDA shared memory does not appear to work. #9293

Open · phalexo opened this issue Nov 20, 2023 · 11 comments

Labels: bug, CUDA (CUDA related issue/PR), doc

phalexo commented Nov 20, 2023

This appears to be the only documentation that I can find on how to share the same CUDA memory block between 2 or more
Linux processes. (https://numba.readthedocs.io/en/stable/cuda/ipc.html)

Although I can do something like

for cid in range(gpus):
    cuda.select_device(cid)  # bind device to thread
    dct[cid] = pickle.dumps(cuda.to_device(np.ascontiguousarray(data[0:M, cid * PARTN: cid * PARTN + PARTN])).get_ipc_handle())

I am unable to create an array over the same memory block in another process, like this:

gpus = len(cuda.list_devices())
PARTN = N//gpus
for cid in range(gpus):
    cuda.select_device(cid)                    # bind device to thread
    device = cuda.get_current_device()         # get current device
    try:
        with dct[cid] as ipc_array:
            d_data[cid] = cuda.open_ipc_array(ipc_array, shape=(M, PARTN), dtype=np.float32)
            #print(d_data[cid].copy_to_host()[:10,:10])
    except Exception as e:
        print(f"thread/process ids, cid = {cid}, i = {i}, {e}")

I get an exception. I have also tried passing "ipc_array" through pickle.loads(ipc_array), without any more success.

Since the documentation is really bad, it is not clear what is going on.
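
For reference, the pattern the linked page documents looks roughly like this (a minimal sketch based on that page; the parent has to keep the device array referenced while children use the handle):

from numba import cuda
import numpy as np
import pickle

# Parent process: allocate on the device and export an IPC handle.
arr = np.arange(10, dtype=np.float32)
d_arr = cuda.to_device(arr)            # keep this reference alive
handle = d_arr.get_ipc_handle()
payload = pickle.dumps(handle)         # IpcHandle objects are picklable

# Child process: reconstruct the handle and open it as a device array.
handle = pickle.loads(payload)
with handle as ipc_arr:                # opens on the child's CUDA context
    host_copy = ipc_arr.copy_to_host()
# leaving the with-block closes the handle in the child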

esc commented Nov 21, 2023

@phalexo thank you for your submission. Please update your example to fix the syntax formatting and add the appropriate imports, such that the example can be copied, pasted, and run without further code modifications. Thank you.

esc added the "more info needed" (This issue needs more information) and "CUDA" (CUDA related issue/PR) labels Nov 21, 2023
gmarkall self-assigned this Nov 21, 2023

phalexo commented Nov 21, 2023

@gmarkall

from numba import cuda
import multiprocessing
from multiprocessing import shared_memory
from functools import partial
import numpy as np
import sys, os

N = 50304
M = N

def allocate_work(queue):
    for i in range(30):
        queue.put(i)

def process_queue_items(args, i):

    (queue, dct) = args

    # Get the shared memory and set it up as a numpy array
    shm = shared_memory.SharedMemory("SharedVectors")
    data = np.ndarray((N,M), dtype=np.float32, buffer=shm.buf)

    # Set up d_data(s) buffer(s) on GPUS
    d_data = {}
    gpus = len(cuda.list_devices())
    PARTN = N//gpus
    for cid in range(gpus):
        cuda.select_device(cid)                    # bind device to thread
        device = cuda.get_current_device()         # get current device
        try:

            # We want to take in the CUDA handles that the parent process prepared with get_ipc_handle()
            # convert them to array to use normally

            #with dct[cid] as ipc_array: # One possible option, but does not work.
            with cuda.open_ipc_array(dct[cid], shape=(5,), dtype='float32') as ipc_array: # second option, still does not work.

                # I expect to get an array that I can use in a numba/cuda kernel
                d_data[cid] = ipc_array
                print(d_data[cid])

                #dt = d_data[cid].args[0].copy_to_host(stream=cuda.stream())

        except Exception as e:
            print(f"thread/process ids, cid = {cid}, i = {i}, {e}")

if __name__ == "__main__":

    slaves = 10

    shm = shared_memory.SharedMemory(name="SharedVectors", create=True, size=4*N*M)
    data = np.ndarray((N,M), dtype=np.float32, buffer=shm.buf)
    #main_cuda(data)

    ctx = multiprocessing.get_context(method='forkserver')
    with ctx.Pool(processes=slaves) as pool:

        manager = ctx.Manager()
        queue = manager.Queue()
        dct = manager.dict()
        allocate_work(queue)

        # Set up arrays on GPUs, one array per GPU
        # We want 10 pool processes to be able to access the 4 GPU buffers

        gpus = len(cuda.list_devices())
        PARTN = N//gpus
        for cid in range(gpus):
            cuda.select_device(cid)                    # bind device to thread
            # NB: np.ndarray([1,2,3,4,5], ...) treats the list as a *shape*,
            # creating an uninitialized 5-D array; np.array was probably
            # intended (this turns out to matter, see below).
            dct[cid] = cuda.to_device(np.ndarray([1,2,3,4,5], dtype='float32')).get_ipc_handle()

        results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))

    shm.close()
    shm.unlink()

gmarkall (Member) commented

@phalexo I just edited your comment - you need to use three backticks for multi-line code.

I'm just looking into this now.


phalexo commented Nov 21, 2023 via email

gmarkall (Member) commented

OK. I'd like to leave this issue open, because:

Since the documentation is really bad, it is not clear what is going on.

I partially agree with this: the documentation is bad. However, the API is also bad!

The APIs that this area of Numba wraps are known as "Legacy IPC", so I'm a little bit disinclined to actually try and improve it. There is another, more modern API for IPC, which might be usable directly with the NVIDIA CUDA Python bindings. I'll check into whether it's usable without any further modification, and post back with an update.
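
For context, the newer (non-legacy) flow at the driver level looks roughly like the outline below. This is an untested sketch against the NVIDIA cuda-python bindings: the exact names and signatures should be checked against their documentation, and size is a placeholder.

# Untested outline of the non-legacy IPC flow via the cuda-python bindings.
# Each binding call returns an (error, result) tuple.
from cuda import cuda as cu

cu.cuInit(0)
size = 2 << 20  # placeholder; must be padded to the allocation granularity
                # reported by cuMemGetAllocationGranularity

# 1. Describe an allocation that can be exported as a POSIX file descriptor.
prop = cu.CUmemAllocationProp()
prop.type = cu.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
prop.location.type = cu.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
prop.location.id = 0
prop.requestedHandleTypes = cu.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR

# 2. Allocate and export.
err, alloc = cu.cuMemCreate(size, prop, 0)
err, fd = cu.cuMemExportToShareableHandle(
    alloc, cu.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0)

# 3. Send fd to the other process (e.g. over a Unix socket with SCM_RIGHTS);
#    there, cuMemImportFromShareableHandle, cuMemAddressReserve, cuMemMap and
#    cuMemSetAccess map the allocation into that process's address space.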

gmarkall added the "bug" and "doc" labels and removed the "more info needed" label Nov 21, 2023
gmarkall (Member) commented

Not quite sure what to label this, because we don't have an "API is bad" label... So I'll just go with "bug" for now.


phalexo commented Nov 22, 2023

Not quite sure what to label this, because we don't have an "API is bad" label... So I'll just go with "bug" for now.

# This line does not work:
arr = cuda.to_device(np.ndarray(shape=(N,M), dtype=np.float32, buffer=shm.buf)).get_ipc_handle()

# This appears to work; np.array supposedly returns an instance of np.ndarray:
arr = cuda.to_device(np.array([ [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0] ], dtype=np.float32)).get_ipc_handle()

# Using this now:
with dct[cid] as ipc_array:
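
For what it's worth, the two constructions differ in where the data lives (a minimal sketch; whether this matters for get_ipc_handle() is an open question here):

import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(name="demo", create=True, size=4 * 4 * 5)

# View over externally owned memory: no copy, does not own its buffer.
view = np.ndarray(shape=(4, 5), dtype=np.float32, buffer=shm.buf)
print(view.flags['OWNDATA'])    # False

# Fresh allocation: owns its buffer, data copied in.
owned = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]] * 4, dtype=np.float32)
print(owned.flags['OWNDATA'])   # True

shm.close()
shm.unlink()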


phalexo commented Nov 22, 2023

@gmarkall There is some kind of strange dependence on the size of the array that needs to be shared. It is way below the GPU VRAM (12.2 GiB), but the code still throws an exception. I have zero idea what opening a handle has to do with the size of an array. The array is successfully created to begin with, and a handle is created too. It is just that opening the handles becomes a problem when the underlying array is larger.

thread/process ids, cid = 0, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 0, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 1, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 1, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 2, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 2, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 3, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 3, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE

gmarkall (Member) commented

Are you sure the original arrays you're sharing are not getting garbage collected due to going out of scope everywhere?


phalexo commented Nov 23, 2023

@gmarkall The arrays are created in the second-to-last line. I don't see why they would be going out of scope, since the pool should still be running until the results are returned.

if __name__ == "__main__":

    slaves = 2
    histogram_total = {}


    shm = shared_memory.SharedMemory(name="SharedVectors", create=True, size=4*N*M)
    data = np.ndarray(shape=(N,M), dtype=np.float32, buffer=shm.buf)
    main_cuda(data)
    #print(data[:10,:10])


    ctx = multiprocessing.get_context(method='forkserver')
    with ctx.Pool(processes=slaves) as pool:

        manager = ctx.Manager()
        queue = manager.Queue()
        dct = manager.dict()
        allocate_work(corpus_path, queue)
        print(corpus_path)

        # Set up d_data(s) buffer(s) on GPUS
        gpus = len(cuda.list_devices())
        PARTN = N//gpus
        for cid in range(gpus):
            cuda.select_device(cid)                    # bind device to thread
            dct[cid] = cuda.to_device(np.ascontiguousarray(data[:M,  cid * PARTN: cid * PARTN + PARTN])).get_ipc_handle()

        results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))
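
One detail worth noting: dct is a manager dict, so assigning the handle pickles it into the manager process, and the DeviceNDArray returned by cuda.to_device(...) on that line is a temporary that nothing in the parent keeps alive. A sketch of the same loop holding explicit references (a hypothetical fix, not confirmed):

d_arrays = {}                                  # parent-side references
for cid in range(gpus):
    cuda.select_device(cid)                    # bind device to thread
    d_arrays[cid] = cuda.to_device(np.ascontiguousarray(data[:M, cid * PARTN: cid * PARTN + PARTN]))
    dct[cid] = d_arrays[cid].get_ipc_handle()

results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))
# d_arrays is still in scope here, so the device allocations outlive pool.map()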


phalexo commented Nov 24, 2023

@gmarkall I am totally confused as to what is going on. It appears that I can allocate one buffer (the data I compute) of about 2.5 GiB (the GPUs have 12.2 GiB), but when I try to allocate a second buffer (output) of the same size, for a total of about 4.9 GiB (still way below the VRAM size), something breaks and I can no longer open those handles.

It does have something to do with garbage collection.

When I put gc.disable() at the beginning, I can allocate both buffers, and I don't get the exception opening the array handles.
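
That is, roughly:

import gc
gc.disable()  # keeps the collector from freeing the parent-side device arrays
              # while children hold their IPC handles (a workaround, not a fix)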

This does appear to be a genuine bug, not just a documentation problem.
