Using CUDA shared memory does not appear to work. #9293

Open · phalexo opened this issue Nov 20, 2023 · 11 comments

Labels: bug, CUDA (CUDA related issue/PR), doc

phalexo commented Nov 20, 2023

This appears to be the only documentation that I can find on how to share the same CUDA memory block between 2 or more
Linux processes. (https://numba.readthedocs.io/en/stable/cuda/ipc.html)

Although I can do something like

for cid in range(gpus):
    cuda.select_device(cid)  # bind device to thread
    dct[cid] = pickle.dumps(cuda.to_device(np.ascontiguousarray(data[0:M, cid * PARTN: cid * PARTN + PARTN])).get_ipc_handle())

I am unable to create an array over the same memory block in another process, like this:

gpus = len(cuda.list_devices())
PARTN = N//gpus
for cid in range(gpus):
    cuda.select_device(cid)                    # bind device to thread
    device = cuda.get_current_device()         # get current device
    try:
        with dct[cid] as ipc_array:
            d_data[cid] = cuda.open_ipc_array(ipc_array, shape=(M, PARTN), dtype=np.float32)
            #print(d_data[cid].copy_to_host()[:10,:10])
    except Exception as e:
        print(f"thread/process ids, cid = {cid}, i = {i}, {e}")

I get an exception. I have also tried passing "ipc_array" through pickle.loads(ipc_array), without any more success.

Since the documentation is really bad, it is not clear what is going on.
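
For reference, the pattern the linked page documents looks roughly like this (a minimal sketch based on that page; the parent has to keep the device array referenced while children use the handle):

from numba import cuda
import numpy as np
import pickle

# Parent process: allocate on the device and export an IPC handle.
arr = np.arange(10, dtype=np.float32)
d_arr = cuda.to_device(arr)            # keep this reference alive
handle = d_arr.get_ipc_handle()
payload = pickle.dumps(handle)         # IpcHandle objects are picklable

# Child process: reconstruct the handle and open it as a device array.
handle = pickle.loads(payload)
with handle as ipc_arr:                # opens on the child's CUDA context
    host_copy = ipc_arr.copy_to_host()
# leaving the with-block closes the handle in the child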

esc commented Nov 21, 2023

@phalexo thank you for your submission. Please update your example to fix the syntax formatting and add the appropriate imports, such that the example can be copied, pasted, and run without further code modifications. Thank you.

esc added the "more info needed" (This issue needs more information) and "CUDA" (CUDA related issue/PR) labels Nov 21, 2023
gmarkall self-assigned this Nov 21, 2023

phalexo commented Nov 21, 2023

@gmarkall

from numba import cuda
import multiprocessing
from multiprocessing import shared_memory
from functools import partial
import numpy as np
import sys, os

N = 50304
M = N

def allocate_work(queue):
    for i in range(30):
        queue.put(i)

def process_queue_items(args, i):

    (queue, dct) = args

    # Get the shared memory and set it up as a numpy array
    shm = shared_memory.SharedMemory("SharedVectors")
    data = np.ndarray((N,M), dtype=np.float32, buffer=shm.buf)

    # Set up d_data(s) buffer(s) on GPUS
    d_data = {}
    gpus = len(cuda.list_devices())
    PARTN = N//gpus
    for cid in range(gpus):
        cuda.select_device(cid)                    # bind device to thread
        device = cuda.get_current_device()         # get current device
        try:

            # We want to take in the CUDA handles that the parent process prepared with get_ipc_handle()
            # convert them to array to use normally

            #with dct[cid] as ipc_array: # One possible option, but does not work.
            with cuda.open_ipc_array(dct[cid], shape=(5,), dtype='float32') as ipc_array: # second option, still does not work.

                # I expect to get an array that I can use in a numba/cuda kernel
                d_data[cid] = ipc_array
                print(d_data[cid])

                #dt = d_data[cid].args[0].copy_to_host(stream=cuda.stream())

        except Exception as e:
            print(f"thread/process ids, cid = {cid}, i = {i}, {e}")

if __name__ == "__main__":

    slaves = 10

    shm = shared_memory.SharedMemory(name="SharedVectors", create=True, size=4*N*M)
    data = np.ndarray((N,M), dtype=np.float32, buffer=shm.buf)
    #main_cuda(data)

    ctx = multiprocessing.get_context(method='forkserver')
    with ctx.Pool(processes=slaves) as pool:

        manager = ctx.Manager()
        queue = manager.Queue()
        dct = manager.dict()
        allocate_work(queue)

        # Set up arrays on GPUs, one array per GPU
        # We want 10 pool processes to be able to access the 4 GPU buffers

        gpus = len(cuda.list_devices())
        PARTN = N//gpus
        for cid in range(gpus):
            cuda.select_device(cid)                    # bind device to thread
            # NB: np.ndarray([1,2,3,4,5], ...) treats the list as a *shape*,
            # creating an uninitialized 5-D array; np.array was probably
            # intended (this turns out to matter, see below).
            dct[cid] = cuda.to_device(np.ndarray([1,2,3,4,5], dtype='float32')).get_ipc_handle()

        results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))

    shm.close()
    shm.unlink()

gmarkall (Member) commented

@phalexo I just edited your comment - you need to use three backticks for multi-line code.

I'm just looking into this now.


phalexo commented Nov 21, 2023 via email

gmarkall (Member) commented

OK. I'd like to leave this issue open, because:

Since the documentation is really bad, it is not clear what is going on.

I partially agree with this: the documentation is bad. However, the API is also bad!

The APIs that this area of Numba wraps are known as "Legacy IPC", so I'm a little bit disinclined to actually try and improve it. There is another, more modern API for IPC, which might be usable directly with the NVIDIA CUDA Python bindings. I'll check into whether it's usable without any further modification, and post back with an update.
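
For context, the newer (non-legacy) flow at the driver level looks roughly like the outline below. This is an untested sketch against the NVIDIA cuda-python bindings: the exact names and signatures should be checked against their documentation, and size is a placeholder.

# Untested outline of the non-legacy IPC flow via the cuda-python bindings.
# Each binding call returns an (error, result) tuple.
from cuda import cuda as cu

cu.cuInit(0)
size = 2 << 20  # placeholder; must be padded to the allocation granularity
                # reported by cuMemGetAllocationGranularity

# 1. Describe an allocation that can be exported as a POSIX file descriptor.
prop = cu.CUmemAllocationProp()
prop.type = cu.CUmemAllocationType.CU_MEM_ALLOCATION_TYPE_PINNED
prop.location.type = cu.CUmemLocationType.CU_MEM_LOCATION_TYPE_DEVICE
prop.location.id = 0
prop.requestedHandleTypes = cu.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR

# 2. Allocate and export.
err, alloc = cu.cuMemCreate(size, prop, 0)
err, fd = cu.cuMemExportToShareableHandle(
    alloc, cu.CUmemAllocationHandleType.CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR, 0)

# 3. Send fd to the other process (e.g. over a Unix socket with SCM_RIGHTS);
#    there, cuMemImportFromShareableHandle, cuMemAddressReserve, cuMemMap and
#    cuMemSetAccess map the allocation into that process's address space.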

gmarkall added the "bug" and "doc" labels and removed the "more info needed" label Nov 21, 2023
gmarkall (Member) commented

Not quite sure what to label this, because we don't have an "API is bad" label... So I'll just go with "bug" for now.


phalexo commented Nov 22, 2023

Not quite sure what to label this, because we don't have an "API is bad" label... So I'll just go with "bug" for now.

# This line does not work:
arr = cuda.to_device(np.ndarray(shape=(N,M), dtype=np.float32, buffer=shm.buf)).get_ipc_handle()

# This appears to work; np.array supposedly returns an instance of np.ndarray:
arr = cuda.to_device(np.array([ [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0], [1.0,2.0,3.0,4.0,5.0] ], dtype=np.float32)).get_ipc_handle()

# Using this now:
with dct[cid] as ipc_array:
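
For what it's worth, the two constructions differ in where the data lives (a minimal sketch; whether this matters for get_ipc_handle() is an open question here):

import numpy as np
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(name="demo", create=True, size=4 * 4 * 5)

# View over externally owned memory: no copy, does not own its buffer.
view = np.ndarray(shape=(4, 5), dtype=np.float32, buffer=shm.buf)
print(view.flags['OWNDATA'])    # False

# Fresh allocation: owns its buffer, data copied in.
owned = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]] * 4, dtype=np.float32)
print(owned.flags['OWNDATA'])   # True

shm.close()
shm.unlink()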


phalexo commented Nov 22, 2023

@gmarkall There is some kind of strange dependence on the size of the array that needs to be shared. It is way below the GPU VRAM (12.2 GiB), but the code still throws an exception. I have zero idea what opening a handle has to do with the size of an array. The array is successfully created to begin with, and a handle is created too. It is just that opening the handles becomes a problem when the underlying array is larger.

thread/process ids, cid = 0, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 0, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 1, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 1, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 2, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 2, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 3, i = 0, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE
thread/process ids, cid = 3, i = 1, [1] Call to call_cuIpcOpenMemHandle results in CUDA_ERROR_INVALID_VALUE

gmarkall (Member) commented

Are you sure the original arrays you're sharing are not getting garbage collected due to going out of scope everywhere?


phalexo commented Nov 23, 2023

@gmarkall The arrays are created in the second-to-last line. I don't see why they would be going out of scope, since the pool should still be running until the results are returned.

if __name__ == "__main__":

    slaves = 2
    histogram_total = {}


    shm = shared_memory.SharedMemory(name="SharedVectors", create=True, size=4*N*M)
    data = np.ndarray(shape=(N,M), dtype=np.float32, buffer=shm.buf)
    main_cuda(data)
    #print(data[:10,:10])


    ctx = multiprocessing.get_context(method='forkserver')
    with ctx.Pool(processes=slaves) as pool:

        manager = ctx.Manager()
        queue = manager.Queue()
        dct = manager.dict()
        allocate_work(corpus_path, queue)
        print(corpus_path)

        # Set up d_data(s) buffer(s) on GPUS
        gpus = len(cuda.list_devices())
        PARTN = N//gpus
        for cid in range(gpus):
            cuda.select_device(cid)                    # bind device to thread
            dct[cid] = cuda.to_device(np.ascontiguousarray(data[:M,  cid * PARTN: cid * PARTN + PARTN])).get_ipc_handle()

        results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))
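
One detail worth noting: dct is a manager dict, so assigning the handle pickles it into the manager process, and the DeviceNDArray returned by cuda.to_device(...) on that line is a temporary that nothing in the parent keeps alive. A sketch of the same loop holding explicit references (a hypothetical fix, not confirmed):

d_arrays = {}                                  # parent-side references
for cid in range(gpus):
    cuda.select_device(cid)                    # bind device to thread
    d_arrays[cid] = cuda.to_device(np.ascontiguousarray(data[:M, cid * PARTN: cid * PARTN + PARTN]))
    dct[cid] = d_arrays[cid].get_ipc_handle()

results = pool.map(partial(process_queue_items, (queue, dct)), range(slaves))
# d_arrays is still in scope here, so the device allocations outlive pool.map()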


phalexo commented Nov 24, 2023

@gmarkall I am totally confused as to what is going on. It appears that I can allocate one buffer (the data I compute) of about 2.5 GiB (the GPUs have 12.2 GiB), but when I try to allocate a second buffer (output) of the same size, for a total of about 4.9 GiB (still way below the VRAM size), something breaks and I can no longer open those handles.

It does have something to do with garbage collection.

When I put gc.disable() at the beginning, I can allocate both buffers, and I don't get the exception opening the array handles.
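
That is, roughly:

import gc
gc.disable()  # keeps the collector from freeing the parent-side device arrays
              # while children hold their IPC handles (a workaround, not a fix)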

This does appear to be a genuine bug, not just a documentation problem.
