colocales+GPUs: some patterns may trigger context destroyed errors #26104

Open

e-kayrakli opened this issue Oct 16, 2024 · 2 comments

@e-kayrakli
Contributor

Consider:

coforall loc in Locales do on loc {
  coforall subloc in here.gpus do on subloc {
    var A : [0..10] int;
    foreach i in 0..10 do A[i] = i * 2;
    writeln("on ", here, " A = ", A);
  }
}

For reasons I cannot completely explain, this tries to free some GPU memory on a sublocale that doesn't exist.

When this code is run with -nl 1, everything works correctly. But with -nl 1x2, we observe that the mapping between device IDs and sublocale IDs is circumvented, which results in trying to free memory in a context that doesn't exist.

More details: Assume you have 4 GPUs per node. The device IDs reported by CUDA/HIP are 0-3. But because there are 2 colocales per node, each colocale's GPU sublocales are numbered 0-1. The latter IDs are what we need while using the GPU runtime, but we end up using the former.
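
To make the mismatch concrete, here is a minimal, self-contained C sketch (not code from the Chapel runtime; the 4-GPU/2-colocale figures and the assumption that devices are block-partitioned across colocales are just for illustration) that maps each node-wide device ordinal to the colocale that owns it and its GPU sublocale index there:

#include <stdio.h>

int main(void) {
  const int devicesPerNode = 4;   /* what CUDA/HIP enumerates: ordinals 0-3 */
  const int colocalesPerNode = 2; /* -nl 1x2 */
  const int devicesPerColocale = devicesPerNode / colocalesPerNode;

  for (int dev = 0; dev < devicesPerNode; dev++) {
    int colocale = dev / devicesPerColocale;  /* colocale that owns the device */
    int subloc   = dev % devicesPerColocale;  /* its GPU sublocale index there */
    printf("device ordinal %d -> colocale %d, GPU sublocale %d\n",
           dev, colocale, subloc);
  }

  /* Using a node-wide ordinal (e.g. 3) where a sublocale index (0 or 1)
   * is expected indexes past the contexts the colocale actually created. */
  return 0;
}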

@e-kayrakli
Contributor Author

The fix for this is most likely:

diff --git a/runtime/src/gpu/nvidia/gpu-nvidia.c b/runtime/src/gpu/nvidia/gpu-nvidia.c
index d7e93173f3..28f8847925 100644
--- a/runtime/src/gpu/nvidia/gpu-nvidia.c
+++ b/runtime/src/gpu/nvidia/gpu-nvidia.c
@@ -358,7 +365,8 @@ void chpl_gpu_impl_mem_free(void* memAlloc) {
     CUDA_CALL(cuPointerGetAttribute((void*)&dev_id,
                                     CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
                                     (CUdeviceptr)memAlloc));
-    switch_context(dev_id);
+    int ctx_idx = deviceIDToIndex[dev_id];
+    switch_context(ctx_idx);

 #ifdef CHPL_GPU_MEM_STRATEGY_ARRAY_ON_DEVICE
     if (chpl_gpu_impl_is_host_ptr(memAlloc)) {

@jhh67 is working on a related fix for a similar but different issue. We can either lump this fix into his PR, or I can file it separately soon.
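
The diff above assumes a per-colocale deviceIDToIndex table that translates the CUDA-reported device ordinal into the colocale's local context index before switch_context is called. Below is a hedged, plain-C sketch of how such a table could be populated when the colocale sets up its contexts; the helper name initDeviceTable and the assumption that each colocale owns a contiguous block of the node's devices are hypothetical, not taken from the actual runtime.

#include <stdio.h>

#define MAX_DEVICES 16

/* Maps a node-wide device ordinal (what cuPointerGetAttribute reports)
 * to this colocale's local context index; -1 marks devices this
 * colocale does not own. */
static int deviceIDToIndex[MAX_DEVICES];

/* Hypothetical initialization: the colocale owns a contiguous range of
 * the node's devices and creates one context per owned device. */
static void initDeviceTable(int firstOwnedDevice, int numOwnedDevices,
                            int numDevicesOnNode) {
  for (int d = 0; d < numDevicesOnNode; d++)
    deviceIDToIndex[d] = -1;
  for (int i = 0; i < numOwnedDevices; i++)
    deviceIDToIndex[firstOwnedDevice + i] = i;
}

int main(void) {
  /* Second colocale on a 4-GPU node: owns devices 2 and 3. */
  initDeviceTable(2, 2, 4);
  for (int d = 0; d < 4; d++)
    printf("device ordinal %d -> local context index %d\n",
           d, deviceIDToIndex[d]);
  return 0;
}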

@jhh67 jhh67 self-assigned this Oct 17, 2024
@bradcray
Member

The code snippet in the OP is an excerpt from test/gpu/native/multiLocale/onAllGpusOnAllLocales.chpl. If we were to add some GPU+co-locale testing, would that help with peace of mind around this going forward?
