colocales+GPUs: some patterns may trigger context destroyed errors #26104

Open

e-kayrakli opened this issue Oct 16, 2024 · 2 comments

@e-kayrakli
Contributor

Consider:

coforall loc in Locales do on loc {
  coforall subloc in here.gpus do on subloc {
    var A : [0..10] int;
    foreach i in 0..10 do A[i] = i * 2;
    writeln("on ", here, " A = ", A);
  }
}

For reasons I cannot completely explain, this tries to free some GPU memory on a sublocale that doesn't exist.

When this code is run with -nl 1, everything works correctly. But with -nl 1x2, we observe that the mapping between device IDs and sublocale IDs is circumvented, which results in trying to free memory in a context that doesn't exist.

More details: Assume you have 4 GPUs per node. The device IDs reported by CUDA/HIP are 0-3. But because there are 2 colocales per node, each colocale's GPU sublocales are numbered 0-1. The latter IDs are what we need while using the GPU runtime, but we end up using the former.
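
To make the mismatch concrete, here is a minimal, self-contained C sketch (not code from the Chapel runtime; the 4-GPU/2-colocale figures and the assumption that devices are block-partitioned across colocales are just for illustration) that maps each node-wide device ordinal to the colocale that owns it and its GPU sublocale index there:

#include <stdio.h>

int main(void) {
  const int devicesPerNode = 4;   /* what CUDA/HIP enumerates: ordinals 0-3 */
  const int colocalesPerNode = 2; /* -nl 1x2 */
  const int devicesPerColocale = devicesPerNode / colocalesPerNode;

  for (int dev = 0; dev < devicesPerNode; dev++) {
    int colocale = dev / devicesPerColocale;  /* colocale that owns the device */
    int subloc   = dev % devicesPerColocale;  /* its GPU sublocale index there */
    printf("device ordinal %d -> colocale %d, GPU sublocale %d\n",
           dev, colocale, subloc);
  }

  /* Using a node-wide ordinal (e.g. 3) where a sublocale index (0 or 1)
   * is expected indexes past the contexts the colocale actually created. */
  return 0;
}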

@e-kayrakli
Contributor Author

The fix for this is most likely:

diff --git a/runtime/src/gpu/nvidia/gpu-nvidia.c b/runtime/src/gpu/nvidia/gpu-nvidia.c
index d7e93173f3..28f8847925 100644
--- a/runtime/src/gpu/nvidia/gpu-nvidia.c
+++ b/runtime/src/gpu/nvidia/gpu-nvidia.c
@@ -358,7 +365,8 @@ void chpl_gpu_impl_mem_free(void* memAlloc) {
     CUDA_CALL(cuPointerGetAttribute((void*)&dev_id,
                                     CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
                                     (CUdeviceptr)memAlloc));
-    switch_context(dev_id);
+    int ctx_idx = deviceIDToIndex[dev_id];
+    switch_context(ctx_idx);

 #ifdef CHPL_GPU_MEM_STRATEGY_ARRAY_ON_DEVICE
     if (chpl_gpu_impl_is_host_ptr(memAlloc)) {

@jhh67 is working on a related fix for a similar but different issue. We can either lump this fix into his PR, or I can file it separately soon.
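
The diff above assumes a per-colocale deviceIDToIndex table that translates the CUDA-reported device ordinal into the colocale's local context index before switch_context is called. Below is a hedged, plain-C sketch of how such a table could be populated when the colocale sets up its contexts; the helper name initDeviceTable and the assumption that each colocale owns a contiguous block of the node's devices are hypothetical, not taken from the actual runtime.

#include <stdio.h>

#define MAX_DEVICES 16

/* Maps a node-wide device ordinal (what cuPointerGetAttribute reports)
 * to this colocale's local context index; -1 marks devices this
 * colocale does not own. */
static int deviceIDToIndex[MAX_DEVICES];

/* Hypothetical initialization: the colocale owns a contiguous range of
 * the node's devices and creates one context per owned device. */
static void initDeviceTable(int firstOwnedDevice, int numOwnedDevices,
                            int numDevicesOnNode) {
  for (int d = 0; d < numDevicesOnNode; d++)
    deviceIDToIndex[d] = -1;
  for (int i = 0; i < numOwnedDevices; i++)
    deviceIDToIndex[firstOwnedDevice + i] = i;
}

int main(void) {
  /* Second colocale on a 4-GPU node: owns devices 2 and 3. */
  initDeviceTable(2, 2, 4);
  for (int d = 0; d < 4; d++)
    printf("device ordinal %d -> local context index %d\n",
           d, deviceIDToIndex[d]);
  return 0;
}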

@jhh67 jhh67 self-assigned this Oct 17, 2024
@bradcray
Member

The code snippet in the OP is an excerpt from test/gpu/native/multiLocale/onAllGpusOnAllLocales.chpl. If we were to add some GPU+co-locale testing, would that help with peace of mind around this going forward?
