
Call GpuDeviceManager.shutdown when the executor plugin is shutting down #1713

Merged

Conversation

abellina (Collaborator) opened this pull request:

Signed-off-by: Alessandro Bellina abellina@nvidia.com

@gerashegalov noticed the following host memory leak in unit tests:

21/02/11 15:56:42.217 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2021-02-11 21:55:41.0075 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:283)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:199)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:34)
com.nvidia.spark.rapids.RapidsBufferCatalog$.init(RapidsBufferCatalog.scala:139)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeRmm(GpuDeviceManager.scala:262)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:292)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:126)

This was because the RapidsHostMemoryStore (and its pool) were not being shut down. The change included here removes the leak from the tests.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@sameerz added the bug (Something isn't working) label on Feb 11, 2021
@gerashegalov (Collaborator) left a comment:

LGTM

@@ -210,6 +210,7 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
   override def shutdown(): Unit = {
     GpuSemaphore.shutdown()
     PythonWorkerSemaphore.shutdown()
+    GpuDeviceManager.shutdown()
Collaborator commented on the added line:

I wonder if we should make these classes (Auto)Closeable? Then we could use Hadoop's IOUtils to call IOUtils.cleanup(null, GpuSemaphore, PythonWorkerSemaphore, GpuDeviceManager).
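
A minimal sketch of that suggestion, assuming (hypothetically) that the three singletons implemented java.io.Closeable by delegating close() to their existing shutdown() methods; the stand-in object names below are made up for illustration only. Hadoop's IOUtils.cleanup closes each argument in turn and logs, rather than rethrows, any failure:

import java.io.Closeable
import org.apache.hadoop.io.IOUtils

// Hypothetical stand-ins for GpuSemaphore, PythonWorkerSemaphore and
// GpuDeviceManager; each would delegate close() to its real shutdown().
object GpuSemaphoreLike extends Closeable {
  def shutdown(): Unit = ()                  // placeholder for the real teardown
  override def close(): Unit = shutdown()
}

object PythonWorkerSemaphoreLike extends Closeable {
  def shutdown(): Unit = ()
  override def close(): Unit = shutdown()
}

object GpuDeviceManagerLike extends Closeable {
  def shutdown(): Unit = ()
  override def close(): Unit = shutdown()
}

object CleanupSketch {
  def main(args: Array[String]): Unit = {
    // Closes every argument even if an earlier close() throws; failures are
    // logged by IOUtils instead of being propagated to the caller.
    IOUtils.cleanup(null, GpuSemaphoreLike, PythonWorkerSemaphoreLike, GpuDeviceManagerLike)
  }
}

Note that this swallows close failures, which differs from the safeClose behavior discussed in the reply below.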

abellina (Collaborator, PR author) replied:

So we actually have safeClose already. safeClose will call .close() on every member of the collection, but then will throw at the end if there was a problem closing any member. I can certainly do this as part of this PR; the classes we care about are exactly these three.
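
For illustration only, a rough sketch of the safeClose behavior described above (close every element, then throw at the end if any close failed); this is not the plugin's actual implementation, and the helper name is made up:

object SafeCloseSketch {
  // Hypothetical helper mirroring the described semantics: every resource gets
  // a close() call; the first failure is rethrown at the end, with any later
  // failures attached as suppressed exceptions.
  def safeCloseAll(resources: Seq[AutoCloseable]): Unit = {
    var firstFailure: Option[Throwable] = None
    resources.foreach { r =>
      try r.close()
      catch {
        case t: Throwable =>
          firstFailure match {
            case Some(f) => f.addSuppressed(t)
            case None    => firstFailure = Some(t)
          }
      }
    }
    firstFailure.foreach(t => throw t)
  }
}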

Member replied:

Agree not necessary for this PR.

abellina (Collaborator, PR author) replied:

OK, let's skip it for now; we can follow up with a cleanup PR.

@abellina (Collaborator, PR author) commented:

build

@abellina merged commit 47a880a into NVIDIA:branch-0.4 on Feb 12, 2021
@abellina deleted the call_shutdown_on_gpu_device_manager branch on February 12, 2021 01:50
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request on Jun 9, 2021:

Call GpuDeviceManager.shutdown when the executor plugin is shutting down (NVIDIA#1713)

* Call GpuDeviceManager.shutdown when the executor plugin is shutting down

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

* Update copyright