
Call GpuDeviceManager.shutdown when the executor plugin is shutting down #1713

Merged

Conversation

abellina (Collaborator) opened this pull request:

Signed-off-by: Alessandro Bellina abellina@nvidia.com

@gerashegalov noticed the following host memory leak in unit tests:

21/02/11 15:56:42.217 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2021-02-11 21:55:41.0075 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:283)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:199)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:34)
com.nvidia.spark.rapids.RapidsBufferCatalog$.init(RapidsBufferCatalog.scala:139)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeRmm(GpuDeviceManager.scala:262)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeMemory(GpuDeviceManager.scala:292)
com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:126)

This was because the RapidsHostMemoryStore (and its pool) were not being shut down. The change included here removes the leak from the tests.

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>
@sameerz added the bug (Something isn't working) label on Feb 11, 2021
@gerashegalov (Collaborator) left a comment:

LGTM

@@ -210,6 +210,7 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
   override def shutdown(): Unit = {
     GpuSemaphore.shutdown()
     PythonWorkerSemaphore.shutdown()
+    GpuDeviceManager.shutdown()
Collaborator commented on the added line:

I wonder if we should make these classes (Auto)Closeable? Then we could use Hadoop's IOUtils to call IOUtils.cleanup(null, GpuSemaphore, PythonWorkerSemaphore, GpuDeviceManager).
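
A minimal sketch of that suggestion, assuming (hypothetically) that the three singletons implemented java.io.Closeable by delegating close() to their existing shutdown() methods; the stand-in object names below are made up for illustration only. Hadoop's IOUtils.cleanup closes each argument in turn and logs, rather than rethrows, any failure:

import java.io.Closeable
import org.apache.hadoop.io.IOUtils

// Hypothetical stand-ins for GpuSemaphore, PythonWorkerSemaphore and
// GpuDeviceManager; each would delegate close() to its real shutdown().
object GpuSemaphoreLike extends Closeable {
  def shutdown(): Unit = ()                  // placeholder for the real teardown
  override def close(): Unit = shutdown()
}

object PythonWorkerSemaphoreLike extends Closeable {
  def shutdown(): Unit = ()
  override def close(): Unit = shutdown()
}

object GpuDeviceManagerLike extends Closeable {
  def shutdown(): Unit = ()
  override def close(): Unit = shutdown()
}

object CleanupSketch {
  def main(args: Array[String]): Unit = {
    // Closes every argument even if an earlier close() throws; failures are
    // logged by IOUtils instead of being propagated to the caller.
    IOUtils.cleanup(null, GpuSemaphoreLike, PythonWorkerSemaphoreLike, GpuDeviceManagerLike)
  }
}

Note that this swallows close failures, which differs from the safeClose behavior discussed in the reply below.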

abellina (Collaborator, PR author) replied:

So we actually have safeClose already. safeClose will call .close() on every member of the collection, but then will throw at the end if there was a problem closing any member. I can certainly do this as part of this PR; the classes we care about are exactly these three.
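
For illustration only, a rough sketch of the safeClose behavior described above (close every element, then throw at the end if any close failed); this is not the plugin's actual implementation, and the helper name is made up:

object SafeCloseSketch {
  // Hypothetical helper mirroring the described semantics: every resource gets
  // a close() call; the first failure is rethrown at the end, with any later
  // failures attached as suppressed exceptions.
  def safeCloseAll(resources: Seq[AutoCloseable]): Unit = {
    var firstFailure: Option[Throwable] = None
    resources.foreach { r =>
      try r.close()
      catch {
        case t: Throwable =>
          firstFailure match {
            case Some(f) => f.addSuppressed(t)
            case None    => firstFailure = Some(t)
          }
      }
    }
    firstFailure.foreach(t => throw t)
  }
}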

Member replied:

Agree not necessary for this PR.

abellina (Collaborator, PR author) replied:

OK, let's skip it for now; we can follow up with a cleanup PR.

@abellina (Collaborator, PR author) commented:

build

@abellina merged commit 47a880a into NVIDIA:branch-0.4 on Feb 12, 2021
@abellina deleted the call_shutdown_on_gpu_device_manager branch on February 12, 2021 01:50
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request on Jun 9, 2021:

Call GpuDeviceManager.shutdown when the executor plugin is shutting down (NVIDIA#1713)

* Call GpuDeviceManager.shutdown when the executor plugin is shutting down

Signed-off-by: Alessandro Bellina <abellina@nvidia.com>

* Update copyright