Commit
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager (#11161)

Contributes to NVIDIA/spark-rapids#5854

### Problem

A `RapidsHostMemoryStore.pool` leak error is logged when running the Rapids Accelerator test cases.

```
All tests passed.
22/06/27 17:45:57.298 Thread-7 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1 7f8557fff010)
22/06/27 17:45:57.303 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2022-06-27 09:45:16.0171 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:232)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:38)
```

### Root cause

`RapidsHostMemoryStore.pool` is not closed before `MemoryCleaner` checks for leaks. It is not actually a leak; it is caused by the execution order of the shutdown hooks.

`RapidsHostMemoryStore.pool` is closed in the [Spark executor plugin hook](https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/executor/Executor.scala#L351toL381).

```
plugins.foreach(_.shutdown()) // this line will eventually close the RapidsHostMemoryStore.pool
```

The close path is:

```
Spark executor plugin hook ->
  RapidsExecutorPlugin.shutdown ->
  GpuDeviceManager.shutdown ->
  RapidsBufferCatalog.close() ->
  RapidsHostMemoryStore.close ->
  RapidsHostMemoryStore.pool.close
```

Rapids Accelerator JNI also checks for leaks in a shutdown hook. Shutdown hooks are executed concurrently, and there is no guarantee on their execution order, so the leak check can run before the pool has been closed.

### Solution 1 - Not recommended

Just wait one second before checking for leaks in `MemoryCleaner`.
This changes only debug code that runs at shutdown, and has no impact on production code.

### Solution 2 - Not recommended

Spark has a utility class, `ShutdownHookManager`, which wraps shutdown hooks. It can [add a shutdown hook with a priority](https://github.com/apache/spark/blob/v3.3.0/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L152) via the Hadoop `ShutdownHookManager`:

```
def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {
```

Leveraging the Hadoop `ShutdownHookManager` as Spark does is feasible.

### Solution 3 - Recommended

Provide a method for the user to remove the hook and re-register it in a custom shutdown hook manager.

Signed-off-by: Chong Gao <res_life@163.com>

Authors:
  - Chong Gao (https://github.com/res-life)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)

URL: #11161
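
The idea behind Solution 3 can be illustrated with a minimal, self-contained Java sketch. Every class and method name here (`LeakCheckLibrary`, `removeDefaultShutdownHook`, `PriorityHookManager`) is hypothetical and chosen for illustration only; this is not the code from the change itself. The only real APIs used are the JDK's `Runtime.addShutdownHook` and `Runtime.removeShutdownHook`. The sketch assumes a library that registers its leak check as a plain JVM shutdown hook and lets the caller take that check back and run it after the resource cleanup.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a library that installs its own leak-check shutdown hook. */
final class LeakCheckLibrary {
    private static final Runnable CHECK = () -> System.out.println("checking for leaks");
    private static Thread defaultHook = new Thread(CHECK);

    static {
        // Default behavior: the check runs in a plain JVM shutdown hook,
        // with no ordering relative to other hooks.
        Runtime.getRuntime().addShutdownHook(defaultHook);
    }

    /**
     * Unregisters the default hook and hands the check back to the caller.
     * Returns null if the hook was already removed. Must be called before
     * JVM shutdown begins, otherwise removeShutdownHook throws.
     */
    static synchronized Runnable removeDefaultShutdownHook() {
        if (defaultHook == null) {
            return null;
        }
        Runtime.getRuntime().removeShutdownHook(defaultHook);
        defaultHook = null;
        return CHECK;
    }
}

/** Hypothetical manager that runs hooks one after another, highest priority first. */
final class PriorityHookManager {
    private static final class Entry {
        final int priority;
        final Runnable hook;

        Entry(int priority, Runnable hook) {
            this.priority = priority;
            this.hook = hook;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    synchronized void add(int priority, Runnable hook) {
        entries.add(new Entry(priority, hook));
    }

    /** Runs every registered hook on the calling thread in descending priority order. */
    synchronized void runAll() {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort(Comparator.comparingInt((Entry e) -> e.priority).reversed());
        for (Entry e : sorted) {
            e.hook.run();
        }
    }

    /** Installs one JVM shutdown hook that drives all registered hooks in order. */
    void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::runAll));
    }
}

final class ShutdownOrderDemo {
    public static void main(String[] args) {
        PriorityHookManager manager = new PriorityHookManager();
        // Resource cleanup gets the higher priority so it runs first.
        manager.add(100, () -> System.out.println("closing host memory pool"));
        // Take the leak check out of the unordered JVM hooks and re-register it.
        Runnable leakCheck = LeakCheckLibrary.removeDefaultShutdownHook();
        if (leakCheck != null) {
            manager.add(10, leakCheck);
        }
        manager.install();
    }
}
```

In this sketch, running `ShutdownOrderDemo` closes the pool before the leak check runs, because both now execute sequentially inside a single shutdown hook instead of in two independent ones.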