
Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager #11161

Merged
6 commits merged into rapidsai:branch-22.08 on Jul 7, 2022

Conversation

@res-life (Contributor) commented Jun 28, 2022:

Contributes to NVIDIA/spark-rapids#5854

Problem

A RapidsHostMemoryStore.pool leak error is logged when running the Rapids Accelerator test cases, even though all tests pass:

22/06/27 17:45:57.298 Thread-7 ERROR HostMemoryBuffer: A HOST BUFFER WAS LEAKED (ID: 1 7f8557fff010)
22/06/27 17:45:57.303 Thread-7 ERROR MemoryCleaner: Leaked host buffer (ID: 1): 2022-06-27 09:45:16.0171 UTC: INC
java.lang.Thread.getStackTrace(Thread.java:1559)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.MemoryBuffer.incRefCount(MemoryBuffer.java:232)
ai.rapids.cudf.MemoryBuffer.<init>(MemoryBuffer.java:98)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:196)
ai.rapids.cudf.HostMemoryBuffer.<init>(HostMemoryBuffer.java:192)
ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:144)
com.nvidia.spark.rapids.RapidsHostMemoryStore.<init>(RapidsHostMemoryStore.scala:38)

Root cause

RapidsHostMemoryStore.pool is not closed before MemoryCleaner checks for leaks.
It's not actually a leak; it's caused by the shutdown hook execution order.
RapidsHostMemoryStore.pool is closed in the Spark executor plugin shutdown hook.

plugins.foreach(_.shutdown())  // this line will eventually close the RapidsHostMemoryStore.pool

The close path is:

  Spark executor plugin hook ->
    RapidsExecutorPlugin.shutdown ->
      GpuDeviceManager.shutdown ->
        RapidsBufferCatalog.close() ->
          RapidsHostMemoryStore.close ->
            RapidsHostMemoryStore.pool.close

Rapids Accelerator JNI also checks for leaks in a shutdown hook.
Shutdown hooks are executed concurrently; there is no guarantee of execution order.
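For illustration, a minimal standalone sketch (not from this PR) of why plain java.lang.Runtime hooks cannot order the leak check after the pool close:

public class HookOrderDemo {
  public static void main(String[] args) {
    // Both hooks are started concurrently when the JVM exits; per the
    // Runtime javadoc their execution order is unspecified, so the leak
    // check may run before the pool is closed.
    Runtime.getRuntime().addShutdownHook(
        new Thread(() -> System.out.println("MemoryCleaner leak check")));
    Runtime.getRuntime().addShutdownHook(
        new Thread(() -> System.out.println("close RapidsHostMemoryStore.pool")));
  }
}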

Solution 1 - Not recommended

Just wait one second before checking for leaks in the MemoryCleaner.
It modifies only debug and closing code, and has no impact on production code.

Solution 2 - Not recommended

Spark has a utility class ShutdownHookManager, which is a shutdown hook wrapper.
It can add a shutdown hook with a priority via the Hadoop ShutdownHookManager:

def addShutdownHook(priority: Int)(hook: () => Unit): AnyRef = {

Leveraging the Hadoop ShutdownHookManager as Spark does is feasible.
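For illustration, a hedged sketch of priority-based registration via Hadoop's manager (assuming hadoop-common is on the classpath; the priority value 20 follows the discussion later in this thread):

import org.apache.hadoop.util.ShutdownHookManager;

public class PriorityHookDemo {
  public static void main(String[] args) {
    // ShutdownHookManager.get() is a JVM-wide singleton. Hooks with a
    // higher priority run first, so a priority of 20 runs after Spark's
    // hooks, which use larger values.
    ShutdownHookManager.get().addShutdownHook(
        () -> System.out.println("leak check, after Spark hooks"), 20);
  }
}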

Solution 3 - Recommended

Provide a method for the user to remove the hook and re-register it in a custom shutdown hook manager.
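A hedged sketch of the intended usage (removeDefaultShutdownHook() is the method this PR adds, per the review thread below; the Hadoop re-registration mirrors the RAPIDS Accelerator side change):

public class CustomHookRegistration {
  public static void main(String[] args) {
    // Remove the leak-checking hook from the Java Runtime's default hooks...
    Runnable leakCheck = ai.rapids.cudf.MemoryCleaner.removeDefaultShutdownHook();
    // ...and re-register it in a custom manager (here Hadoop's singleton)
    // with a priority low enough to run after Spark's hooks.
    org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(leakCheck, 20);
  }
}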

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life requested a review from a team as a code owner June 28, 2022 01:35
@github-actions github-actions bot added the Java Affects Java cuDF API. label Jun 28, 2022
@res-life res-life changed the title from "Add sleep in the shutdown hook where checking the leak" to "Add sleep in the MemoryCleaner shutdown hook to avoid false leaks" Jun 28, 2022
codecov bot commented Jun 28, 2022:

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@e0003a0).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.08   #11161   +/-   ##
===============================================
  Coverage                ?   86.29%           
===============================================
  Files                   ?      144           
  Lines                   ?    22698           
  Branches                ?        0           
===============================================
  Hits                    ?    19588           
  Misses                  ?     3110           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@res-life res-life added bug Something isn't working non-breaking Non-breaking change labels Jun 28, 2022
@gerashegalov (Contributor):
IMO: Using sleep in tests is an anti-pattern, using sleep in production code to make the test code work is a worse anti-pattern.

The Shutdown Hook Manager in hadoop-common predates Spark's. We can either switch hadoop-common from test to compile scope or simply copy the idea to cudf-java.

@revans2 (Contributor) commented Jun 28, 2022:

I agree a sleep is not what we want.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life (Contributor, Author):
> IMO: Using sleep in tests is an anti-pattern, using sleep in production code to make the test code work is a worse anti-pattern.
>
> The Shutdown Hook Manager in hadoop-common predates Spark's. We can either switch hadoop-common from test to compile scope or simply copy the idea to cudf-java.

Good idea, thanks.
Leveraging the Hadoop ShutdownHookManager as Spark does is feasible.

@res-life (Contributor, Author):

This introduces a compile-scope dependency; does it impact other projects?

If cuDF (Java) is only used by the RAPIDS Accelerator it's OK, because the Spark runtime classpath definitely contains the
org.apache.hadoop.util.ShutdownHookManager class.

@res-life res-life changed the title from "Add sleep in the MemoryCleaner shutdown hook to avoid false leaks" to "Use Hadoop ShutdownHookManager to make sure leak checking is run after the Spark hooks" Jun 29, 2022
java/pom.xml Outdated
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.2.3</version>
- <scope>test</scope>
+ <scope>compile</scope>
Contributor:

I prefer provided to compile according to your comment #11161 (comment)

Contributor:

Agree about 'provided' to avoid pulling in Hadoop in the shade plug-in

Contributor (Author):

The Databricks classpath may not contain org.apache.hadoop.util.ShutdownHookManager. @gerashegalov

Contributor:

May not? We should see this at compile time during the Databricks build. Please add [databricks] to the PR title to double-check.

Contributor (Author):

Done.

@@ -214,10 +214,24 @@ static void setDefaultGpu(int defaultGpuId) {
if (defaultGpu >= 0) {
Cuda.setDevice(defaultGpu);
}

for (CleanerWeakReference cwr : all) {
@firestarman (Contributor) commented Jun 29, 2022:

I just got another possible solution.
Cleaner has an isClean API; you can call it to do a check before the real cleaning in CleanerWeakReference.clean to avoid duplicate cleaning.
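A minimal sketch of that idea, using hypothetical stand-ins for the cudf-java internals (this is not the approach the PR ultimately took):

// Hypothetical shapes, for illustration only.
interface Cleaner {
  boolean isClean();
  void clean(boolean logErrorIfNotClean);
}

class CleanerWeakReferenceSketch {
  private final Cleaner cleaner;

  CleanerWeakReferenceSketch(Cleaner cleaner) {
    this.cleaner = cleaner;
  }

  // Checking isClean() first means the shutdown-time pass neither
  // double-cleans nor logs for resources that were already closed.
  void clean() {
    if (!cleaner.isClean()) {
      cleaner.clean(true);
    }
  }
}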

Contributor (Author):

This bug was caused by a leak, not by duplicate cleaning.
The resources should have been closed by cleaner.clean(logErrorIfNotClean = false), triggered by a withResource clause, before this leak check runs:

// leak checking in shutdown hook:
for (CleanerWeakReference cwr : all) {
  // this invokes cleaner.clean(logErrorIfNotClean = true);
  // if the resource was not closed before, it closes it and then prints an error log
  cwr.clean(); 
}

@res-life (Contributor, Author):

> simply copy the idea to cudf-java.

This doesn't seem to work. The Spark hooks and this hook would have to reside in the same Hadoop ShutdownHookManager.

@res-life res-life marked this pull request as draft June 29, 2022 03:14
@res-life (Contributor, Author):

rerun tests

@res-life (Contributor, Author):

Tested Databricks 312 and 321; this PR works.

The Hadoop ShutdownHookManager is provided in:

databricks 312
/databricks/jars/----workspace_spark_3_1--maven-trees--hive-2.3__hadoop-2.7--org.apache.hadoop--hadoop-common--org.apache.hadoop__hadoop-common__2.7.4.jar

databricks 321
/databricks/jars/----workspace_spark_3_2--maven-trees--hive-2.3__hadoop-3.2--org.apache.hadoop--hadoop-client-api--org.apache.hadoop__hadoop-client-api__3.3.1-databricks.jar

@res-life res-life marked this pull request as ready for review June 30, 2022 10:03
@res-life (Contributor, Author):

The CUDA build failed; it seems it was not introduced by this PR.

// Here we also use `Hadoop ShutdownHookManager` to add a lower-priority hook.
// A priority of 20 is small enough to run after the Spark hooks.
// Note: `ShutdownHookManager.get()` is a singleton, so Spark and JNI both use the same instance
org.apache.hadoop.util.ShutdownHookManager.get().addShutdownHook(hook, 20);
Contributor:

I really don't like using the Hadoop ShutdownHookManager here. To solve the problem, all of the shutdown hooks that might conflict with each other must use this manager too, which in some cases they do, but in other cases they will not. This is a library, and it is not Hadoop specific.

I don't want to over engineer this, but I personally would like to see a way for the user of the library to provide the way for the shutdown code to run, and if none is provided then the regular java Runtime is used. But this is hard because all of the code is running inside a static block. We might need to first register the hook with the java Runtime and then remove it if/when someone tells us to use a different shutdown mechanism.

Contributor:

1. We can create a Runnable hook without automatically adding it to the Runtime.

2. We can add a Boolean system property for whether to add the shutdown hook directly to the Runtime.

3. Then have a contract MemoryCleaner.getCleanerHook() that returns non-null from 1 if the Boolean from 2 is false (see the sketch after this list).
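A hedged sketch of this three-part contract (the property name and hook body are hypothetical; the merged PR settled on a different API):

public class MemoryCleanerSketch {
  // 1. A Runnable hook that is not automatically added to the Runtime.
  private static final Runnable CLEANER_HOOK = () -> {
    // shutdown-time leak check would go here
  };

  // 2. A Boolean system property controlling direct Runtime registration.
  private static final boolean ADD_HOOK_TO_RUNTIME =
      Boolean.getBoolean("ai.rapids.refcount.debug.addShutdownHook");

  static {
    if (ADD_HOOK_TO_RUNTIME) {
      Runtime.getRuntime().addShutdownHook(new Thread(CLEANER_HOOK));
    }
  }

  // 3. Returns non-null only when the hook was not auto-registered,
  // so the caller can register it in their own shutdown manager.
  public static Runnable getCleanerHook() {
    return ADD_HOOK_TO_RUNTIME ? null : CLEANER_HOOK;
  }
}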

Contributor:

That sounds like a great solution to me.

Contributor (Author):

We already have one Boolean, REF_COUNT_DEBUG:
https://github.com/rapidsai/cudf/blob/branch-22.08/java/src/main/java/ai/rapids/cudf/MemoryCleaner.java#L203

    if (REF_COUNT_DEBUG) {
      // If we are debugging things do a best effort to check for leaks at the end
      Runtime.getRuntime().addShutdownHook(new Thread(() -> {
......
        for (CleanerWeakReference cwr : all) {
          cwr.clean();
        }
      }));
    }

If REF_COUNT_DEBUG is disabled, there are no fake error logs.
We could create a new Boolean like REF_COUNT_DEBUG_WHEN_CLOSE to replace it,
but the problem would remain whenever the Boolean is enabled.

Contributor:

In my interpretation ai.rapids.refcount.debug is here to determine whether to check for leaks at shutdown time, and not about how to achieve this.

So either another Boolean switch to the tune of ai.rapids.refcount.debug.addShutdownHook
or we can change ai.rapids.refcount.debug to a String property with valid values to the tune of "none", "withInternalShutdownHook", "withExternalShutdownHook" (see the sketch below). We can use an Enum to manage them.
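A hedged sketch of the enum idea (values from the comment above; the String form of the property is hypothetical and the PR did not adopt this):

enum RefCountDebugMode {
  NONE, WITH_INTERNAL_SHUTDOWN_HOOK, WITH_EXTERNAL_SHUTDOWN_HOOK;

  // Parse the (hypothetical) String form of ai.rapids.refcount.debug.
  static RefCountDebugMode fromProperty() {
    switch (System.getProperty("ai.rapids.refcount.debug", "none")) {
      case "withInternalShutdownHook": return WITH_INTERNAL_SHUTDOWN_HOOK;
      case "withExternalShutdownHook": return WITH_EXTERNAL_SHUTDOWN_HOOK;
      default: return NONE;
    }
  }
}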

@res-life res-life marked this pull request as draft July 5, 2022 03:31
Signed-off-by: Chong Gao <res_life@163.com>
@res-life (Contributor, Author) commented Jul 6, 2022:

Pushed a new commit:
Provided a method for the user to remove the hook and register the hook in a custom shutdown hook manager

RAPIDS Accelerator side change:
NVIDIA/spark-rapids#5955

@res-life res-life marked this pull request as ready for review July 6, 2022 09:02
cwr.clean();
}
}));
Runtime.getRuntime().addShutdownHook(DEFAULT_SHUTDOWN_THREAD);
Contributor:

pro: we are getting away without adding more switches
con: we are risking race conditions depending on the upper layer

* Runnable used to be added to Java default shutdown hook.
* It checks the leaks at shutdown time.
* If you want to register the Runnable in a custom shutdown hook manager instead of Java default shutdown hook,
* should first remove it and then add it.
@gerashegalov (Contributor) commented Jul 6, 2022:
link to the API for remove

Suggested change
* should first remove it and then add it.
* should first remove it using {@link #removeDefaultShutdownHook()} and then add it.

Contributor (Author):

Done

}
}

public static void removeDefaultShutdownHook() {
Contributor:

Javadoc for the public method?

To reduce the risk of race conditions we could change the signature of this method:

public static Runnable removeDefaultShutdownHook()

After removing DEFAULT_SHUTDOWN_THREAD it would return DEFAULT_SHUTDOWN_RUNNABLE, which the caller can register in their shutdown manager.

DEFAULT_SHUTDOWN_RUNNABLE can then be private.
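A hedged sketch of that suggested shape (field names follow the comment above; the hook body is elided):

public class MemoryCleanerHookSketch {
  private static final Runnable DEFAULT_SHUTDOWN_RUNNABLE = () -> {
    // shutdown-time leak check
  };

  private static final Thread DEFAULT_SHUTDOWN_THREAD =
      new Thread(DEFAULT_SHUTDOWN_RUNNABLE);

  static {
    Runtime.getRuntime().addShutdownHook(DEFAULT_SHUTDOWN_THREAD);
  }

  /**
   * Removes the default leak-checking hook from the Java Runtime and returns
   * the underlying Runnable so the caller can register it in a custom
   * shutdown hook manager.
   */
  public static Runnable removeDefaultShutdownHook() {
    Runtime.getRuntime().removeShutdownHook(DEFAULT_SHUTDOWN_THREAD);
    return DEFAULT_SHUTDOWN_RUNNABLE;
  }
}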

Contributor (Author):

Done

@res-life res-life changed the title from "Use Hadoop ShutdownHookManager to make sure leak checking is run after the Spark hooks" to "Provides a method for the user to remove the hook and re-register the hook in a custom shutdown hook manager" Jul 7, 2022
@res-life (Contributor, Author) commented Jul 7, 2022:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8426a99 into rapidsai:branch-22.08 Jul 7, 2022
@res-life res-life deleted the fix-false-leak-error branch July 7, 2022 05:45