Fix memory leak in time zone DB #1689
Conversation
Signed-off-by: Chong Gao <res_life@163.com>
My debug logs show `doLoadCache` runs twice.
```java
// Recreate a new instance to reload the database if necessary
instance = new GpuTimeZoneDB();
```

```java
public synchronized void shutdownImpl() {
  if (!cacheEmpty) {
```
Do we have a potential race condition here where a shutdown occurs very quickly after a startup, so quickly (or database load is so slow) that we don't think the cache is there but it will be once the async thread completes? Seems like we need to also check if there's a pending async load and wait on it if that's the case.
The cache APIs are synchronized, so multiple calls are serialized and each waits for the others to finish:

```java
private synchronized void cacheDatabaseImpl() {
private synchronized void shutdownImpl() {
```
`synchronized` doesn't solve the race I'm talking about. All `synchronized` does is guarantee two threads aren't in that method at the same time, but it does nothing to ensure what order those threads call it. If two or more threads are trying to do the same thing (e.g., initialize the cache), we're fine, since we don't really care about the order of the threads: the operation they are performing is identical and idempotent. However, if the threads are doing different operations, then we care very much about the order of operations.
If one thread is trying to start the cache and the other is trying to shutdown, we could leave the database in a cached state despite trying to shut it down. Here's a simplified example:
- On startup, we launch the background thread to initialize the cache
- Before the background thread is able to boot up and call the cache init method, the shutdown method is called (e.g.: very quick test that doesn't involve time DB and tears down on test completion, error on startup, whatever)
- shutdown is called first, notices the cache is not initialized, does nothing.
- background thread finally gets around to calling cache init (note there is no guarantee how quick this happens) and proceeds to initialize the cache
- Now we're in a state where the application thinks the time DB cache has been shutdown and freed, but in reality it has not and we've "leaked" it in a sense.
One way to solve this is to set a boolean flag before we start the background thread. The background caching thread would atomically clear this flag and call the cache init method when it runs. The shutdown method would check for this flag being set, and if it is, release the lock and wait on the background thread to complete before proceeding to retry the shutdown. We can optimize it a bit by having shutdown set a flag after the cache clear has occurred indicating a shutdown has happened. The background init thread can check for this flag being set and skip cache init when that happens.
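A minimal sketch of this flag scheme, under stated assumptions: `TimeZoneCache` and all of its member names are hypothetical stand-ins for the real `GpuTimeZoneDB`, and the actual transition-table load and host-memory free are reduced to a boolean.

```java
// Hypothetical sketch of the two-flag approach described above;
// not the actual GpuTimeZoneDB implementation.
class TimeZoneCache {
    private boolean cached = false;
    private boolean loadPending = false;    // set before the background thread starts
    private boolean shutdownCalled = false; // the optimization flag from above
    private Thread loader = null;

    public synchronized void startAsync() {
        if (cached || loadPending) {
            return;
        }
        loadPending = true;                 // flag first, then launch the thread
        loader = new Thread(this::loadCache);
        loader.start();
    }

    private synchronized void loadCache() {
        loadPending = false;                // atomically cleared under the lock
        if (shutdownCalled) {
            return;                         // shutdown raced ahead of us: skip init
        }
        cached = true;                      // stands in for the real transition load
    }

    public void shutdown() {
        Thread pending;
        synchronized (this) {
            pending = loadPending ? loader : null;
        }
        if (pending != null) {
            try {
                pending.join();             // wait outside the lock before retrying
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        synchronized (this) {
            shutdownCalled = true;
            cached = false;                 // stands in for freeing the host memory
        }
    }

    public synchronized boolean isCached() {
        return cached;
    }
}
```

The key points are that `loadPending` is set before the thread starts, cleared under the lock by the loader, and that `shutdown` joins the loader outside the lock so the loader can still acquire it.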
A simpler approach is to have a boolean flag indicating a shutdown has occurred. Caching the database will clear this flag. The background thread grabs the lock and avoids initializing the cache if the shutdown flag has been set. The shutdown method always waits on the background init thread (outside of the lock) if it's not null, setting it to null afterwards, and sets the shutdown flag once the cache is cleared.
The cache is a global singleton, and we do not want to shut it down unless the plugin is shut down.
It is almost never the case that one thread is caching the database while another is shutting it down concurrently.
Currently test cases do not call cacheDatabase/shutdown concurrently, see the following:
```java
public class TimeZoneTest {
  @BeforeAll
  static void cacheTimezoneDatabase() {
    GpuTimeZoneDB.cacheDatabase();
  }

  @AfterAll
  static void cleanup() {
    GpuTimeZoneDB.shutdown();
  }
}
```
One potential corner case: another time zone test, say `TimeZoneTest2`, does the same as `TimeZoneTest`. The following sequence causes a stale cache, as you mentioned:
- TimeZoneTest.cache
- do TimeZoneTest.test1
- TimeZoneTest2.cache
- TimeZoneTest.shutdown
- run TimeZoneTest2.test2: the cache has already been shut down, which causes a problem.
- TimeZoneTest2.shutdown
Typical production usage is:
- Call `cacheDatabaseAsync` when starting the plugin.
- Call `shutdown` when stopping the plugin.
Another solution: remove `shutdown` and mark the host memory column as `noWarnLeakExpected`, which disables the leak warning for the global singleton host column. When the JVM exits, the host memory column will be cleaned up.
Do we need the implementation you mentioned for this rare use case?
I'm not worried about the production use-case but rather the potential corner cases with tests that may shut down and restart the database. It's really not that hard to fix this race condition (I posted the details for two approaches above, happy to provide the patch for either), so why not close that race? Even if we can prove to ourselves it's never going to be a race today, it doesn't guarantee someone won't change the way this is used to expose the race in the future.
got it, I'll fix it.
> One way to solve this is to set a boolean flag before we start the background thread. The background caching thread would atomically clear this flag and call the cache init method when it runs. The shutdown method would check for this flag being set, and if it is, release the lock and wait on the background thread to complete before proceeding to retry the shutdown. We can optimize it a bit by having shutdown set a flag after the cache clear has occurred indicating a shutdown has happened. The background init thread can check for this flag being set and skip cache init when that happens.
Now using this approach you mentioned. Please review.
Changes are:
- Moved all sync code into static methods.
- Made the changes you mentioned.
Relatively minor nits. I think the locking is overly complicated and would like to see it simplified, but I also think it technically works as written so won't block it.
```java
 * `cacheDatabaseAsync`, `cacheDatabase` and `shutdown` are synchronized.
 * When cacheDatabaseAsync is running, the `shutdown` and `cacheDatabase` will wait;
 * These APIs guarantee only one thread is loading transitions cache,
 * And guarantee loading cache only occurs one time.
```
This is not technically true and not desired in the unit testing environment. What's guaranteed here is that we're only going to try to asynchronously load the cache once, but synchronous cache loads (e.g.: cache API called after a shutdown) are still going to work.
```java
static class LoadingLock {
  Boolean isLoading = false;

  // record whether a shutdown was ever called.
  // if `isShutdownCalledEver` is true, then the following loading should be skipped.
  Boolean isShutdownCalledEver = false;
}
```
I'm not sure we need a separate lock class for this rather than locking the whole database via synchronized methods as was done previously. We just need to lock the GpuTimeZoneDB instance when manipulating the boolean flags which can be direct members of the database. cacheDatabaseAsync, cacheDatabaseImpl and shutdown can just be synchronized methods, which removes the need for most of them to wait on threads -- the wait will be implicit in obtaining the database lock for the synchronized method. The only wait we need is in the shutdown method, and only when isLoading is true.
By leaving the methods synchronized rather than having a separate lock class, it's a little easier to reason about the synchronization safety because we don't have to keep track of whether we're holding LoadingLock at the right times.
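A sketch of what that simplification could look like, under stated assumptions: `TzDb` and its member names are hypothetical, not the actual GpuTimeZoneDB code, and the real load/free work is reduced to a boolean. The flags are plain fields guarded by the instance lock that the synchronized methods already take.

```java
// Hypothetical sketch of the reviewer's suggestion: flags as direct members,
// synchronized methods, and a single explicit wait in shutdown while loading.
class TzDb {
    private boolean cached = false;
    private boolean loading = false;   // true while an async load is in flight
    private Thread loader = null;

    public synchronized void cacheDatabaseAsync() {
        if (cached || loading) {
            return;                    // already loaded, or a load is pending
        }
        loading = true;                // set under the database lock, before the thread starts
        loader = new Thread(this::cacheDatabaseImpl);
        loader.start();
    }

    private synchronized void cacheDatabaseImpl() {
        loading = false;               // cleared under the same lock
        if (!cached) {
            cached = true;             // stands in for loading the transition table
        }
    }

    public void shutdown() {
        Thread t = null;
        synchronized (this) {
            if (loading) {
                t = loader;            // the only explicit wait: a load is still pending
            }
        }
        if (t != null) {
            try {
                t.join();              // join outside the lock so the loader can acquire it
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        synchronized (this) {
            cached = false;            // stands in for freeing the host memory
            loader = null;
        }
    }

    public synchronized boolean isCached() {
        return cached;
    }
}
```

Every other wait is implicit: a caller of `cacheDatabaseImpl` or `cacheDatabaseAsync` simply blocks on the instance lock.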
closes NVIDIA/spark-rapids#10169
closes #1571
Root cause
My debug logs show `doLoadCache` runs twice. The implementation allows `doLoadCache` to run twice, so the first cache leaks.
This part is not synchronized: when running the case `test_cast_timestamp_to_date`, there are multiple threads running `fromUtcTimestampToTimestamp` concurrently, so `doLoadCache` runs multiple times. The code starts a new thread, and the synchronized scope does not cover it.
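To illustrate that broken pattern (this is a hypothetical model, not the actual plugin code): the method is `synchronized`, but the work it spawns runs on a new thread outside the lock, so two callers can both observe an empty cache and both load it. A `CountDownLatch` makes the race deterministic for this demo; all names here are made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Demo of a synchronized method whose real work escapes the lock by
// running on a freshly spawned thread.
class BrokenCache {
    final AtomicInteger loads = new AtomicInteger();
    // forces both threads past the null check before either assigns,
    // making the double-load deterministic for this demo
    private final CountDownLatch bothChecked = new CountDownLatch(2);
    private volatile Object cache = null;

    public synchronized Thread cacheDatabaseAsync() {
        // the lock is released as soon as this method returns; the loading
        // below runs on a new thread, outside the synchronized scope
        Thread t = new Thread(() -> {
            if (cache == null) {                 // unsynchronized check-then-act
                bothChecked.countDown();
                try {
                    bothChecked.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                cache = doLoadCache();           // both racers load: one copy leaks
            }
        });
        t.start();
        return t;
    }

    private Object doLoadCache() {
        loads.incrementAndGet();                 // count how many loads happened
        return new Object();
    }

    // runs two concurrent async caches and returns the number of loads
    static int demo() {
        BrokenCache c = new BrokenCache();
        Thread t1 = c.cacheDatabaseAsync();
        Thread t2 = c.cacheDatabaseAsync();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return c.loads.get();                    // deterministically 2 here: a leak
    }
}
```

Moving the null check and the assignment under a lock shared by both threads (e.g. a synchronized `doLoadCache` guarding a check of `cache`) closes the window.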
Changes
- Make the `GpuTimeZoneDB` instance final, because it's a synced object; final makes it safer.
- Add `cacheDatabaseAsync` for the plugin to invoke.
- Add a `cacheEmpty` bool to avoid caching the transition rules multiple times.

Now there are no leaks, because `shutdown` closes the singleton memory.

Signed-off-by: Chong Gao <res_life@163.com>