Allow checkpoint and restore on non-deterministic expressions in GpuFilter and GpuProject #9287
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Really just some nits on the docs, and they are only nits because I am thinking ahead to when we hopefully allow this for GpuUdfs.
package com.nvidia.spark;

/**
 * An interface that can be used by Retry framework of RAPIDS Plugin to handle the GPU OOMs.
I think this is too much detail for an end user. Could we adjust it a bit? Perhaps something more like:
An interface that can be used to retry the processing of non-deterministic expressions on the GPU.
GPU memory is a limited resource. When it runs out, the RAPIDS Accelerator for Apache Spark will
use several different strategies to try to free more GPU memory so the query can complete. One of these
strategies is to roll back the processing for one task, pause that task's thread, and then retry
the task when more memory is available. This works transparently for any stateless deterministic
processing. But technically an expression/UDF can be non-deterministic and/or keep state in
between calls. This interface provides a checkpoint method to save any needed state, and a
restore method to reset the state in the case of a retry. Please note that a retry is not isolated to
a single expression, so a restore can be called even after the expression returned
one or more batches of results.
Each time checkpoint is called, any previously saved state can be overwritten.
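The checkpoint/restore contract described in this suggested doc comment can be illustrated with a minimal Java sketch. Only the method names `checkpoint` and `restore` come from the discussion; the nested `Retryable` interface here is a stand-in for the one under review, and `ToyRandom`, `batch`, and the xorshift step are hypothetical and are not the plugin's `GpuRand` or `RapidsXORShiftRandom`.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryableSketch {
    /** Stand-in for the interface under review (assumed shape: two no-arg methods). */
    interface Retryable {
        void checkpoint();
        void restore();
    }

    /** Hypothetical stateful generator; its output depends on state kept between calls. */
    static final class ToyRandom implements Retryable {
        private long state;
        private long saved;

        ToyRandom(long seed) {
            // A xorshift generator needs a non-zero state.
            this.state = (seed == 0L) ? 1L : seed;
            this.saved = this.state;
        }

        long next() {
            // One xorshift64 step; used only to give the toy some evolving state.
            state ^= state << 13;
            state ^= state >>> 7;
            state ^= state << 17;
            return state;
        }

        // Each checkpoint overwrites whatever was saved before.
        @Override public void checkpoint() { saved = state; }

        // Restore rewinds to the most recent checkpoint.
        @Override public void restore() { state = saved; }
    }

    /** Produces one "batch" of values from the generator. */
    static List<Long> batch(ToyRandom rng, int rows) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            out.add(rng.next());
        }
        return out;
    }

    public static void main(String[] args) {
        ToyRandom rng = new ToyRandom(42L);
        rng.checkpoint();
        // Two batches are returned before the simulated retry happens.
        List<Long> first = batch(rng, 3);
        List<Long> second = batch(rng, 3);
        // Simulated retry: restore is called after results were already produced.
        rng.restore();
        List<Long> firstAgain = batch(rng, 3);
        List<Long> secondAgain = batch(rng, 3);
        System.out.println(first.equals(firstAgain) && second.equals(secondAgain)); // prints "true"
    }
}
```

The sketch keeps a single saved slot, which matches the note that a new checkpoint may overwrite the previous one; an implementation that needed older states would have to store them itself.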
sure, done
 */
public interface Retryable {
  /**
   * Save the state, so it can be restored in case of an OOM Retry.
nit: can we drop the OOM here?
Save the state so it can be restored in the case of a retry.
sure, done
Looks good
fix #7865
This PR adds support for `checkpoint` and `restore` on non-deterministic expressions in `GpuFilter` and `GpuProject`.
- `Retryable`: an interface in Java, so that Java UDFs can be supported for end users in the future. A non-deterministic expression can implement this interface to make itself retryable.
- `RapidsXORShiftRandom`: provides access to the internal hashed seed. `GpuRand` uses this to implement the `Retryable` interface.
- `CheckpointRestore` trait: replaced by the new `Retryable` interface.
- `GpuRand`: supports `checkpoint` and `restore`.
- `GpuExpression`: two members, `selfNonDeterministic` and `retryable`. `retryable` tells whether an expression is retryable, and it covers the expression's children. `selfNonDeterministic` indicates whether the expression itself, excluding its children, is non-deterministic when its `deterministic` is false. An expression is actually a tree, and `deterministic` being false only means that at least one node in the tree is non-deterministic. We need to know exactly which nodes are non-deterministic in order to check whether they implement `Retryable`, so `selfNonDeterministic` was created.
TODO
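The distinction the description draws between `deterministic`, `selfNonDeterministic`, and `retryable` can be sketched as a small tree walk. This is a minimal Java illustration and not the plugin's `GpuExpression`: the `Expr` and `RetryableRand` classes are hypothetical, and the rule used for `retryable()` (every node that is itself non-deterministic must implement `Retryable`) is an assumption inferred from the description rather than the PR's actual logic.

```java
import java.util.Arrays;
import java.util.List;

public class RetryableTreeSketch {
    /** Stand-in for the interface introduced by the PR (assumed shape). */
    interface Retryable {
        void checkpoint();
        void restore();
    }

    /** Hypothetical expression node; an expression is a tree of such nodes. */
    static class Expr {
        final String name;
        final boolean selfNonDeterministic; // about this node only, excluding children
        final List<Expr> children;

        Expr(String name, boolean selfNonDeterministic, Expr... children) {
            this.name = name;
            this.selfNonDeterministic = selfNonDeterministic;
            this.children = Arrays.asList(children);
        }

        /** False as soon as this node or any descendant is non-deterministic. */
        boolean deterministic() {
            if (selfNonDeterministic) {
                return false;
            }
            for (Expr child : children) {
                if (!child.deterministic()) {
                    return false;
                }
            }
            return true;
        }

        /** Assumed rule: each node that is itself non-deterministic must implement Retryable. */
        boolean retryable() {
            if (selfNonDeterministic && !(this instanceof Retryable)) {
                return false;
            }
            for (Expr child : children) {
                if (!child.retryable()) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Hypothetical non-deterministic leaf that can save and restore its state. */
    static class RetryableRand extends Expr implements Retryable {
        RetryableRand() {
            super("rand", true);
        }

        @Override public void checkpoint() { /* state saving omitted in this sketch */ }
        @Override public void restore() { /* state restoring omitted in this sketch */ }
    }

    public static void main(String[] args) {
        Expr col = new Expr("col", false);
        Expr good = new Expr("add", false, col, new RetryableRand());
        Expr bad = new Expr("add", false, col, new Expr("opaque_udf", true));

        System.out.println(good.deterministic()); // prints "false"
        System.out.println(good.retryable());     // prints "true"
        System.out.println(bad.retryable());      // prints "false"
    }
}
```

In both trees the root `add` node reports `deterministic()` as false, yet the root itself is not the source of non-determinism. A per-node flag is what lets the walk find the actual offending node and check it for `Retryable`.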