Allow checkpoint and restore on non-deterministic expressions in GpuFilter and GpuProject #9287

Merged
merged 6 commits into from
Sep 28, 2023

Conversation

firestarman
Collaborator

@firestarman firestarman commented Sep 22, 2023

fix #7865

This PR adds support for checkpoint and restore on non-deterministic expressions in GpuFilter and GpuProject.

  • Introduce a new interface named Retryable, written in Java so that Java UDFs from end users can be supported in the future. A non-deterministic expression can implement this interface to make itself retryable.
  • Introduce a new class named RapidsXORShiftRandom to provide access to the internal hashed seed. GpuRand uses it to implement the Retryable interface.
  • Remove the unused CheckpointRestore trait, which is replaced by the new Retryable interface.
  • Update GpuRand to support checkpoint and restore.
  • Introduce two new members in GpuExpression: selfNonDeterministic and retryable. retryable tells whether an expression is retryable, and it covers the expression's children. selfNonDeterministic indicates whether the expression itself, excluding its children, is non-deterministic when its "deterministic" is false. An expression is actually a tree, and deterministic being false only means that at least one node in the tree is non-deterministic. To check whether those nodes implement Retryable, we need to know exactly which nodes are non-deterministic, which is why selfNonDeterministic was added.
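
As an illustration of the last bullet, here is a minimal Java sketch of how a per-node flag can drive a tree-wide retryable check. It is not the plugin's code: Expr, Literal, Add, RetryableRand and OpaqueRand are hypothetical stand-ins for GpuExpression and its subclasses, and the checkpoint/restore method signatures are assumed.

```java
import java.util.List;

// Assumed shape of the Retryable interface (exact signatures are not shown in this PR text).
interface Retryable {
    void checkpoint();
    void restore();
}

// Hypothetical, simplified stand-in for an expression tree node.
abstract class Expr {
    final List<Expr> children;

    Expr(List<Expr> children) {
        this.children = children;
    }

    // True only when this node itself is non-deterministic, ignoring its children.
    abstract boolean selfNonDeterministic();

    // The tree is deterministic only if no node in it is non-deterministic.
    boolean deterministic() {
        return !selfNonDeterministic() && children.stream().allMatch(Expr::deterministic);
    }

    // The tree is retryable when every non-deterministic node implements Retryable.
    boolean retryable() {
        boolean selfOk = !selfNonDeterministic() || this instanceof Retryable;
        return selfOk && children.stream().allMatch(Expr::retryable);
    }
}

final class Literal extends Expr {
    Literal() { super(List.of()); }
    @Override boolean selfNonDeterministic() { return false; }
}

final class Add extends Expr {
    Add(Expr left, Expr right) { super(List.of(left, right)); }
    @Override boolean selfNonDeterministic() { return false; }
}

// Non-deterministic node that can save and restore its state (bodies omitted for brevity).
final class RetryableRand extends Expr implements Retryable {
    RetryableRand() { super(List.of()); }
    @Override boolean selfNonDeterministic() { return true; }
    @Override public void checkpoint() { /* save state here */ }
    @Override public void restore() { /* reset state here */ }
}

// Non-deterministic node that does not implement Retryable.
final class OpaqueRand extends Expr {
    OpaqueRand() { super(List.of()); }
    @Override boolean selfNonDeterministic() { return true; }
}
```

In this sketch, Add(Literal, RetryableRand) is non-deterministic but retryable, while Add(Literal, OpaqueRand) is not retryable. The Add node is not itself the source of non-determinism, so it does not need to implement Retryable.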

TODO

  • add tests

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <firestarmanllc@gmail.com>
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

revans2
revans2 previously approved these changes Sep 25, 2023
Collaborator

@revans2 revans2 left a comment

Really just some nits on the docs, and they are only nits because I am thinking ahead to when we hopefully allow this for GpuUdfs.

package com.nvidia.spark;

/**
* An interface that can be used by Retry framework of RAPIDS Plugin to handle the GPU OOMs.

I think this is too much detail for an end user. Could we adjust it a bit? Perhaps something more like:

An interface that can be used to retry the processing of non-deterministic expressions on the GPU.
GPU memory is a limited resource. When it runs out, the RAPIDS Accelerator for Apache Spark will
use several different strategies to try to free more GPU memory so the query can complete. One of these
strategies is to roll back the processing for one task, pause that task's thread, and then retry
the task when more memory is available. This works transparently for any stateless deterministic
processing. But technically an expression/UDF can be non-deterministic and/or keep state
between calls. This interface provides a checkpoint method to save any needed state, and a
restore method to reset the state in the case of a retry. Please note that a retry is not isolated to
a single expression, so restore can be called even after the expression has returned
one or more batches of results.

Each time checkpoint is called, any previously saved state can be overwritten.
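
To make the checkpoint/restore contract described above concrete, here is a minimal Java sketch. The interface shape is an assumption (the doc text confirms a checkpoint method and a restore method, but not their signatures), and CheckpointedRandom is a hypothetical plain xorshift generator, not RapidsXORShiftRandom or GpuRand.

```java
// Assumed shape of the interface; only the method names come from the doc text above.
interface Retryable {
    void checkpoint();
    void restore();
}

// Hypothetical stateful generator used only to illustrate the contract.
final class CheckpointedRandom implements Retryable {
    private long state;
    private long saved;

    CheckpointedRandom(long seed) {
        // A xorshift generator must not start from zero, so substitute a fixed non-zero value.
        this.state = (seed == 0L) ? 0x9E3779B97F4A7C15L : seed;
        this.saved = this.state;
    }

    // Standard 64-bit xorshift step; each call advances the internal state.
    long nextLong() {
        state ^= state << 13;
        state ^= state >>> 7;
        state ^= state << 17;
        return state;
    }

    // Save the current state, overwriting any previously saved state.
    @Override
    public void checkpoint() {
        saved = state;
    }

    // Reset to the last saved state so a retry replays the same values.
    @Override
    public void restore() {
        state = saved;
    }
}
```

If a caller takes a checkpoint, produces several values (for example across more than one batch), and then calls restore, the generator in this sketch replays exactly the same sequence, which is the property a retry needs for the results to stay consistent.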

Collaborator Author

@firestarman firestarman Sep 26, 2023

sure, done

*/
public interface Retryable {
/**
* Save the state, so it can be restored in case of an OOM Retry.

nit: can we drop the OOM here?

Save the state so it can be restored in the case of a retry.

Collaborator Author

@firestarman firestarman Sep 26, 2023

sure, done

@sameerz sameerz added the reliability Features to improve reliability or bugs that severely impact the reliability of the plugin label Sep 25, 2023
Signed-off-by: Firestarman <firestarmanllc@gmail.com>
@firestarman
Collaborator Author

build

Collaborator

@revans2 revans2 left a comment

Looks good

@firestarman firestarman merged commit 2db7561 into NVIDIA:branch-23.10 Sep 28, 2023
29 checks passed
@firestarman firestarman deleted the retry-non-deterministic branch September 28, 2023 01:53
Labels
reliability Features to improve reliability or bugs that severely impact the reliability of the plugin
Development

Successfully merging this pull request may close these issues.

[FEA] Allow checkpoint and restart on non-deterministic expressions
3 participants