Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error message after failed RMM shutdown #2080

Merged

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Apr 5, 2021

Exceptions during GpuDeviceManager.shutdown can lead to GpuDeviceManager and RapidsBufferCatalog being left in an inconsistent state which results in NullPointerException when accessing uninitialized state.

This PR changes the singletonMemoryInitialized flag from a boolean to a MemoryState data type with possible values of Initialized, Uninitialized, and Uninitializable, with appropriate error checking added.

Signed-off-by: Andy Grove andygrove@nvidia.com

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added the bug Something isn't working label Apr 5, 2021
@andygrove andygrove added this to the Mar 30 - Apr 9 milestone Apr 5, 2021
@andygrove andygrove self-assigned this Apr 5, 2021
@andygrove andygrove linked an issue Apr 6, 2021 that may be closed by this pull request
@andygrove andygrove changed the title WIP: Improve error message after failed RMM shutdown Improve error message after failed RMM shutdown Apr 6, 2021
@andygrove andygrove marked this pull request as ready for review April 6, 2021 13:45
abellina
abellina previously approved these changes Apr 6, 2021
Copy link
Collaborator

@abellina abellina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @andygrove, tiny nit.

case object Uninitialized extends MemoryState
case object Uninitializable extends MemoryState


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra space nit.

Signed-off-by: Andy Grove <andygrove@nvidia.com>
andygrove and others added 3 commits April 6, 2021 08:11
…ager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…ager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>
@abellina
Copy link
Collaborator

abellina commented Apr 6, 2021

build

@andygrove andygrove merged commit a79b0c8 into NVIDIA:branch-0.5 Apr 6, 2021
@andygrove andygrove deleted the improve-rmm-shutdown-error-handling branch April 6, 2021 20:36
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Improve error message after failed RMM shutdown

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Remove blank line

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Rename Uninitializable to Errored

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Improve error message after failed RMM shutdown

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Remove blank line

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>

* Rename Uninitializable to Errored

Signed-off-by: Andy Grove <andygrove@nvidia.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Intermittent NPE in RapidsBufferCatalog when running test suite
3 participants