
[FEA] Stop running task attempts on executors that encounter "sticky" CUDA errors #5029

Closed
jlowe opened this issue Mar 23, 2022 · 5 comments · Fixed by #5350
Labels
P0 (Must have for release), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin)

Comments

@jlowe
Member

jlowe commented Mar 23, 2022

Is your feature request related to a problem? Please describe.
Certain CUDA errors, like illegal memory access, are "sticky," meaning that all CUDA operations to the GPU after the error will continue to return the same error over and over. No GPU operations will succeed after that point.

Describe the solution you'd like
The RAPIDS Accelerator should take measures to prevent further task execution on the executor once these "sticky" exceptions are detected. Tearing down the executor process is probably the best option, at least in the short term. Without an external shuffle handler we will lose the shuffle output of tasks that have already completed, but this is probably a better way to "fail fast" than allowing the executor to keep accepting new tasks only to have them fail the first time they touch the GPU.

@jlowe jlowe added the feature request (New feature or request) and ? - Needs Triage (Need team to review and classify) labels Mar 23, 2022
@sperlingxx
Collaborator

sperlingxx commented Mar 25, 2022

Hi @jlowe, I have a rough idea on this issue: failing fast through TaskFailureListener.

  1. Add a TaskFailureListener at the entries of GPU processing, such as GpuScan, GpuColumnarToRow, and GpuShuffleCoalesce. Skip adding it if the current task context already includes one.
  2. The listener analyzes the error and calls System.exit if the error stack contains any "sticky" CUDA error. We can add a config for the max depth of the error stack to search for "sticky" errors, just like spark.executor.killOnFatalError.depth. (A rough sketch follows below.)
  3. We also need to list the common "sticky" errors and figure out how to capture them.
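
A minimal sketch of this idea, assuming a hypothetical isStickyCudaError helper and an illustrative message check; the real classification would depend on the agreed-upon list of sticky errors:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskFailureListener

object FailFastOnStickyCudaError {
  // Walk the cause chain up to maxDepth, looking for a "sticky" CUDA error.
  // The message-based check below is illustrative only.
  private def isStickyCudaError(t: Throwable, maxDepth: Int): Boolean = {
    var cur = t
    var depth = 0
    while (cur != null && depth < maxDepth) {
      val msg = cur.getMessage
      if (msg != null &&
          (msg.contains("cudaErrorIllegalAddress") || msg.contains("cudaErrorUnknown"))) {
        return true
      }
      cur = cur.getCause
      depth += 1
    }
    false
  }

  // Called at the entry points of GPU processing, once per task attempt.
  def install(maxDepth: Int): Unit = {
    Option(TaskContext.get()).foreach { ctx =>
      ctx.addTaskFailureListener(new TaskFailureListener {
        override def onTaskFailure(context: TaskContext, error: Throwable): Unit = {
          if (isStickyCudaError(error, maxDepth)) {
            // A sticky CUDA error poisons the whole process, so tear down the executor.
            System.exit(1)
          }
        }
      })
    }
  }
}
```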

@jlowe
Member Author

jlowe commented Mar 25, 2022

I think using the ExecutorPlugin.onTaskFailed interface would be a cleaner approach, as it avoids needing to ensure we install a listener in every task attempt. This API was added in Spark 3.1.0, which is now our minimum supported version. Our ExecutorPlugin implementation could override this function and match on the TaskFailedReason parameter to handle ExceptionFailure instances, which can provide the Throwable that killed the task. We can then walk the chain of exceptions looking for specific exception types (and possibly messages within those types to further discriminate). A rough sketch is at the end of this comment.

Ideally we should update the cudf bindings to throw a different type of exception for these sticky exceptions which will make them easier to classify in Java/Scala code. There's centralized code in the cudf Java bindings where this mapping can take place (i.e.: the CATCH_STD macro and underlying utility methods).
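
If the bindings do throw a dedicated type, the classification collapses to a simple instanceof walk over the cause chain. A sketch, assuming a hypothetical fatal-exception class in the cuDF Java bindings:

```scala
object StickyErrorCheck {
  // Sketch only: assumes a dedicated exception type for sticky errors in the
  // cuDF Java bindings (named CudaFatalException here for illustration).
  def isStickyCudaError(t: Throwable): Boolean =
    Iterator.iterate(t)(_.getCause)
      .takeWhile(_ != null)
      .exists(_.isInstanceOf[ai.rapids.cudf.CudaFatalException])
}
```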

As for which errors are "sticky", that should be driven primarily by the CUDA documentation on CUDA error codes. For example, any error would be considered "sticky" if its description contains this text:

This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

I would also add CudaErrorUnknown to that list to be on the safe side.
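
A hedged sketch of the plugin-based approach (the class name, exit code, and message-based matching below are illustrative only, not the final implementation):

```scala
import org.apache.spark.{ExceptionFailure, TaskFailedReason}
import org.apache.spark.api.plugin.ExecutorPlugin

class StickyErrorExecutorPlugin extends ExecutorPlugin {
  // Illustrative, non-exhaustive set of error names; the real list would come
  // from the CUDA documentation wording described above, plus cudaErrorUnknown.
  private val stickyErrorNames = Set("cudaErrorIllegalAddress", "cudaErrorUnknown")

  override def onTaskFailed(failureReason: TaskFailedReason): Unit = {
    failureReason match {
      case ef: ExceptionFailure =>
        // ExceptionFailure can carry the Throwable that killed the task.
        ef.exception.foreach { t =>
          if (containsStickyCudaError(t)) {
            // No further GPU work can succeed in this process, so exit and let
            // the cluster manager replace the executor.
            System.exit(20)
          }
        }
      case _ => // non-exception failures are not GPU related
    }
  }

  // Walk the cause chain looking for a message naming a sticky CUDA error.
  private def containsStickyCudaError(t: Throwable): Boolean = {
    Iterator.iterate(t)(_.getCause)
      .takeWhile(_ != null)
      .exists { e =>
        Option(e.getMessage).exists(msg => stickyErrorNames.exists(name => msg.contains(name)))
      }
  }
}
```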

@sperlingxx sperlingxx self-assigned this Mar 28, 2022
@sameerz sameerz removed the ? - Needs Triage (Need team to review and classify) label Mar 29, 2022
@tgravescs
Collaborator

may also want to add cudaErrorECCUncorrectable to that list.

@tgravescs
Collaborator

A couple of examples of the Exceptions:

22/03/14 06:00:57 ERROR Executor: Exception in task 2195.0 in stage 7.0 (TID 5960)
ai.rapids.cudf.CudfException: CUDA error encountered at: ../src/io/utilities/hostdevice_vector.hpp:57: 999 cudaErrorUnknown unknown error
        at ai.rapids.cudf.Table.readParquet(Native Method)
        at ai.rapids.cudf.Table.readParquet(Table.java:1006)
        at com.nvidia.spark.rapids.ParquetPartitionReader.$anonfun$readToTable$1(GpuParquetScanBase.scala:1535)
22/03/28 14:00:05 ERROR Executor: Exception in task 59755.0 in stage 6.0 (TID 62795)
ai.rapids.cudf.CudfException: CUDA error encountered at: ../src/io/utilities/hostdevice_vector.hpp:57: 214 cudaErrorECCUncorrectable uncorrectable ECC error encountered
        at ai.rapids.cudf.Table.readParquet(Native Method)
        at ai.rapids.cudf.Table.readParquet(Table.java:1006)
        at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToTable$4(GpuParquetScanBase.scala:1397)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)
        at com.nvidia.spark.rapids.MultiFileCloudParquetPartitionReader.$anonfun$readBufferToTable$3(GpuParquetScanBase.scala:1396)
        at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
        at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
        at com.nvidia.spark.rapids.FilePartitionReaderBase.withResource(GpuMultiFileReader.scala:236)

@revans2 revans2 added the P0 (Must have for release) and reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin) labels Apr 12, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Apr 24, 2022
This PR is for NVIDIA/spark-rapids#5029 and NVIDIA/spark-rapids#1870, enabling the cuDF JNI to throw CUDA errors that carry their specific error codes. This PR relies on #10630, which exposes the CUDA error code and distinguishes fatal CUDA errors from the others.

With this improvement, it should be easier to track CUDA errors triggered through the JVM APIs.

Authors:
  - Alfred Xu (https://github.com/sperlingxx)

Approvers:
  - Jason Lowe (https://github.com/jlowe)

URL: #10551
@sameerz sameerz added this to the May 2 - May 20 milestone Apr 29, 2022
@sameerz
Collaborator

sameerz commented Jun 1, 2022

Moving to 22.08 as there are cudf dependencies that will be in 22.08

@sameerz sameerz removed the feature request (New feature or request) label Jun 3, 2022
@sameerz sameerz modified the milestones: May 23 - Jun 3, Jun 6 - Jun 17 Jun 8, 2022
sperlingxx added a commit that referenced this issue Jun 13, 2022
Closes #5029

Detects unrecoverable (fatal) CUDA errors through the cuDF utility, which applies a more comprehensive check to determine whether a CUDA error is fatal or not.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

Co-authored-by: Jason Lowe <jlowe@nvidia.com>