
On task failure catch some CUDA exceptions and kill executor [databricks] #5118

Merged
merged 10 commits into NVIDIA:branch-22.04 from catchCudaException
Apr 1, 2022

Conversation

@tgravescs (Collaborator) commented Mar 31, 2022

Related to #5029. This is a shorter-term solution that just parses the exception message to catch certain types of unrecoverable CUDA errors. It may not be bulletproof, as the messages could change.

Here, if we find an exception that we think is unrecoverable, we call System.exit to kill the executor.
Generally you would want to use this together with the Spark excludeOnFailure functionality so that Spark doesn't start the executor back up using the same GPU.

I've manually tested this by faking the exception, since we can't reproduce the real error on demand. The plugin properly kills the executor when it sees the exception.
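A minimal sketch of the message-parsing approach described above (the object name and the error list here are illustrative assumptions, not the actual plugin code): walk the exception cause chain and check each message against a set of known-fatal cudaError strings.

```scala
// Illustrative sketch only; the real plugin's list and structure differ.
object FatalCudaErrorCheck {
  // Assumed subset of cudaError names treated as unrecoverable.
  private val fatalCudaMessages = Seq(
    "cudaErrorHardwareStackError",
    "cudaErrorIllegalInstruction",
    "cudaErrorECCUncorrectable")

  // Walk the cause chain, since the CUDA error text may be wrapped
  // in a RuntimeException as in the sample code below.
  def isFatalCudaError(t: Throwable): Boolean = {
    var cur: Throwable = t
    while (cur != null) {
      val msg = cur.getMessage
      if (msg != null && fatalCudaMessages.exists(msg.contains(_))) {
        return true
      }
      cur = cur.getCause
    }
    false
  }
}
```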

Sample code used to cause failures:

sc.range(0, 4, 1, 4).mapPartitions { x =>
  import org.apache.spark.TaskContext
  val tc = TaskContext.get()
  println("task id: " + tc.taskAttemptId)

  if (tc.taskAttemptId % 3 == 0) {
    try {
      throw new Exception("cudaErrorHardwareStackError")
    } catch {
      case e: Throwable =>
        throw new RuntimeException(s"CUDA error encountered: ${e.getMessage}", e)
    }
  }
  x.map(x => x)
}.collect()

this generates exceptions like:

22/03/31 22:46:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.RuntimeException: CUDA error encountered: cudaErrorHardwareStackError

This new executor plugin code catches and logs the following and then exits:

22/03/31 22:46:57 ERROR RapidsExecutorPlugin: Stopping the Executor based on exception being a fatal CUDA error: java.lang.RuntimeException: CUDA error encountered: cudaErrorHardwareStackError
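The mechanism can be sketched roughly as follows. In Spark 3.1+ an executor plugin can override `onTaskFailed` on `org.apache.spark.api.plugin.ExecutorPlugin`; the stub types below stand in for Spark's `TaskFailedReason`/`ExceptionFailure` so the sketch is self-contained, and the string check is deliberately simplified relative to the real plugin:

```scala
// Self-contained sketch: stubbed stand-ins for Spark's TaskFailedReason /
// ExceptionFailure, plus a simplified "is this a fatal CUDA error" check.
sealed trait TaskFailedReasonStub
case class ExceptionFailureStub(exception: Option[Throwable]) extends TaskFailedReasonStub
case object OtherFailureStub extends TaskFailedReasonStub

object ExecutorExitPolicy {
  // Decide whether the task failure should terminate the executor JVM.
  def shouldExit(reason: TaskFailedReasonStub): Boolean = reason match {
    case ExceptionFailureStub(Some(e)) =>
      val msg = e.getMessage
      msg != null && msg.contains("cudaError")
    case _ => false
  }

  // In the real plugin this logic runs inside onTaskFailed; System.exit
  // lets the cluster manager (with excludeOnFailure) decide where, or
  // whether, to restart the executor.
  def onTaskFailed(reason: TaskFailedReasonStub): Unit = {
    if (shouldExit(reason)) {
      // System.exit(1)  // left commented so the sketch is safe to run
      println("would stop the executor: fatal CUDA error")
    }
  }
}
```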

In standalone mode with the excludeOnFailure Spark configs set to 1 for node exclusion, when the task fails and this plugin kills the executor, the node will be excluded and the worker will not be able to restart an executor on that node. Also keep in mind the config spark.excludeOnFailure.timeout, which makes Spark retry that node after the timeout expires.
Without excludeOnFailure, in standalone mode the executors just get restarted on the same nodes. I tested on YARN as well; there it also restarts executors, but possibly on different nodes depending on the size of the cluster.
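For reference, a plausible standalone-mode configuration for the scenario above. The PR doesn't list the exact properties used in the test, so treat these names and values as illustrative of the Spark 3.1+ excludeOnFailure settings rather than the tested configuration:

```properties
# Enable exclusion of executors/nodes on task failure (assumed settings).
spark.excludeOnFailure.enabled=true
# Exclude a node after a single failed executor on it.
spark.excludeOnFailure.application.maxFailedExecutorsPerNode=1
# After this timeout Spark will try the excluded node again.
spark.excludeOnFailure.timeout=1h
```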

@tgravescs tgravescs added the bug Something isn't working label Mar 31, 2022
@tgravescs tgravescs added this to the Mar 21 - Apr 1 milestone Mar 31, 2022
@tgravescs tgravescs self-assigned this Mar 31, 2022
@tgravescs (Collaborator, Author) commented: build

@tgravescs (Collaborator, Author) commented: build
@tgravescs tgravescs merged commit 207fbfc into NVIDIA:branch-22.04 Apr 1, 2022
@tgravescs tgravescs deleted the catchCudaException branch April 1, 2022 17:50