Simplified handling of GPU core dumps (#9238)
* Simplified handling of GPU core dumps

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* scalastyle fix

* Fix config visibility

* Wait for in-progress GPU core dumps when shutting down due to fatal error

* Move dump messages to API module

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Add dev documentation for GPU core dumps

* Add TODO for leveraging CUDA 12.1 core dump APIs

* Add GPU core dump failure log on driver

---------

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
jlowe authored Oct 4, 2023
1 parent d340f2e commit 54f5073
Showing 7 changed files with 575 additions and 0 deletions.
docs/dev/gpu-core-dumps.md (89 additions, 0 deletions)
@@ -0,0 +1,89 @@
---
layout: page
title: GPU Core Dumps
nav_order: 9
parent: Developer Overview
---
# GPU Core Dumps

## Overview

When the GPU segfaults and generates an illegal access exception, it can be difficult to know
what the GPU was doing at the time of the exception. GPU operations execute asynchronously, so what
the CPU was doing at the time the GPU exception was noticed often has little to do with what
triggered the exception. GPU core dumps can provide useful clues when debugging these errors, as
they contain the state of the GPU at the time the exception occurred on the GPU.

The GPU driver can be configured, via environment variable settings for the process, to write a
GPU core dump when the GPU segfaults. The challenges for the RAPIDS Accelerator use case are
getting the environment variables set on the executor processes and then copying the GPU core
dump file to a distributed filesystem after the GPU driver writes it to the local filesystem.

## Environment Variables

The following environment variables are useful for controlling GPU core dumps. See the
[GPU core dump support section of the CUDA-GDB documentation](https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-core-dump-support)
for more details.

### `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`

Set to `1` to trigger a GPU core dump on a GPU exception.

### `CUDA_COREDUMP_FILE`

The filename to use for the GPU core dump file. Relative paths are resolved against the process's
current working directory. The pattern `%h` in the filename will be expanded to the hostname, and
the pattern `%p` will be expanded to the process ID. If the filename corresponds to a named pipe,
the GPU driver will write the GPU core dump data to the named pipe.
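
For example, a setting along the following lines (the path, hostname, and process ID here are
purely illustrative) would produce one core dump file per process:

```text
CUDA_COREDUMP_FILE=/tmp/gpucore-%h-%p.nvcudmp
# On host "node1" with process ID 12345 this expands to:
#   /tmp/gpucore-node1-12345.nvcudmp
```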

### `CUDA_ENABLE_LIGHTWEIGHT_COREDUMP`

Set to `1` to generate a lightweight core dump that omits the local, shared, and global memory
dumps. Disabled by default. Lightweight core dumps still show the code location that triggered
the exception, so they can be a good option when one only needs to know which kernels were
running at the time of the exception and which one triggered it.

### `CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION`

Set to `0` to prevent the GPU driver from causing a CPU core dump of the process after the GPU
core dump is written. Enabled by default.

### `CUDA_COREDUMP_SHOW_PROGRESS`

Set to `1` to print progress messages to the process stderr as the GPU core dump is generated.
This is only supported on newer GPU drivers (e.g., those that are CUDA 12 compatible).
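
As a quick illustration outside of Spark, these variables can be set directly on any CUDA process
to capture a dump locally. The application name and output path below are only examples:

```text
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1 \
CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1 \
CUDA_COREDUMP_SHOW_PROGRESS=1 \
CUDA_COREDUMP_FILE=/tmp/gpucore-%h-%p.nvcudmp \
./my-cuda-app
```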

## YARN Log Aggregation

The log aggregation feature of YARN can be leveraged to copy GPU core dumps to the same place that
YARN collects container logs. When enabled, YARN will collect all files in a container's log
directory to a distributed filesystem location. YARN automatically expands the pattern `<LOG_DIR>`
in a container's environment variables to the container's log directory, which is useful when
configuring `CUDA_COREDUMP_FILE` to place the GPU core dump in the appropriate place for log
aggregation. Note that YARN log aggregation may be configured with relatively low file size limits
that can interfere with successful collection of large GPU core dump files.

The following Spark configuration settings will enable GPU lightweight core dumps and have the
core dump files placed in the container log directory:

```text
spark.executorEnv.CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
spark.executorEnv.CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1
spark.executorEnv.CUDA_COREDUMP_FILE="<LOG_DIR>/executor-%h-%p.nvcudmp"
```

## Simplified Core Dump Handling

There is rudimentary support for simplified setup of GPU core dumps in the RAPIDS Accelerator.
This currently only works on Spark standalone clusters, since there is no way for a driver plugin
to programmatically override executor environment variable settings for Spark-on-YARN or
Spark-on-Kubernetes. In the future, with a GPU driver that is compatible with CUDA 12.1 or later,
the RAPIDS Accelerator could leverage GPU driver APIs to programmatically configure GPU core dump
support on executor startup.

To enable the simplified core dump handling, set `spark.rapids.gpu.coreDump.dir` to a directory to
use for GPU core dumps. Distributed filesystem URIs are supported. This leverages named pipes and
background threads to copy the GPU core dump data to the distributed filesystem. Note that anything
that causes early, abrupt termination of the process, such as throwing from a C++ destructor, will
often kill the process before the dump write can complete. These abrupt terminations should be
fixed when discovered.
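
For example, a configuration along these lines (the filesystem URI is only an example) enables the
simplified handling and sends completed dumps to a distributed filesystem location:

```text
spark.rapids.gpu.coreDump.dir=hdfs:///data/gpu-core-dumps
```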

GpuCoreDumpMsg.scala (new file)
@@ -0,0 +1,27 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.nvidia.spark.rapids

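/** Base trait for messages sent from an executor to the driver about a GPU core dump */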
trait GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump starts */
case class GpuCoreDumpMsgStart(executorId: String, dumpPath: String) extends GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump completes */
case class GpuCoreDumpMsgCompleted(executorId: String, dumpPath: String) extends GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump fails */
case class GpuCoreDumpMsgFailed(executorId: String, error: String) extends GpuCoreDumpMsg

GpuCoreDumpHandler.scala (new file)
@@ -0,0 +1,194 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.nvidia.spark.rapids

import java.io.{File, PrintWriter}
import java.lang.management.ManagementFactory
import java.nio.file.Files
import java.util.concurrent.{Executors, ExecutorService, TimeUnit}

import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
import com.nvidia.spark.rapids.shims.NullOutputStreamShim
import org.apache.commons.io.IOUtils
import org.apache.commons.io.output.StringBuilderWriter
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.PluginContext
import org.apache.spark.internal.Logging
import org.apache.spark.io.CompressionCodec
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.rapids.execution.TrampolineUtil
import org.apache.spark.util.SerializableConfiguration

object GpuCoreDumpHandler extends Logging {
  private var executor: Option[ExecutorService] = None
  private var dumpedPath: Option[String] = None
  private var namedPipeFile: File = _
  private var isDumping: Boolean = false

  /**
   * Configures the executor launch environment for GPU core dumps, if applicable.
   * Should only be called from the driver on driver startup.
   */
  def driverInit(sc: SparkContext, conf: RapidsConf): Unit = {
    // This only works in practice on Spark standalone clusters. It's too late to influence the
    // executor environment for Spark-on-YARN or Spark-on-k8s.
    // TODO: Leverage CUDA 12.1 core dump APIs in the executor to programmatically set this up
    //       on executor startup. https://github.com/NVIDIA/spark-rapids/issues/9370
    conf.gpuCoreDumpDir.foreach { _ =>
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_COREDUMP_ON_EXCEPTION", "1")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION", "0")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_LIGHTWEIGHT_COREDUMP",
        if (conf.isGpuCoreDumpFull) "0" else "1")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_FILE", conf.gpuCoreDumpPipePattern)
      TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_SHOW_PROGRESS", "1")
    }
  }

  /**
   * Sets up the GPU core dump background copy thread, if applicable.
   * Should only be called from the executor on executor startup.
   */
  def executorInit(rapidsConf: RapidsConf, pluginCtx: PluginContext): Unit = {
    rapidsConf.gpuCoreDumpDir.foreach { dumpDir =>
      namedPipeFile = createNamedPipe(rapidsConf)
      executor = Some(Executors.newSingleThreadExecutor(new ThreadFactoryBuilder()
        .setNameFormat("gpu-core-copier")
        .setDaemon(true)
        .build()))
      executor.foreach { exec =>
        val codec = if (rapidsConf.isGpuCoreDumpCompressed) {
          Some(TrampolineUtil.createCodec(pluginCtx.conf(),
            rapidsConf.gpuCoreDumpCompressionCodec))
        } else {
          None
        }
        val suffix = codec.map { c =>
          "." + TrampolineUtil.getCodecShortName(c.getClass.getName)
        }.getOrElse("")
        exec.submit(new Runnable {
          override def run(): Unit = {
            try {
              copyLoop(pluginCtx, namedPipeFile, new Path(dumpDir), codec, suffix)
            } catch {
              case _: InterruptedException => logInfo("Stopping GPU core dump copy thread")
              case t: Throwable => logWarning("Error in GPU core dump copy thread", t)
            }
          }
        })
      }
    }
  }

  /**
   * Wait for a GPU dump in progress, if any, to complete.
   * @param timeoutSecs maximum amount of time to wait before returning
   * @return true if no dump was in progress or the dump completed before the timeout,
   *         false if the wait timed out
   */
  def waitForDump(timeoutSecs: Int): Boolean = {
    val endTime = System.nanoTime + TimeUnit.SECONDS.toNanos(timeoutSecs)
    while (isDumping && System.nanoTime < endTime) {
      Thread.sleep(10)
    }
    System.nanoTime < endTime
  }

  def shutdown(): Unit = {
    executor.foreach { exec =>
      exec.shutdownNow()
      executor = None
      namedPipeFile.delete()
      namedPipeFile = null
    }
  }

  def handleMsg(msg: GpuCoreDumpMsg): AnyRef = msg match {
    case GpuCoreDumpMsgStart(executorId, dumpPath) =>
      logError(s"Executor $executorId starting a GPU core dump to $dumpPath")
      val spark = SparkSession.active
      new SerializableConfiguration(spark.sparkContext.hadoopConfiguration)
    case GpuCoreDumpMsgCompleted(executorId, dumpPath) =>
      logError(s"Executor $executorId wrote a GPU core dump to $dumpPath")
      null
    case GpuCoreDumpMsgFailed(executorId, error) =>
      logError(s"Executor $executorId failed to write a GPU core dump: $error")
      null
    case m =>
      throw new IllegalStateException(s"Unexpected GPU core dump msg: $m")
  }

  // visible for testing
  def getNamedPipeFile: File = namedPipeFile

  private def createNamedPipe(conf: RapidsConf): File = {
    val processName = ManagementFactory.getRuntimeMXBean.getName
    val pidstr = processName.substring(0, processName.indexOf("@"))
    val pipePath = conf.gpuCoreDumpPipePattern.replace("%p", pidstr)
    val pipeFile = new File(pipePath)
    val mkFifoProcess = Runtime.getRuntime.exec(Array("mkfifo", "-m", "600", pipeFile.toString))
    require(mkFifoProcess.waitFor(10, TimeUnit.SECONDS), "mkfifo timed out")
    pipeFile.deleteOnExit()
    pipeFile
  }

  private def copyLoop(
      pluginCtx: PluginContext,
      namedPipe: File,
      dumpDirPath: Path,
      codec: Option[CompressionCodec],
      suffix: String): Unit = {
    val executorId = pluginCtx.executorID()
    try {
      logInfo(s"Monitoring ${namedPipe.getAbsolutePath} for GPU core dumps")
      withResource(new java.io.FileInputStream(namedPipe)) { in =>
        isDumping = true
        val appId = pluginCtx.conf.get("spark.app.id")
        val dumpPath = new Path(dumpDirPath,
          s"gpucore-$appId-$executorId.nvcudmp$suffix")
        logError(s"Generating GPU core dump at $dumpPath")
        val hadoopConf = pluginCtx.ask(GpuCoreDumpMsgStart(executorId, dumpPath.toString))
          .asInstanceOf[SerializableConfiguration].value
        val dumpFs = dumpPath.getFileSystem(hadoopConf)
        val bufferSize = hadoopConf.getInt("io.file.buffer.size", 4096)
        val perms = new FsPermission(FsAction.READ_WRITE, FsAction.NONE, FsAction.NONE)
        val fsOut = dumpFs.create(dumpPath, perms, false, bufferSize,
          dumpFs.getDefaultReplication(dumpPath), dumpFs.getDefaultBlockSize(dumpPath), null)
        val out = closeOnExcept(fsOut) { _ =>
          codec.map(_.compressedOutputStream(fsOut)).getOrElse(fsOut)
        }
        withResource(out) { _ =>
          IOUtils.copy(in, out)
        }
        dumpedPath = Some(dumpPath.toString)
        pluginCtx.send(GpuCoreDumpMsgCompleted(executorId, dumpedPath.get))
      }
    } catch {
      case e: Exception =>
        logError("Error copying GPU dump", e)
        val writer = new StringBuilderWriter()
        e.printStackTrace(new PrintWriter(writer))
        pluginCtx.send(GpuCoreDumpMsgFailed(executorId, s"$e\n${writer.toString}"))
    } finally {
      isDumping = false
    }
    // Always drain the pipe to avoid blocking the thread that triggers the coredump
    while (namedPipe.exists()) {
      Files.copy(namedPipe.toPath, NullOutputStreamShim.INSTANCE)
    }
  }
}
@@ -269,6 +269,7 @@ class RapidsDriverPlugin extends DriverPlugin with Logging {
s"Rpc message $msg received, but shuffle heartbeat manager not configured.")
}
rapidsShuffleHeartbeatManager.executorHeartbeat(id)
case m: GpuCoreDumpMsg => GpuCoreDumpHandler.handleMsg(m)
case m => throw new IllegalStateException(s"Unknown message $m")
}
}
@@ -279,6 +280,7 @@
RapidsPluginUtils.fixupConfigsOnDriver(sparkConf)
val conf = new RapidsConf(sparkConf)
RapidsPluginUtils.logPluginMode(conf)
GpuCoreDumpHandler.driverInit(sc, conf)

if (GpuShuffleEnv.isRapidsShuffleAvailable(conf)) {
GpuShuffleEnv.initShuffleManager()
@@ -351,6 +353,8 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
}
}

GpuCoreDumpHandler.executorInit(conf, pluginContext)

// we rely on the Rapids Plugin being run with 1 GPU per executor so we can initialize
// on executor startup.
if (!GpuDeviceManager.rmmTaskInitEnabled) {
@@ -475,6 +479,7 @@ class RapidsExecutorPlugin extends ExecutorPlugin with Logging {
Option(rapidsShuffleHeartbeatEndpoint).foreach(_.close())
extraExecutorPlugins.foreach(_.shutdown())
FileCache.shutdown()
GpuCoreDumpHandler.shutdown()
}

override def onTaskFailed(failureReason: TaskFailedReason): Unit = {
@@ -487,6 +492,7 @@
case Some(e) if containsCudaFatalException(e) =>
logError("Stopping the Executor based on exception being a fatal CUDA error: " +
s"${ef.toErrorString}")
GpuCoreDumpHandler.waitForDump(timeoutSecs = 60)
logGpuDebugInfoAndExit(systemExitCode = 20)
case Some(_: CudaException) =>
logDebug(s"Executor onTaskFailed because of a non-fatal CUDA error: " +
