Simplified handling of GPU core dumps (#9238)
* Simplified handling of GPU core dumps

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* scalastyle fix
* Fix config visibility
* Wait for in-progress GPU core dumps when shutting down due to fatal error
* Move dump messages to API module

Signed-off-by: Jason Lowe <jlowe@nvidia.com>

* Add dev documentation for GPU core dumps
* Add TODO for leveraging CUDA 12.1 core dump APIs
* Add GPU core dump failure log on driver

---------

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Showing 7 changed files with 575 additions and 0 deletions.
@@ -0,0 +1,89 @@
---
layout: page
title: GPU Core Dumps
nav_order: 9
parent: Developer Overview
---
# GPU Core Dumps

## Overview

When the GPU segfaults and generates an illegal access exception, it can be difficult to know
what the GPU was doing at the time of the exception. GPU operations execute asynchronously, so what
the CPU was doing at the time the GPU exception was noticed often has little to do with what
triggered the exception. GPU core dumps can provide useful clues when debugging these errors, as
they contain the state of the GPU at the time the exception occurred.

The GPU driver can be configured to write a GPU core dump when the GPU segfaults via environment
variable settings for the process. The challenges for the RAPIDS Accelerator use case are getting
the environment variables set on the executor processes and then copying the GPU core dump file
to a distributed filesystem after it is generated on the local filesystem by the GPU driver.

## Environment Variables

The following environment variables are useful for controlling GPU core dumps. See the
[GPU core dump support section of the CUDA-GDB documentation](https://docs.nvidia.com/cuda/cuda-gdb/index.html#gpu-core-dump-support)
for more details.

### `CUDA_ENABLE_COREDUMP_ON_EXCEPTION`

Set to `1` to trigger a GPU core dump on a GPU exception.

### `CUDA_COREDUMP_FILE`

The filename to use for the GPU core dump file. Paths relative to the process's current working
directory are supported. The pattern `%h` in the filename will be expanded to the hostname, and
the pattern `%p` will be expanded to the process ID. If the filename corresponds to a named pipe,
the GPU core dump data will be written to the named pipe by the GPU driver.

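The `%h` and `%p` expansion described above can be mimicked on the JVM side when a process needs to predict the filename the GPU driver will use. The following is a minimal sketch, not code from the RAPIDS Accelerator: the object name `CoreDumpFilePattern` and the `expand` helper are hypothetical, and it assumes the expansion is a plain textual substitution.

```scala
object CoreDumpFilePattern {
  /**
   * Replaces %h with the given hostname and %p with the given process ID.
   * Assumption: the GPU driver performs a plain textual substitution.
   */
  def expand(pattern: String, hostname: String, pid: Long): String =
    pattern.replace("%h", hostname).replace("%p", pid.toString)

  def main(args: Array[String]): Unit = {
    // Fixed example values; a real caller would look up its own hostname and PID.
    println(expand("executor-%h-%p.nvcudmp", "worker-1", 4242L))
    // prints "executor-worker-1-4242.nvcudmp"
  }
}
```
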
### `CUDA_ENABLE_LIGHTWEIGHT_COREDUMP`

Set to `1` to generate a lightweight core dump that omits the local, shared, and global memory
dumps. Disabled by default. Lightweight core dumps still show the code location that triggered
the exception and therefore can be a good option when one only needs to know which kernel(s) were
running at the time of the exception and which one triggered the exception.

### `CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION`

Set to `0` to prevent the GPU driver from causing a CPU core dump of the process after the GPU
core dump is written. Enabled by default.

### `CUDA_COREDUMP_SHOW_PROGRESS`

Set to `1` to print progress messages to the process stderr as the GPU core dump is generated. This
is only supported on newer GPU drivers (e.g., those that are compatible with CUDA 12).

## YARN Log Aggregation

The log aggregation feature of YARN can be leveraged to copy GPU core dumps to the same place that
YARN collects container logs. When enabled, YARN will copy all files in a container's log
directory to a distributed filesystem location. YARN will automatically expand the pattern
`<LOG_DIR>` in a container's environment variables to the container's log directory, which is useful
when configuring `CUDA_COREDUMP_FILE` to place the GPU core dump in the appropriate place for
log aggregation. Note that YARN log aggregation may be configured to have relatively low file size
limits, which may interfere with successful collection of large GPU core dump files.

The following Spark configuration settings will enable lightweight GPU core dumps and have the
core dump files placed in the container log directory:

```text
spark.executorEnv.CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
spark.executorEnv.CUDA_ENABLE_LIGHTWEIGHT_COREDUMP=1
spark.executorEnv.CUDA_COREDUMP_FILE="<LOG_DIR>/executor-%h-%p.nvcudmp"
```

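If these settings are assembled programmatically, for example by a job launcher, they can be kept as key/value pairs and rendered as `spark-submit` arguments. This is only a sketch: `YarnCoreDumpConf` and `toSubmitArgs` are hypothetical names, and the values are copied from the settings listed above. Passing each argument as a separate array element avoids shell quoting problems with `<LOG_DIR>`.

```scala
object YarnCoreDumpConf {
  // Settings copied from the section above. The spark.executorEnv. prefix is how
  // Spark passes environment variables to executor processes.
  val settings: Seq[(String, String)] = Seq(
    "spark.executorEnv.CUDA_ENABLE_COREDUMP_ON_EXCEPTION" -> "1",
    "spark.executorEnv.CUDA_ENABLE_LIGHTWEIGHT_COREDUMP" -> "1",
    "spark.executorEnv.CUDA_COREDUMP_FILE" -> "<LOG_DIR>/executor-%h-%p.nvcudmp")

  /** Renders each setting as a pair of spark-submit arguments: --conf key=value */
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }

  def main(args: Array[String]): Unit =
    toSubmitArgs(settings).grouped(2).foreach(pair => println(pair.mkString(" ")))
}
```
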
## Simplified Core Dump Handling

There is rudimentary support for simplified setup of GPU core dumps in the RAPIDS Accelerator.
This currently works only on Spark standalone clusters, since there is no way for a driver
plugin to programmatically override executor environment variable settings for Spark-on-YARN or
Spark-on-Kubernetes. In the future, with a GPU driver that is compatible with CUDA 12.1 or later,
the RAPIDS Accelerator could leverage GPU driver APIs to programmatically configure GPU core dump
support on executor startup.

To enable the simplified core dump handling, set `spark.rapids.gpu.coreDump.dir` to a directory to
use for GPU core dumps. Distributed filesystem URIs are supported. This leverages named pipes and
background threads to copy the GPU core dump data to the distributed filesystem. Note that anything
that causes early, abrupt termination of the process, such as throwing from a C++ destructor, will
often terminate the process before the dump write can be completed. These abrupt terminations should
be fixed when discovered.
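
The named-pipe-plus-background-thread idea can be illustrated in isolation. The sketch below is not the plugin's implementation: `NamedPipeCopySketch` and `roundTrip` are hypothetical names, a local file stands in for the distributed filesystem, and the main thread stands in for the GPU driver writing the dump. It assumes a POSIX system where the `mkfifo` command is available.

```scala
import java.io.{File, FileInputStream, FileOutputStream}
import java.nio.file.Files
import java.util.concurrent.{Callable, Executors, TimeUnit}

object NamedPipeCopySketch {
  /** Sends `payload` through a new FIFO and returns what the copier thread saved. */
  def roundTrip(payload: Array[Byte]): Array[Byte] = {
    val dir = Files.createTempDirectory("gpu-dump-sketch").toFile
    val pipe = new File(dir, "dump.pipe")
    val dest = new File(dir, "dump.copy")
    // Assumption: the mkfifo command exists on this system.
    val mkfifo = new ProcessBuilder("mkfifo", "-m", "600", pipe.getPath).start()
    require(mkfifo.waitFor(10, TimeUnit.SECONDS) && mkfifo.exitValue() == 0, "mkfifo failed")

    val pool = Executors.newSingleThreadExecutor()
    try {
      // Background copier: opening the FIFO blocks until a writer opens it,
      // then Files.copy reads until the writer closes its end.
      val copied = pool.submit(new Callable[java.lang.Long] {
        override def call(): java.lang.Long = {
          val in = new FileInputStream(pipe)
          try java.lang.Long.valueOf(Files.copy(in, dest.toPath)) finally in.close()
        }
      })
      // Stand-in for the GPU driver writing the dump into the pipe.
      val out = new FileOutputStream(pipe)
      try out.write(payload) finally out.close()
      copied.get(10, TimeUnit.SECONDS)
      Files.readAllBytes(dest.toPath)
    } finally {
      pool.shutdownNow()
      dest.delete()
      pipe.delete()
      dir.delete()
    }
  }

  def main(args: Array[String]): Unit = {
    val bytes = roundTrip("fake core dump".getBytes("UTF-8"))
    println(s"copied ${bytes.length} bytes")
  }
}
```

Because the writer blocks until a reader has the pipe open, the copier thread has to be running before the dump starts, which is why an abrupt process exit can cut the copy short.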
27 changes: 27 additions & 0 deletions
sql-plugin-api/src/main/scala/com/nvidia/spark/rapids/GpuCoreDumpMsg.scala
@@ -0,0 +1,27 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.nvidia.spark.rapids

trait GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump starts */
case class GpuCoreDumpMsgStart(executorId: String, dumpPath: String) extends GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump completes */
case class GpuCoreDumpMsgCompleted(executorId: String, dumpPath: String) extends GpuCoreDumpMsg

/** Serialized message sent from executor to driver when a GPU core dump fails */
case class GpuCoreDumpMsgFailed(executorId: String, error: String) extends GpuCoreDumpMsg
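
A receiver of these messages typically dispatches on the concrete case class. The sketch below uses hypothetical stand-in types (`DumpMsg`, `DumpStart`, `DumpCompleted`, `DumpFailed`) rather than the classes above, and marking the trait `sealed` is an assumption made here so that the compiler can check the match for exhaustiveness. It is one possible shape, not the commit's design.

```scala
object CoreDumpMsgSketch {
  // Hypothetical stand-ins for the message types; `sealed` is an assumption of this sketch.
  sealed trait DumpMsg
  final case class DumpStart(executorId: String, dumpPath: String) extends DumpMsg
  final case class DumpCompleted(executorId: String, dumpPath: String) extends DumpMsg
  final case class DumpFailed(executorId: String, error: String) extends DumpMsg

  /** Turns a message into the log line a driver-side handler might emit. */
  def describe(msg: DumpMsg): String = msg match {
    case DumpStart(id, path) => s"Executor $id starting a GPU core dump to $path"
    case DumpCompleted(id, path) => s"Executor $id wrote a GPU core dump to $path"
    case DumpFailed(id, error) => s"Executor $id failed to write a GPU core dump: $error"
  }

  def main(args: Array[String]): Unit = {
    println(describe(DumpStart("3", "/tmp/dumps/gpucore-app-3.nvcudmp")))
    println(describe(DumpFailed("3", "java.io.IOException: disk full")))
  }
}
```
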
194 changes: 194 additions & 0 deletions
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuCoreDumpHandler.scala
@@ -0,0 +1,194 @@
/*
 * Copyright (c) 2023, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package com.nvidia.spark.rapids

import java.io.{File, PrintWriter}
import java.lang.management.ManagementFactory
import java.nio.file.Files
import java.util.concurrent.{Executors, ExecutorService, TimeUnit}

import com.nvidia.spark.rapids.Arm.{closeOnExcept, withResource}
import com.nvidia.spark.rapids.shims.NullOutputStreamShim
import org.apache.commons.io.IOUtils
import org.apache.commons.io.output.StringBuilderWriter
import org.apache.hadoop.fs.Path
import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.PluginContext
import org.apache.spark.internal.Logging
import org.apache.spark.io.CompressionCodec
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.rapids.execution.TrampolineUtil
import org.apache.spark.util.SerializableConfiguration

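object GpuCoreDumpHandler extends Logging {
  private var executor: Option[ExecutorService] = None
  private var dumpedPath: Option[String] = None
  private var namedPipeFile: File = _
  // Written by the copy thread and polled by waitForDump from other threads,
  // so it is volatile to guarantee the update is visible to the waiter.
  @volatile private var isDumping: Boolean = false
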
  /**
   * Configures the executor launch environment for GPU core dumps, if applicable.
   * Should only be called from the driver on driver startup.
   */
  def driverInit(sc: SparkContext, conf: RapidsConf): Unit = {
    // This only works in practice on Spark standalone clusters. It's too late to influence the
    // executor environment for Spark-on-YARN or Spark-on-k8s.
    // TODO: Leverage CUDA 12.1 core dump APIs in the executor to programmatically set this up
    // on executor startup. https://github.com/NVIDIA/spark-rapids/issues/9370
    conf.gpuCoreDumpDir.foreach { _ =>
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_COREDUMP_ON_EXCEPTION", "1")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_CPU_COREDUMP_ON_EXCEPTION", "0")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_ENABLE_LIGHTWEIGHT_COREDUMP",
        if (conf.isGpuCoreDumpFull) "0" else "1")
      TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_FILE", conf.gpuCoreDumpPipePattern)
      TrampolineUtil.setExecutorEnv(sc, "CUDA_COREDUMP_SHOW_PROGRESS", "1")
    }
  }

  /**
   * Sets up the GPU core dump background copy thread, if applicable.
   * Should only be called from the executor on executor startup.
   */
  def executorInit(rapidsConf: RapidsConf, pluginCtx: PluginContext): Unit = {
    rapidsConf.gpuCoreDumpDir.foreach { dumpDir =>
      namedPipeFile = createNamedPipe(rapidsConf)
      executor = Some(Executors.newSingleThreadExecutor(new ThreadFactoryBuilder()
        .setNameFormat("gpu-core-copier")
        .setDaemon(true)
        .build()))
      executor.foreach { exec =>
        val codec = if (rapidsConf.isGpuCoreDumpCompressed) {
          Some(TrampolineUtil.createCodec(pluginCtx.conf(),
            rapidsConf.gpuCoreDumpCompressionCodec))
        } else {
          None
        }
        val suffix = codec.map { c =>
          "." + TrampolineUtil.getCodecShortName(c.getClass.getName)
        }.getOrElse("")
        exec.submit(new Runnable {
          override def run(): Unit = {
            try {
              copyLoop(pluginCtx, namedPipeFile, new Path(dumpDir), codec, suffix)
            } catch {
              case _: InterruptedException => logInfo("Stopping GPU core dump copy thread")
              case t: Throwable => logWarning("Error in GPU core dump copy thread", t)
            }
          }
        })
      }
    }
  }

  /**
   * Wait for a GPU dump in progress, if any, to complete
   * @param timeoutSecs maximum amount of time to wait before returning
   * @return true if the wait finished before the timeout elapsed, false if it timed out
   */
  def waitForDump(timeoutSecs: Int): Boolean = {
    val endTime = System.nanoTime + TimeUnit.SECONDS.toNanos(timeoutSecs)
    while (isDumping && System.nanoTime < endTime) {
      Thread.sleep(10)
    }
    System.nanoTime < endTime
  }

  def shutdown(): Unit = {
    executor.foreach { exec =>
      exec.shutdownNow()
      executor = None
      namedPipeFile.delete()
      namedPipeFile = null
    }
  }

  def handleMsg(msg: GpuCoreDumpMsg): AnyRef = msg match {
    case GpuCoreDumpMsgStart(executorId, dumpPath) =>
      logError(s"Executor $executorId starting a GPU core dump to $dumpPath")
      val spark = SparkSession.active
      new SerializableConfiguration(spark.sparkContext.hadoopConfiguration)
    case GpuCoreDumpMsgCompleted(executorId, dumpPath) =>
      logError(s"Executor $executorId wrote a GPU core dump to $dumpPath")
      null
    case GpuCoreDumpMsgFailed(executorId, error) =>
      logError(s"Executor $executorId failed to write a GPU core dump: $error")
      null
    case m =>
      throw new IllegalStateException(s"Unexpected GPU core dump msg: $m")
  }

  // visible for testing
  def getNamedPipeFile: File = namedPipeFile

  private def createNamedPipe(conf: RapidsConf): File = {
    val processName = ManagementFactory.getRuntimeMXBean.getName
    val pidstr = processName.substring(0, processName.indexOf("@"))
    val pipePath = conf.gpuCoreDumpPipePattern.replace("%p", pidstr)
    val pipeFile = new File(pipePath)
    val mkFifoProcess = Runtime.getRuntime.exec(Array("mkfifo", "-m", "600", pipeFile.toString))
    require(mkFifoProcess.waitFor(10, TimeUnit.SECONDS), "mkfifo timed out")
    pipeFile.deleteOnExit()
    pipeFile
  }

  private def copyLoop(
      pluginCtx: PluginContext,
      namedPipe: File,
      dumpDirPath: Path,
      codec: Option[CompressionCodec],
      suffix: String): Unit = {
    val executorId = pluginCtx.executorID()
    try {
      logInfo(s"Monitoring ${namedPipe.getAbsolutePath} for GPU core dumps")
      withResource(new java.io.FileInputStream(namedPipe)) { in =>
        isDumping = true
        val appId = pluginCtx.conf.get("spark.app.id")
        val dumpPath = new Path(dumpDirPath,
          s"gpucore-$appId-$executorId.nvcudmp$suffix")
        logError(s"Generating GPU core dump at $dumpPath")
        val hadoopConf = pluginCtx.ask(GpuCoreDumpMsgStart(executorId, dumpPath.toString))
          .asInstanceOf[SerializableConfiguration].value
        val dumpFs = dumpPath.getFileSystem(hadoopConf)
        val bufferSize = hadoopConf.getInt("io.file.buffer.size", 4096)
        val perms = new FsPermission(FsAction.READ_WRITE, FsAction.NONE, FsAction.NONE)
        val fsOut = dumpFs.create(dumpPath, perms, false, bufferSize,
          dumpFs.getDefaultReplication(dumpPath), dumpFs.getDefaultBlockSize(dumpPath), null)
        val out = closeOnExcept(fsOut) { _ =>
          codec.map(_.compressedOutputStream(fsOut)).getOrElse(fsOut)
        }
        withResource(out) { _ =>
          IOUtils.copy(in, out)
        }
        dumpedPath = Some(dumpPath.toString)
        pluginCtx.send(GpuCoreDumpMsgCompleted(executorId, dumpedPath.get))
      }
    } catch {
      case e: Exception =>
        logError("Error copying GPU dump", e)
        val writer = new StringBuilderWriter()
        e.printStackTrace(new PrintWriter(writer))
        pluginCtx.send(GpuCoreDumpMsgFailed(executorId, s"$e\n${writer.toString}"))
    } finally {
      isDumping = false
    }
    // Always drain the pipe to avoid blocking the thread that triggers the coredump
    while (namedPipe.exists()) {
      Files.copy(namedPipe.toPath, NullOutputStreamShim.INSTANCE)
    }
  }
}
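
The deadline-polling pattern used by `waitForDump` can be tried out on its own. The sketch below is a simplified variant under stated assumptions, not the handler's code: `DeadlineWaitSketch`, `setBusy`, and `waitUntilIdle` are hypothetical names, the flag is declared `@volatile` because it is shared between threads, and the method reports the flag state directly rather than comparing against the clock after the loop.

```scala
import java.util.concurrent.TimeUnit

object DeadlineWaitSketch {
  // Set by a background worker and read by the waiting thread, hence volatile.
  @volatile private var busy: Boolean = false

  def setBusy(value: Boolean): Unit = { busy = value }

  /**
   * Polls until the flag clears or the timeout elapses.
   * @return true if the flag cleared before the deadline, false if the wait timed out
   */
  def waitUntilIdle(timeoutMillis: Long): Boolean = {
    val deadline = System.nanoTime + TimeUnit.MILLISECONDS.toNanos(timeoutMillis)
    // Comparing the difference against zero stays correct if nanoTime wraps around.
    while (busy && System.nanoTime - deadline < 0) {
      Thread.sleep(10)
    }
    !busy
  }

  def main(args: Array[String]): Unit = {
    setBusy(true)
    val worker = new Thread(new Runnable {
      override def run(): Unit = { Thread.sleep(100); setBusy(false) }
    })
    worker.setDaemon(true)
    worker.start()
    println(s"finished in time: ${waitUntilIdle(5000)}")
  }
}
```

Returning the flag state keeps the result meaningful even when the timeout is zero, at the cost of differing from a purely clock-based result when the flag clears right at the deadline.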