Fix GpuShuffleCoalesce op time metric doesn't include concat batch time #5950

Merged 2 commits on Jul 7, 2022
docs/tuning-guide.md (7 changes: 5 additions & 2 deletions)
@@ -303,7 +303,7 @@ Custom Spark SQL Metrics are available which can help identify performance bottlenecks
| buildTime | build time | Time to load the build-side of a join. |
| collectTime | collect time | For a broadcast the amount of time it took to collect the broadcast data back to the driver before broadcasting it back out. |
| computeAggTime | aggregation time | Time computing an aggregation. |
| concatTime | concat batch time | Time to concatenate batches. |
| concatTime | concat batch time | Time to concatenate batches. Runs on CPU. |
| copyBufferTime | copy buffer time | Time spent on copying upstreaming data into Rapids buffers. |
| filterTime | filter time | Time spent applying filters within other operators, such as joins. |
| gpuDecodeTime | GPU decode time | Time spent on GPU decoding encrypted or compressed data. |
@@ -314,7 +314,7 @@ Custom Spark SQL Metrics are available which can help identify performance bottlenecks
| numOutputBatches | output columnar batches | Number of columnar batches that the operator outputs. |
| numOutputRows | output rows | Number of rows that the operator outputs. |
| numPartitions | partitions | Number of output partitions from a file scan or shuffle exchange. |
| opTime | op time | Time that an operator takes, exclusive of the time for executing or fetching results from child operators, and typically outside of the time it takes to acquire the GPU semaphore. |
| opTime | op time | Time that an operator takes, exclusive of the time for executing or fetching results from child operators, and typically outside of the time it takes to acquire the GPU semaphore. <br/> Note: may sometimes include CPU time, e.g. concatTime |
| partitionSize | partition data size | Total size in bytes of output partitions. |
| peakDevMemory | peak device memory | Peak GPU memory used during execution of an operator. |
| semaphoreWaitTime | GPU semaphore wait time | Time spent waiting for the GPU semaphore. |
@@ -341,6 +341,9 @@ How long does operator X take on the GPU vs the CPU?

### Time taken on the GPU

`opTime` mainly conveys the GPU time.
If a GPU operator also performs some work on the CPU, the GPU time is `opTime` minus the CPU time, e.g.
`opTime` - `concatTime`.
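As a purely illustrative aside (not part of the tuning guide diff), a back-of-the-envelope estimate with made-up metric values might look like this Scala sketch:

```scala
// Hypothetical metric values, for illustration only.
object EstimateGpuTime {
  def main(args: Array[String]): Unit = {
    val opTimeNs     = 10L * 1000 * 1000 * 1000 // "op time" from the SQL metrics (10 s)
    val concatTimeNs = 2L * 1000 * 1000 * 1000  // "concat batch time", CPU-side work (2 s)
    val gpuTimeNs    = opTimeNs - concatTimeNs  // approximate time actually spent on the GPU
    println(f"Approximate GPU time: ${gpuTimeNs / 1e9}%.1f s")
  }
}
```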
Nearly all GPU operators will have an `op time` metric. This metric times how long a given
operation took to complete on the GPU separate from anything upstream or downstream of the
operator. By looking at the `op time` for each operator you should be able to get a feeling of
@@ -204,12 +204,20 @@ class GpuShuffleCoalesceIterator(iter: Iterator[HostConcatResult],
throw new NoSuchElementException("No more columnar batches")
}
withResource(new NvtxRange("Concat+Load Batch", NvtxColor.YELLOW)) { _ =>
withResource(iter.next()) { hostConcatResult =>
val hostConcatResult = withResource(new MetricRange(opTimeMetric)) { _ =>
// op time covers concat time performed in `iter.next()`.
// Note the concat runs on CPU.
// GPU time = opTime - concatTime
iter.next()
}

withResource(hostConcatResult) { _ =>
// We acquire the GPU regardless of whether `hostConcatResult`
// is an empty batch or not, because the downstream tasks expect
// the `GpuShuffleCoalesceIterator` to acquire the semaphore and may
// generate GPU data from batches that are empty.
GpuSemaphore.acquireIfNecessary(TaskContext.get(), semWaitTime)

withResource(new MetricRange(opTimeMetric)) { _ =>
val batch = cudf_utils.HostConcatResultUtil.getColumnarBatch(hostConcatResult, dataTypes)
outputBatchesMetric += 1
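The diff above relies on the plugin's `withResource`/`MetricRange` helpers. As a rough, self-contained illustration of the same timing idea (the `SimpleMetric` and `timed` names are hypothetical and are not the spark-rapids API), wrapping CPU-side work in a timing scope so that its duration is added to the op time metric might look like:

```scala
// Minimal sketch, not spark-rapids code: accumulate the elapsed time of a block
// of work into a metric, similar in spirit to wrapping work in a MetricRange.
import java.util.concurrent.atomic.AtomicLong

class SimpleMetric(val name: String) {
  private val totalNs = new AtomicLong(0L)
  def add(ns: Long): Unit = totalNs.addAndGet(ns)
  def value: Long = totalNs.get()
}

object Timed {
  // Run `body`, add its wall-clock duration to `metric`, and return its result.
  def timed[T](metric: SimpleMetric)(body: => T): T = {
    val start = System.nanoTime()
    try body finally metric.add(System.nanoTime() - start)
  }
}

object Example extends App {
  val opTime = new SimpleMetric("op time")
  // The CPU-side work (here just a stand-in computation) is timed so that
  // "op time" covers it, matching the intent of this change for concatenation.
  val result = Timed.timed(opTime) {
    (1 to 10000).sum
  }
  println(s"result=$result, ${opTime.name}=${opTime.value} ns")
}
```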