
[BUG] The metric "total time" of GpuColumnarExchange is strange #952

Closed
GaryShen2008 opened this issue Oct 15, 2020 · 3 comments · Fixed by #973
@GaryShen2008 (Collaborator)
Describe the bug
When using the legacy shuffle, the SQL plan shows the metrics below.
[screenshot: SQL plan showing GpuColumnarExchange metrics]
The "total time total" value is less than "shuffle write time" + "fetch wait time", which may confuse users.

Steps/Code to reproduce bug
Run any ETL app with the legacy shuffle.

Expected behavior
Make the metrics name correct or find a way to show the correct value of total time.

Environment details (please complete the following information)

  • Environment location: Standalone (any cluster type can likely reproduce this).
  • Spark configuration settings related to the issue: none.

Additional context
None

@GaryShen2008 GaryShen2008 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Oct 15, 2020
@JustPlay
Some questions about metrics in the scan node:

GpuScan parquet data_3t.store_sales

number of files read: 1,824
scan time total (min, med, max (stageId: taskId))
1.03 h (676 ms, 3.1 s, 11.1 s (stage 3.0: task 4092))
number of output columnar batches: 1,852
peak device memory total (min, med, max (stageId: taskId))
665.0 GiB (270.4 MiB, 616.6 MiB, 1599.0 MiB (stage 3.0: task 3833))
GPU decode time total (min, med, max (stageId: taskId))
6.1 m (76 ms, 297 ms, 2.9 s (stage 3.0: task 3967))
metadata time: 2.9 s
size of files read: 1536.6 GiB
number of output rows: 28,799,975,831
total time total (min, med, max (stageId: taskId))
54.4 m (530 ms, 2.7 s, 10.6 s (stage 3.0: task 4177))
number of partitions read: 1,824
buffer time total (min, med, max (stageId: taskId))
49.8 m (389 ms, 2.1 s, 22.3 s (stage 3.0: task 4662))

For the scan node:

  1. What do scan time, GPU decode time, total time, and buffer time mean?
  2. Why is total_time < scan_time?

Thanks

@andygrove andygrove self-assigned this Oct 16, 2020
@andygrove (Contributor)
@GaryShen2008 I looked into this and I agree that the value we are currently reporting for total time for shuffles is confusing (it is actually measuring the time to create an internal iterator, which is misleading). Given that Spark doesn't report this metric for shuffles, I propose that we remove this metric. I will create a PR.

@jlowe (Member) commented Oct 16, 2020

@JustPlay apologies for the delay, here are some answers to your questions.

what scan time, GPU decode time, total time, buffer time mean?

  • scan time is a standard Spark metric measuring the time it takes the underlying FileScanRDD to produce the batches, i.e.: the time needed to scan the input source. See https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L484-L492 for code details.
  • GPU decode time measures the time spent waiting for the libcudf call to complete. This measures the amount of time it took the GPU to decode the data once it was loaded into host memory.
  • total time is a metric added to many GPU exec nodes measuring the time spent processing in that node. In the case of scan nodes, it's fairly redundant with scan time metric.
  • buffer time measures the amount of time spent loading the data from the distributed filesystem into host memory. Note that when multithreaded reads are enabled via spark.rapids.sql.format.parquet.multiThreadedRead.enabled then you can see more time spent in buffer time than is reflected in the wall-clock time measured by scan time and total time. This is similar to how user time can exceed real time when using the time command to measure commands.
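To make the iterator-driven metrics above concrete, here is a minimal, hypothetical sketch (not the actual Spark or plugin classes) of how a metric like scan time accumulates: a wrapper times each `hasNext()` call on the underlying iterator and adds the elapsed nanoseconds to a counter.

```java
import java.util.Iterator;
import java.util.List;

public class TimedIteratorSketch {
    // Stand-in for Spark's SQLMetric accumulator; name is illustrative.
    static class MetricStub {
        long valueNs = 0;
        void add(long ns) { valueNs += ns; }
    }

    // Wrap an iterator so the wall-clock time spent inside hasNext()
    // is accumulated into the metric, as the scan-time metric does
    // around the underlying FileScanRDD iterator.
    static <T> Iterator<T> timed(Iterator<T> underlying, MetricStub metric) {
        return new Iterator<T>() {
            @Override public boolean hasNext() {
                long start = System.nanoTime();
                boolean result = underlying.hasNext();
                metric.add(System.nanoTime() - start);
                return result;
            }
            @Override public T next() { return underlying.next(); }
        };
    }

    // Drain the wrapped iterator, charging hasNext() time to the metric.
    static int consumeAndSum(MetricStub metric) {
        Iterator<Integer> it = timed(List.of(1, 2, 3).iterator(), metric);
        int sum = 0;
        while (it.hasNext()) sum += it.next();
        return sum;
    }

    public static void main(String[] args) {
        MetricStub scanTime = new MetricStub();
        System.out.println(consumeAndSum(scanTime)); // 6
        System.out.println(scanTime.valueNs >= 0);   // time was accumulated
    }
}
```

Note this also illustrates the buffer-time caveat: if several threads each accumulate into one such counter concurrently, the metric total can exceed wall-clock elapsed time, like user time exceeding real time under `time`.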

why total_time < scan_time?

The latter is a standard Spark metric for scans that is measuring time at a very high level for the node, i.e.: the hasNext method of doExecuteColumnar. The former is a metric added by the plugin that is measuring a strict subset of what scan time measures. In practice they normally should be very close to each other and thus having both metrics isn't adding a lot of value. I noticed that total time is not measuring the time it takes to acquire the GPU semaphore when a task has an empty batch (e.g.: there are no row groups to read in Parquet when processing an input split, so it hasn't grabbed the semaphore as part of normal processing). That may be the reason your scan time and total time aren't closer together.
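The subset relationship above can be sketched in a few lines (all names hypothetical, timings simulated): an outer timer covers the whole batch-producing path, while an inner timer covers only part of it (e.g. it misses a semaphore wait), so the inner metric necessarily accumulates less.

```java
public class NestedMetricSketch {
    static long scanTimeNs = 0;   // outer metric: whole hasNext() path
    static long totalTimeNs = 0;  // inner metric: subset of the work

    // Busy-wait stand-in for real work (I/O, semaphore wait, decode).
    static void simulateWork(long ns) {
        long start = System.nanoTime();
        while (System.nanoTime() - start < ns) { /* spin */ }
    }

    static void produceBatch() {
        long outerStart = System.nanoTime();
        simulateWork(1_000_000);  // e.g. semaphore wait: timed by outer only
        long innerStart = System.nanoTime();
        simulateWork(2_000_000);  // processing: timed by both metrics
        totalTimeNs += System.nanoTime() - innerStart;
        scanTimeNs  += System.nanoTime() - outerStart;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) produceBatch();
        // Inner metric is a strict subset of the outer one.
        System.out.println(totalTimeNs < scanTimeNs); // true
    }
}
```

Any work the inner timer skips (here the 1 ms "semaphore wait") shows up only in the outer metric, which is the gap described above.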

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 20, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#952)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>